==============================================
#opendaylight-clustering: “clustering hackers”
==============================================

Meeting started by moizr at 16:06:59 UTC. The full logs are available at
http://meetings.opendaylight.org/opendaylight-clustering/2014/_clustering_hackers_/opendaylight-clustering-_clustering_hackers_.2014-11-11-16.06.log.html

Meeting summary
---------------

* agenda (tbachman, 16:07:39)
* bug list review (tbachman, 16:08:01)
* moizr says that the problems found during stable/helium update 1 are with transaction chains, which will be seen with openflow applications (tbachman, 16:09:06)
* moizr says the fixes for these are in, and the integration tests in the ODL infrastructure are working much better (down to about 5-6 failures) (tbachman, 16:10:09)
* the goal is to improve the testing for 3-node clustering, and have this hardened in time for SR2 (tbachman, 16:10:30)
* The SR2 date is 1/12/2015 (tbachman, 16:11:20)
* moizr’s goal is to have an automated suite that can be built on.
  (tbachman, 16:12:18)
* colindixon says that we can ask the integration team to run their tests with clustering enabled (tbachman, 16:12:36)
* pantelis asks when we can start pushing patches again (tbachman, 16:13:50)
* colindixon says we need to push the release and version bump patches before pushing updates to stable/helium (tbachman, 16:14:09)
* Clustering openflow integration tests https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/ (moizr, 16:14:45)
* pantelis asks if there will be an email indicating that it’s okay to push patches (tbachman, 16:16:13)
* LINK: https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/ the cluster integration tests (colindixon, 16:17:00)
* watch this patch to know when to merge to stable/helium (moizr, 16:17:10)
* LINK: https://git.opendaylight.org/gerrit/#/c/12711 patch that needs to merge before pushing additional patches to stable/helium (tbachman, 16:17:36)
* team reviews bugzilla to cover the in-progress patches (tbachman, 16:19:10)
* The clustering test app wasn’t including odl-restconf, so there’s now a patch for this, and the deployment script has been changed to run the clustering test app so that anyone can run these tests.
  (tbachman, 16:21:34)
* BUG 2284 (at startup no leader is elected yet) has been fixed in gerrit 12215, but needs testing (tbachman, 16:25:00)
* BUG 2302 has been fixed by gerrit 12705 (tbachman, 16:26:27)
* The exception in BUG 2320 may not be a real problem, or at least not something unique to clustering (caused by two apps trying to write the same thing) (tbachman, 16:28:09)
* BUG 2327 is being taken on by ttkacik (tbachman, 16:28:44)
* BUG 2335 is not a clustering bug per se, but we do want/need to add a feature to address this (tbachman, 16:29:47)
* moizr hasn’t seen bug reports from those who are testing, and encourages testers/users to err on the side of filing something that may not be a bug rather than not filing at all (tbachman, 16:30:41)
* 2-Node Deployment Design (tbachman, 16:31:45)
* In Li, trying to go to an active/standby setup (tbachman, 16:32:04)
* For active/standby 2-node, you need a specific topology in order to have High Availability (tbachman, 16:32:24)
* one of the controllers will be the primary (configured or elected, TBD), which will be the leader of all the shards and master of all the devices on the network (tbachman, 16:32:50)
* With network partitioning, there are cases where we need to be able to either work with the devices we have or not manage the network (tbachman, 16:33:18)
* edwarnicke asks why other configurations would be precluded (tbachman, 16:34:09)
* markmozolewski2 says that this is for Li, and to do otherwise would impose other requirements like finer grained sharding (tbachman, 16:34:49)
* edwarnicke just wanted to make sure there was sufficient “architectural white space” to support other uses in the future (tbachman, 16:35:27)
* There are 3 major areas for changes: Raft sharding and leader election; post-healing leader with dynamic shard and cluster configuration; having a NB IP alias for the team so that apps can contact one controller in the team (tbachman, 16:36:56)
* Opting to provide hooks in code to influence
  leader election, allowing a different strategy for 2-node operation (tbachman, 16:37:59)
* LINK: https://git.opendaylight.org/gerrit/#/c/12588/ Gerrit that implements this hook (tbachman, 16:38:50)
* active/active cases are also under discussion, but not a goal for Li (tbachman, 16:40:02)
* Question on what the expected recovery time is for partitions (tbachman, 16:41:31)
* moizr says that things are broken up into small chunks (~2MB) and transferred. The recovery time is based on the last state and how much data remains to be synced. (tbachman, 16:42:10)
* For Data Center use cases, the recovery time needs to be short; can be longer for service providers (minutes) (tbachman, 16:42:56)
* dandrushko asks if there’s anything their team can contribute to clustering (tbachman, 16:44:25)
* moizr says the biggest need right now is testing (tbachman, 16:44:33)
* moizr says installing the following features should be sufficient: odl-dlux-all; odl-restconf-no-auth; odl-mdsal-clustering; and odl-openflowplugin-flow-services (tbachman, 16:46:20)
* dandrushko says they will try the integration test against their local environment (tbachman, 16:46:48)
* moizr recommends building against master, as it has the post-SR1 patches (tbachman, 16:47:28)
* Question on persistence: is this available without clustering? moizr says you need clustering for persistence (tbachman, 16:48:06)
* Alexander Bochkarev asks what the status of gerrit 12053 is (tbachman, 16:49:10)
* dandrushko asks if there’s anything they can help with here (tbachman, 16:49:38)
* edwarnicke asks if it’s possible to configure clustering with a 1-node cluster?
  (tbachman, 16:50:08)
* moizr says yes, and this would give you persistence (tbachman, 16:50:20)
* dandrushko says this feature is unstable in the stable/helium release, and asks if it will be stable in SR1 or SR2 (tbachman, 16:50:59)
* moizr says this will be fixed for SR2 (tbachman, 16:51:12)
* moizr asks dandrushko to build and test this using master and see if that fixes their issues (tbachman, 16:52:20)
* Question on how many nodes clustering can be run on (tbachman, 16:54:08)
* moizr says we’re testing on 3 nodes now, but there’s no limit (tbachman, 16:54:41)

Meeting ended at 16:57:36 UTC.

People present (lines said)
---------------------------

* tbachman (64)
* moizr (7)
* odl_meetbot (7)
* rexpugh (1)
* colindixon (1)
* tbackman (0)

Generated by `MeetBot`_ 0.1.4