16:06:59 <moizr> #startmeeting “clustering hackers”
16:06:59 <odl_meetbot> Meeting started Tue Nov 11 16:06:59 2014 UTC. The chair is moizr. Information about MeetBot at http://ci.openstack.org/meetbot.html.
16:06:59 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:06:59 <odl_meetbot> The meeting name has been set to '_clustering_hackers_'
16:07:07 <moizr> #chair tbackman
16:07:07 <odl_meetbot> Warning: Nick not in channel: tbackman
16:07:07 <odl_meetbot> Current chairs: moizr tbackman
16:07:10 <tbachman> lol
16:07:12 <tbachman> tbachman
16:07:20 <tbachman> you can do tb followed by tab
16:07:23 <tbachman> it will tab out the name
16:07:25 <moizr> #chair tbachman
16:07:25 <odl_meetbot> Current chairs: moizr tbachman tbackman
16:07:29 <tbachman> thx :)
16:07:39 <tbachman> #topic agenda
16:08:01 <tbachman> #info bug list review
16:09:06 <tbachman> #info moizr says that the problems found during stable/helium update 1 are with transaction chains, which will be seen with openflow applications
16:10:09 <tbachman> #info moizr says the fixes for these are in, and the integration tests in the ODL infrastructure are working much better (down to about 5-6 failures)
16:10:30 <tbachman> #info the goal is to improve the testing for 3-node clustering, and have this hardened in time for SR2
16:11:20 <tbachman> #info The SR2 date is 1/12/2015
16:12:18 <tbachman> #info moizr’s goal is to have an automated suite that can be built on.
16:12:36 <tbachman> #info colindixon says that we can ask the integration team to run their tests with clustering enabled
16:12:40 <tbachman> moizr: can you put it here?
16:12:41 <tbachman> thx
16:13:20 <tbachman> :)
16:13:50 <tbachman> #info pantelis asks when we can start pushing patches again
16:14:09 <tbachman> #info colindixon says we need to push the release and version bump patches before pushing updates to stable/helium
16:14:45 <moizr> #info Clustering openflow integration tests https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/
16:16:13 <tbachman> #info pantelis asks if there will be an email indicating that it’s okay to push patches
16:17:00 <colindixon> #link https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/ the cluster integration tests
16:17:06 <tbachman> colindixon: thx!
16:17:10 <moizr> #info watch this patch to know when to merge to stable/helium
16:17:15 <moizr> #link https://git.opendaylight.org/gerrit/#/c/12711
16:17:20 <tbachman> #undo
16:17:20 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Link object at 0x2739a10>
16:17:36 <tbachman> #link https://git.opendaylight.org/gerrit/#/c/12711 patch that needs to merge before pushing additional patches to stable/helium
16:17:46 <tbachman> moizr: just like to put the comment with the link ;)
16:17:47 <tbachman> thx tho!
16:17:51 <moizr> thanks :)
16:19:10 <tbachman> #info team reviews bugzilla to cover the in-progress patches
16:21:34 <tbachman> #info The clustering test app wasn’t including odl-restconf, so there’s now a patch for this, and the deployment script has been changed to run the clustering test app so that anyone can run these tests.
16:25:00 <tbachman> #info BUG 2284 (at startup no leader is elected yet) has been fixed in gerrit 12215, but needs testing
16:26:27 <tbachman> #info BUG 2302 has been fixed by gerrit 12705
16:28:09 <tbachman> #info The exception in BUG 2320 may not be a real problem, or at least not something unique to clustering (caused by two apps trying to write the same thing)
16:28:44 <tbachman> #info BUG 2327 is being taken on by ttkacik
16:29:47 <tbachman> #info BUG 2335 is not a clustering bug per se, but we do want/need to add a feature to address this
16:30:41 <tbachman> #info moizr hasn’t seen bug reports from those who are testing, and encourages testers/users to err on the side of filing something that may not be a bug rather than not filing one at all
16:30:57 <tbachman> topic?
16:31:45 <tbachman> #topic 2-Node Deployment Design
16:31:49 <rexpugh> can you go full screen?
16:32:04 <tbachman> #info In Li, trying to go to an active/standby setup
16:32:24 <tbachman> #info For active/standby 2-node, you need a specific topology in order to have High Availability
16:32:50 <tbachman> #info one of the controllers will be the primary (configured or elected, tbd), which will be the leader of all the shards and master of all the devices on the network
16:33:18 <tbachman> #info There are cases where there is network partitioning where we need to be able to work with what devices we have or not manage the network
16:34:09 <tbachman> #info edwarnicke asks why other configurations would be precluded
16:34:49 <tbachman> #info markmozolewski2 says that this is for Li, and to do otherwise would impose other requirements like finer grained sharding
16:35:27 <tbachman> #info edwarnicke just wanted to make sure there was sufficient “architectural white space” to support other uses in the future
16:36:56 <tbachman> #info There are 3 major areas for changes: Raft sharding and leader election; post-healing leader with dynamic shard and cluster configuration; having a NB IP alias for the team so that apps can contact one controller in the team
16:37:59 <tbachman> #info Opting to provide hooks in code to influence leader election, allowing a different strategy for 2-node operation
16:38:50 <tbachman> #link https://git.opendaylight.org/gerrit/#/c/12588/ Gerrit that implements this hook
16:40:02 <tbachman> #info active/active cases are also under discussion, but not a goal for Li
16:41:31 <tbachman> #info Question on what the expected recovery time is for partitions
16:42:10 <tbachman> #info moizr says that things are broken up into small chunks (~2MB) and transferred. The recovery time is based on the last state and how much data is remaining to be synched.
16:42:56 <tbachman> #info For Data Center use cases, the recovery time needs to be short; can be longer for service providers (minutes)
16:44:25 <tbachman> #info dandrushko asks if there’s anything their team can contribute to clustering
16:44:33 <tbachman> #info moizr says the biggest need right now is testing
16:46:20 <tbachman> #info moizr says installing the following features should be sufficient: odl-dlux-all; odl-restconf-no-auth; odl-mdsal-clustering; and odl-openflowplugin-flow-services
16:46:48 <tbachman> #info dandrushko says they will try the integration test against their local environment
16:47:28 <tbachman> #info moizr recommends building against master, as it has the post-SR1 patches
16:48:06 <tbachman> #info question on persistence: is this available w/o clustering? moizr says you need clustering for persistence
16:49:10 <tbachman> #info Alexander Bochkarev asks what the status of gerrit 12053 is
16:49:38 <tbachman> #info dandrushko asks if there’s anything they can help with here
16:50:08 <tbachman> #info edwarnicke asks if it’s possible to configure clustering with a 1-node cluster?
16:50:20 <tbachman> #info moizr says yes, and this would give you persistence
16:50:59 <tbachman> #info dandrushko says this feature is unstable in the stable/helium release, and asks if it will be stable in SR1 or SR2
16:51:12 <tbachman> #info moizr says this will be fixed for SR2
16:52:20 <tbachman> #info moizr asks dandrushko to build and test this using master and see if that fixes their issues
16:54:08 <tbachman> #info Question on how many nodes clustering can be run on
16:54:41 <tbachman> #info moizr says we’re testing on 3 nodes now, but there’s no limit
16:57:36 <tbachman> #endmeeting
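
Note on the 16:42:10 item (chunked transfer, ~2MB chunks): the sketch below is only an illustration of that idea and is not taken from the OpenDaylight code base. It assumes a fixed chunk size of 2 MiB and a simple "last acknowledged chunk index" as the measure of how much data remains to be synched. The names CHUNK_SIZE, split_into_chunks, chunks_remaining, and reassemble are hypothetical; the log does not say how the clustering code actually frames or acknowledges chunks.

```python
from __future__ import annotations

# Illustrative only: not the OpenDaylight implementation.
# Assumed chunk size, based on the "~2MB" figure mentioned in the meeting.
CHUNK_SIZE = 2 * 1024 * 1024


def split_into_chunks(snapshot: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split serialized state into fixed-size chunks; the last one may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [snapshot[i:i + chunk_size] for i in range(0, len(snapshot), chunk_size)]


def chunks_remaining(total_chunks: int, last_acked_index: int) -> int:
    """Chunks still to be sent, given the index of the last acknowledged chunk.

    last_acked_index is -1 when nothing has been acknowledged yet.
    """
    return max(total_chunks - (last_acked_index + 1), 0)


def reassemble(chunks: list[bytes]) -> bytes:
    """Join received chunks, in order, back into the original byte string."""
    return b"".join(chunks)
```

Under these assumptions, a node that has already acknowledged most chunks has little left to receive, which matches the remark that recovery time depends on the last state and on how much data is still outstanding.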