#opendaylight-clustering log

16:06:59 <moizr> #startmeeting “clustering hackers”
16:06:59 <odl_meetbot> Meeting started Tue Nov 11 16:06:59 2014 UTC.  The chair is moizr. Information about MeetBot at http://ci.openstack.org/meetbot.html.
16:06:59 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:06:59 <odl_meetbot> The meeting name has been set to '_clustering_hackers_'
16:07:07 <moizr> #chair tbackman
16:07:07 <odl_meetbot> Warning: Nick not in channel: tbackman
16:07:07 <odl_meetbot> Current chairs: moizr tbackman
16:07:10 <tbachman> lol
16:07:12 <tbachman> tbachman
16:07:20 <tbachman> you can do tb followed by tab
16:07:23 <tbachman> it will tab out the name
16:07:25 <moizr> #chair tbachman
16:07:25 <odl_meetbot> Current chairs: moizr tbachman tbackman
16:07:29 <tbachman> thx :)
16:07:39 <tbachman> #topic agenda
16:08:01 <tbachman> #info bug list review
16:09:06 <tbachman> #info moizr says that the problems found during stable/helium update 1 are with transaction chains, which will be seen with openflow applications
16:10:09 <tbachman> #info moizr says the fixes for these are in, and the integration tests in the ODL infrastructure are working much better (down to about 5-6 failures)
16:10:30 <tbachman> #info the goal is to improve the testing for 3-node clustering, and have this hardened in time for SR2
16:11:20 <tbachman> #info The SR2 date is 1/12/2015
16:12:18 <tbachman> #info moizr’s goal is to have an automated suite that can be built on.
16:12:36 <tbachman> #info colindixon says that we can ask the integration team to run their tests with clustering enabled
16:12:40 <tbachman> moizr: can you put it here?
16:12:41 <tbachman> thx
16:13:20 <tbachman> :)
16:13:50 <tbachman> #info pantelis asks when we can start pushing patches again
16:14:09 <tbachman> #info colindixon says we need to push the release and version bump patches before pushing updates to stable/helium
16:14:45 <moizr> #info Clustering openflow integration tests https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/
16:16:13 <tbachman> #info pantelis asks if there will be an email indicating that it’s okay to push patches
16:17:00 <colindixon> #link https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/ the cluster integration tests
16:17:06 <tbachman> colindixon: thx!
16:17:10 <moizr> #info watch this patch to know when to merge to stable/helium
16:17:15 <moizr> #link https://git.opendaylight.org/gerrit/#/c/12711
16:17:20 <tbachman> #undo
16:17:20 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Link object at 0x2739a10>
16:17:36 <tbachman> #link https://git.opendaylight.org/gerrit/#/c/12711 patch that needs to merge before pushing additional patches to stable/helium
16:17:46 <tbachman> moizr: just like to put the comment with the link ;)
16:17:47 <tbachman> thx tho!
16:17:51 <moizr> thanks :)
16:19:10 <tbachman> #info team reviews bugzilla to cover the in-progress patches
16:21:34 <tbachman> #info The clustering test app wasn’t including odl-restconf, so there’s now a patch for this, and the deployment script has been changed to run the clustering test app so that anyone can run these tests.
16:25:00 <tbachman> #info BUG 2284 (at startup no leader is elected yet) has been fixed in gerrit 12215, but needs testing
16:26:27 <tbachman> #info BUG 2302 has been fixed by gerrit 12705
16:28:09 <tbachman> #info The exception in BUG 2320 may not be a real problem, or at least not something unique to clustering (caused by two apps trying to write the same thing)
16:28:44 <tbachman> #info BUG 2327 is being taken on by ttkacik
16:29:47 <tbachman> #info BUG 2335 is not a clustering bug per se, but we do want/need to add a feature to address this
16:30:41 <tbachman> #info moizr hasn’t seen bug reports from those who are testing; he encourages testers/users to err on the side of filing something that may not be a bug rather than not filing at all
16:30:57 <tbachman> topic?
16:31:45 <tbachman> #topic 2-Node Deployment Design
16:31:49 <rexpugh> can you go full screen?
16:32:04 <tbachman> #info In Li, trying to go to an active/standby setup
16:32:24 <tbachman> #info For active/standby 2-node, you need a specific topology in order to have High Availability
16:32:50 <tbachman> #info one of the controllers will be the primary (configured or elected — tbd), which will be the leader of all the shards and master of all the devices on the network
16:33:18 <tbachman> #info There are network partitioning cases where we need to be able to either work with the devices we still have or not manage the network
16:34:09 <tbachman> #info edwarnicke asks why other configurations would be precluded
16:34:49 <tbachman> #info markmozolewski2 says that this is for Li, and to do otherwise would impose other requirements like finer grained sharding
16:35:27 <tbachman> #info edwarnicke just wanted to make sure there was sufficient “architectural white space” to support other uses in the future
16:36:56 <tbachman> #info There are 3 major areas for changes: Raft sharding and leader election; post-healing leader with dynamic shard and cluster configuration; having a NB IP alias for the team so that apps can contact one controller in the team
16:37:59 <tbachman> #info Opting to provide hooks in code to influence leader election, allowing a different strategy for 2-node operation
16:38:50 <tbachman> #link https://git.opendaylight.org/gerrit/#/c/12588/ Gerrit that implements this hook
16:40:02 <tbachman> #info active/active cases are also under discussion, but not a goal for Li
16:41:31 <tbachman> #info Question on what the expected recovery time is for partitions
16:42:10 <tbachman> #info moizr says that things are broken up into small chunks (~2MB) and transferred. The recovery time is based on the last state and how much data is remaining to be synched.
16:42:56 <tbachman> #info For Data Center use cases, the recovery time needs to be short; can be longer for service providers (minutes)
16:44:25 <tbachman> #info dandrushko asks if there’s anything their team can contribute to clustering
16:44:33 <tbachman> #info moizr says the biggest need right now is testing
16:46:20 <tbachman> #info moizr says installing the following features should be sufficient: odl-dlux-all; odl-restconf-no-auth; odl-mdsal-clustering; and odl-openflowplugin-flow-services
16:46:48 <tbachman> #info dandrushko says they will try the integration test against their local environment
16:47:28 <tbachman> #info moizr recommends building against master, as it has the post-SR1 patches
16:48:06 <tbachman> #info question on persistence: is this available w/o clustering? moizr says you need clustering for persistence
16:49:10 <tbachman> #info Alexander Bochkarev asks what the status of gerrit 12053 is
16:49:38 <tbachman> #info dandrushko asks if there’s anything they can help with here
16:50:08 <tbachman> #info edwarnicke asks if it’s possible to configure clustering with a 1-node cluster?
16:50:20 <tbachman> #info moizr says yes, and this would give you persistence
16:50:59 <tbachman> #info dandrushko says this feature is unstable in the stable/helium release, and asks if it will be stable in SR1 or SR2
16:51:12 <tbachman> #info moizr says this will be fixed for SR2
16:52:20 <tbachman> #info moizr asks dandrushko to build and test this using master and see if that fixes their issues
16:54:08 <tbachman> #info Question on how many nodes clustering can be run on
16:54:41 <tbachman> #info moizr says we’re testing on 3 nodes now, but there’s no limit
16:57:36 <tbachman> #endmeeting
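
Editor's note on the 2-Node Deployment Design topic: the minutes say that 2-node operation needs hooks to influence Raft leader election. A likely reason is the standard Raft majority rule, under which a 2-node cluster cannot elect a leader once either node is unreachable. The sketch below only illustrates that arithmetic; it is not taken from the ODL code, and the function names are hypothetical.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half of the cluster."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Members that can be lost while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


if __name__ == "__main__":
    # Compare the cluster sizes mentioned in the meeting (1, 2 and 3 nodes).
    for n in (1, 2, 3):
        print(f"{n} node(s): majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this rule a 3-node cluster survives the loss of one member, while a 2-node cluster survives none, which is consistent with the team wanting a different election strategy for active/standby.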