#opendaylight-clustering: “clustering hackers”

Meeting started by moizr at 16:06:59 UTC (full logs).

Meeting summary

agenda (tbachman, 16:07:39)
1. bug list review (tbachman, 16:08:01)
2. moizr says that the problems found during stable/helium update 1 are with transaction chains, which will be seen with openflow applications (tbachman, 16:09:06)
3. moizr says the fixes for these are in, and the integration tests in the ODL infrastructure are working much better (down to about 5-6 failures) (tbachman, 16:10:09)
4. the goal is to improve the testing for 3-node clustering, and have this hardened in time for SR2 (tbachman, 16:10:30)
5. The SR2 date is 1/12/2015 (tbachman, 16:11:20)
6. moizr’s goal is to have an automated suite that can be built on. (tbachman, 16:12:18)
7. colindixon says that we can ask the integration team to run their tests with clustering enabled (tbachman, 16:12:36)
8. pantelis asks when we can start pushing patches again (tbachman, 16:13:50)
9. colindixon says we need to push the release and version bump patches before pushing updates to stable/helium (tbachman, 16:14:09)
10. Clustering openflow integration tests https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/ (moizr, 16:14:45)
11. pantelis asks if there will be an email indicating that it’s okay to push patches (tbachman, 16:16:13)
12. https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/ the cluster integration tests (colindixon, 16:17:00)
13. watch this patch to know when to merge to stable/helium (moizr, 16:17:10)
14. https://git.opendaylight.org/gerrit/#/c/12711 patch that needs to merge before pushing additional patches to stable/helium (tbachman, 16:17:36)
15. team reviews bugzilla to cover the in-progress patches (tbachman, 16:19:10)
16. The clustering test app wasn’t including odl-restconf, so there’s now a patch for this, and the deployment script has been changed to run the clustering test app so that anyone can run these tests. (tbachman, 16:21:34)
17. BUG 2284 (at startup no leader is elected yet) has been fixed in gerrit 12215, but needs testing (tbachman, 16:25:00)
18. BUG 2302 has been fixed by gerrit 12705 (tbachman, 16:26:27)
19. The exception in BUG 2320 may not be a real problem, or at least not something unique to clustering (caused by two apps trying to write the same thing) (tbachman, 16:28:09)
20. BUG 2327 is being taken on by ttkacik (tbachman, 16:28:44)
21. BUG 2335 is not a clustering bug per se, but we do want/need to add a feature to address this (tbachman, 16:29:47)
22. moizr hasn’t seen bug reports from those who are testing; he encourages testers/users to err on the side of filing something that may not be a bug rather than not filing at all (tbachman, 16:30:41)
2-Node Deployment Design (tbachman, 16:31:45)
1. In Li, trying to go to an active/standby setup (tbachman, 16:32:04)
2. For active/standby 2-node, you need a specific topology in order to have High Availability (tbachman, 16:32:24)
3. one of the controllers will be the primary (configured or elected — tbd), which will be the leader of all the shards and master of all the devices on the network (tbachman, 16:32:50)
4. In some network-partitioning cases, we need to either work with the devices we still have or not manage the network (tbachman, 16:33:18)
5. edwarnicke asks why other configurations would be precluded (tbachman, 16:34:09)
6. markmozolewski2 says that this is for Li, and to do otherwise would impose other requirements like finer grained sharding (tbachman, 16:34:49)
7. edwarnicke just wanted to make sure there was sufficient “architectural white space” to support other uses in the future (tbachman, 16:35:27)
8. There are 3 major areas for changes: Raft sharding and leader election; post-healing leader with dynamic shard and cluster configuration; having a NB IP alias for the team so that apps can contact one controller in the team (tbachman, 16:36:56)
9. Opting to provide hooks in code to influence leader election, allowing a different strategy for 2-node operation (tbachman, 16:37:59)
10. https://git.opendaylight.org/gerrit/#/c/12588/ Gerrit that implements this hook (tbachman, 16:38:50)
11. active/active cases are also under discussion, but not a goal for Li (tbachman, 16:40:02)
12. Question on what the expected recovery time is for partitions (tbachman, 16:41:31)
13. moizr says that things are broken up into small chunks (~2MB) and transferred. The recovery time is based on the last state and how much data is remaining to be synched. (tbachman, 16:42:10)
14. For Data Center use cases, the recovery time needs to be short; can be longer for service providers (minutes) (tbachman, 16:42:56)
15. dandrushko asks if there’s anything their team can contribute to clustering (tbachman, 16:44:25)
16. moizr says the biggest need right now is testing (tbachman, 16:44:33)
17. moizr says installing the following features should be sufficient: odl-dlux-all; odl-restconf-noauth; odl-mdsal-clustering; and odl-openflowplugin-flow-services (tbachman, 16:46:20)
18. dandrushko says they will try the integration test against their local environment (tbachman, 16:46:48)
19. moizr recommends building against master, as it has the post-SR1 patches (tbachman, 16:47:28)
20. question on persistence — is this available w/o clustering? moizr says you need clustering for persistence (tbachman, 16:48:06)
21. Alexander Bochkarev asks what the status of gerrit 12053 is (tbachman, 16:49:10)
22. dandrushko asks if there’s anything they can help with here (tbachman, 16:49:38)
23. edwarnicke asks if it’s possible to configure clustering with a 1-node cluster? (tbachman, 16:50:08)
24. moizr says yes, and this would give you persistence (tbachman, 16:50:20)
25. dandrushko says this feature is unstable in the stable/helium release, and asks if it will be stable in SR1 or SR2 (tbachman, 16:50:59)
26. moizr says this will be fixed for SR2 (tbachman, 16:51:12)
27. moizr asks dandrushko to build and test this using master and see if that fixes their issues (tbachman, 16:52:20)
28. Question on how many nodes clustering can be run on (tbachman, 16:54:08)
29. moizr says we’re testing on 3 nodes now, but there’s no limit (tbachman, 16:54:41)
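
Note on the 2-node discussion above (items 2 and 9 of the 2-Node Deployment Design topic): textbook Raft elects a leader only with a strict majority of the configured nodes, which is why a 2-node cluster cannot tolerate a single failure without a different election strategy. The following is a minimal sketch of that majority rule only, not the ODL implementation; the function names are hypothetical and chosen for illustration.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, reachable: int) -> bool:
    """True if the reachable nodes (including self) can elect a leader
    under the plain majority rule."""
    return reachable >= majority(cluster_size)


if __name__ == "__main__":
    # Show how many node failures each cluster size can tolerate.
    for size in (1, 2, 3, 5):
        tolerated = size - majority(size)
        print(f"{size}-node cluster: majority={majority(size)}, "
              f"tolerated failures={tolerated}")
```

Under this rule a 3-node cluster keeps quorum with one node down, while a 2-node cluster does not, which is consistent with the meeting's choice to add hooks that influence leader election for 2-node operation.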

Meeting ended at 16:57:36 UTC (full logs).

Action items

(none)

People present (lines said)

tbachman (64)
moizr (7)
odl_meetbot (7)
rexpugh (1)
colindixon (1)
tbackman (0)

Generated by MeetBot 0.1.4.