#opendaylight-clustering: “clustering hackers”
Meeting started by moizr at 16:06:59 UTC
(full logs).
Meeting summary
- agenda (tbachman, 16:07:39)
- bug list review (tbachman,
16:08:01)
- moizr says that the problems found during
stable/helium update 1 are with transaction chains, which will be
seen with openflow applications (tbachman,
16:09:06)
- moizr says the fixes for these are in, and the
integration tests in the ODL infrastructure are working much better
(down to about 5-6 failures) (tbachman,
16:10:09)
- the goal is to improve the testing for 3-node
clustering, and have this hardened in time for SR2 (tbachman,
16:10:30)
- The SR2 date is 1/12/2015 (tbachman,
16:11:20)
- moizr’s goal is to have an automated suite that
can be built on. (tbachman,
16:12:18)
- colindixon says that we can ask the integration
team to run their tests with clustering enabled (tbachman,
16:12:36)
- pantelis asks when we can start pushing patches
again (tbachman,
16:13:50)
- colindixon says we need to push the release and
version bump patches before pushing updates to stable/helium
(tbachman,
16:14:09)
- Clustering openflow integration tests
https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/
(moizr,
16:14:45)
- pantelis asks if there will be an email
indicating that it’s okay to push patches (tbachman,
16:16:13)
- https://jenkins.opendaylight.org/integration/job/integration-master-csit-cluster-min/
the cluster integration tests (colindixon,
16:17:00)
- watch this patch to know when to merge to
stable/helium (moizr,
16:17:10)
- https://git.opendaylight.org/gerrit/#/c/12711
patch that needs to merge before pushing additional patches to
stable/helium (tbachman,
16:17:36)
- team reviews bugzilla to cover the in-progress
patches (tbachman,
16:19:10)
- The clustering test app wasn’t including
odl-restconf, so there’s now a patch for this, and the deployment
script has been changed to run the clustering test app so that
anyone can run these tests. (tbachman,
16:21:34)
- BUG 2284 (at startup no leader is elected yet)
has been fixed in gerrit 12215, but needs testing (tbachman,
16:25:00)
- BUG 2302 has been fixed by gerrit 12705
(tbachman,
16:26:27)
- The exception in BUG 2320 may not be a real
problem, or at least not something unique to clustering (caused by
two apps trying to write the same thing) (tbachman,
16:28:09)
- BUG 2327 is being taken on by ttkacik
(tbachman,
16:28:44)
- BUG 2335 is not a clustering bug per se, but
we do want/need to add a feature to address this (tbachman,
16:29:47)
- moizr hasn’t seen bug reports from those who
are testing, and encourages testers/users to err on the side of filing
something that may not be a bug rather than not filing at all
(tbachman,
16:30:41)
- 2-Node Deployment Design (tbachman, 16:31:45)
- In Li, trying to go to an active/standby
setup (tbachman,
16:32:04)
- For active/standby 2-node, you need a specific
topology in order to have High Availability (tbachman,
16:32:24)
- one of the controllers will be the primary
(configured or elected — tbd), which will be the leader of all the
shards and master of all the devices on the network (tbachman,
16:32:50)
- There are cases of network partitioning
where we need to either work with the devices we can still reach
or stop managing the network (tbachman,
16:33:18)
- edwarnicke asks why other configurations would
be precluded (tbachman,
16:34:09)
- markmozolewski2 says that this is for Li, and
to do otherwise would impose other requirements like finer grained
sharding (tbachman,
16:34:49)
- edwarnicke just wanted to make sure there was
sufficient “architectural white space” to support other uses in the
future (tbachman,
16:35:27)
- There are 3 major areas for changes: Raft
sharding and leader election; post-healing leader with dynamic shard
and cluster configuration; having a northbound (NB) IP alias for the
team so that apps can contact one controller in the team (tbachman,
16:36:56)
- Opting to provide hooks in code to influence
leader election, allowing a different strategy for 2-node
operation (tbachman,
16:37:59)
- https://git.opendaylight.org/gerrit/#/c/12588/
Gerrit that implements this hook (tbachman,
16:38:50)
- active/active cases are also under discussion,
but not a goal for Li (tbachman,
16:40:02)
- Question on what the expected recovery time is
for partitions (tbachman,
16:41:31)
- moizr says that things are broken up into small
chunks (~2MB) and transferred. The recovery time is based on the
last state and how much data is remaining to be synched.
(tbachman,
16:42:10)
- For Data Center use cases, the recovery time
needs to be short; can be longer for service providers
(minutes) (tbachman,
16:42:56)
- dandrushko asks if there’s anything their team
can contribute to clustering (tbachman,
16:44:25)
- moizr says the biggest need right now is
testing (tbachman,
16:44:33)
- moizr says installing the following features
should be sufficient: odl-dlux-all; odl-restconf-no-auth;
odl-mdsal-clustering; and odl-openflowplugin-flow-services
(tbachman,
16:46:20)
- dandrushko says they will try the integration
test against their local environment (tbachman,
16:46:48)
- moizr recommends building against master, as it
has the post-SR1 patches (tbachman,
16:47:28)
- question on persistence — is this available
w/o clustering? moizr says you need clustering for
persistence (tbachman,
16:48:06)
- Alexander Bochkarev asks what the status of
gerrit 12053 is (tbachman,
16:49:10)
- dandrushko asks if there’s anything they can
help with here (tbachman,
16:49:38)
- edwarnicke asks if it’s possible to configure
clustering with a 1-node cluster (tbachman,
16:50:08)
- moizr says yes, and this would give you
persistence (tbachman,
16:50:20)
- dandrushko says this feature is unstable in the
stable/helium release, and asks if it will be stable in SR1 or
SR2 (tbachman,
16:50:59)
- moizr says this will be fixed for SR2
(tbachman,
16:51:12)
- moizr asks dandrushko to build and test this
using master and see if that fixes their issues (tbachman,
16:52:20)
- Question on how many nodes clustering can be
run on (tbachman,
16:54:08)
- moizr says we’re testing on 3 nodes now, but
there’s no limit (tbachman,
16:54:41)
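
A rough illustration of why the 2-node items above call for a hook into leader election: standard Raft needs votes from a strict majority of the cluster, so a 2-node cluster cannot elect a leader once either node is unreachable, while a 3-node cluster can. The sketch below is not ODL code. The names `majority`, `can_elect`, and `primary_wins` are hypothetical, and the "designated primary may lead alone" rule is only one assumed strategy that such a hook might allow; the minutes leave the actual choice (configured or elected) open.

```python
def majority(cluster_size: int) -> int:
    """Votes needed under standard Raft: a strict majority of all members."""
    return cluster_size // 2 + 1


def can_elect(cluster_size: int, reachable: int) -> bool:
    """True if the reachable members (candidate included) form a majority."""
    return reachable >= majority(cluster_size)


def primary_wins(is_primary: bool, cluster_size: int, reachable: int) -> bool:
    """Hypothetical 2-node strategy (an assumption, not the ODL hook).

    The designated primary may lead alone in a 2-node cluster; every
    other case falls back to the normal majority rule.
    """
    if cluster_size == 2 and is_primary:
        return True
    return can_elect(cluster_size, reachable)


if __name__ == "__main__":
    # Show, per cluster size, whether an election still succeeds
    # after exactly one node becomes unreachable.
    for size in (1, 2, 3):
        print(size, "nodes: majority =", majority(size),
              "| survives one failure:", can_elect(size, size - 1))
```

With the default majority rule, the surviving node of a 2-node cluster can never win an election, which is the gap an alternative strategy would have to cover, at the cost of handling the split-brain case separately.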
Meeting ended at 16:57:36 UTC
(full logs).
Action items
- (none)
People present (lines said)
- tbachman (64)
- moizr (7)
- odl_meetbot (7)
- rexpugh (1)
- colindixon (1)
- tbackman (0)
Generated by MeetBot 0.1.4.