#opendaylight-meeting: odl-clustering

Meeting started by tbachman at 16:37:12 UTC (full logs).

Meeting summary

  1. agenda (tbachman, 16:37:53)
    1. technical aspects: strengths, weaknesses; where we are going (deployments/requirements, stable helium, lithium, long-term) (tbachman, 16:39:56)
    2. team-project aspects: coding, integration, testing (application, longevity) (tbachman, 16:40:18)
    3. alagalah asks why clustering is a feature as opposed to a core component of the controller (tbachman, 16:40:52)
    4. jmedved says that, from a historical perspective, the SAL is there for applications and hides the locality of where things actually run; clustering would be solved somewhere in the SAL at some later point (tbachman, 16:42:13)
    5. after presentations on what was done with APIC and akka, those seemed like good concepts worth adopting, and they could be used inside the MD-SAL (tbachman, 16:44:21)
    6. so akka provides the messaging for the message bus (tbachman, 16:44:47)
    7. moiz notes we should probably start with requirements (tbachman, 16:46:26)
    8. https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit (tbachman, 16:46:36)

  2. requirements (tbachman, 16:47:58)
    1. GBP draft scaling requirements: https://wiki.opendaylight.org/view/Group_Policy:Scaling (alagalah, 16:48:23)
    2. https://wiki.opendaylight.org/view/Group_Policy:Scaling <= wiki page describing requirements for GBP (tbachman, 16:49:07)
    3. md-sal is a framework for building applications (tbachman, 16:50:47)
    4. it has RPC, data, and notifications (tbachman, 16:50:54)
    5. rovarga points out that data is really data + notifications (tbachman, 16:51:17)
    6. for clustering, when making an RPC and the instance is on a different controller, that call has to be routed to the appropriate controller (tbachman, 16:52:10)
    7. the clustering arch maintains a registry of all the service calls (tbachman, 16:52:45)
    8. for example: on instance 1, service foo is available; on instance 2, service foo is available (tbachman, 16:53:12)
    9. gossip is used as an eventually consistent protocol to distribute this registry (tbachman, 16:53:27)
    10. as a concrete example, if the service is the openflowplugin, and there is a method, add flow (tbachman, 16:54:31)
    11. where switch 1 is on instance 1, and switch 2 is on instance 2 (tbachman, 16:54:49)
    12. so the registry would show it can provide addFlow for switch 1 on instance 1 (see the routing sketch after this list) (tbachman, 16:55:13)
    13. is there a mechanism to ask what services are provided? (tbachman, 16:55:41)
    14. we don’t have that but could provide it (tbachman, 16:55:47)
    15. there is an optimization where if the RPC is local, it doesn’t hit this path at all (i.e. just makes the local call directly w/o any translation) (tbachman, 16:56:41)
    16. from an architectural perspective, clustering is just a layer of view (tbachman, 16:58:42)
    17. when designing an application, you typically use the binding-aware part, but internally, whatever’s stored in the data store is in the normalized format (tbachman, 17:03:35)
    18. The binding aware data broker talks to the dom data broker, which works on the DOM store (tbachman, 17:04:09)
    19. the DOM store has two implementations: an in-memory store and a clustered data store (tbachman, 17:04:55)
    20. the current implementation uses the instance identifier, which contains the namespace of the module, to implement a strategy to create shards based on the namespace (tbachman, 17:06:29)
    21. users would write code for their sharding strategy (see the sharding sketch after this list) (tbachman, 17:07:29)
    22. the operational and configuration data stores are already separated (tbachman, 17:08:10)
    23. folks commented that the operational data could be kept in-memory and the configuration data could be persisted (tbachman, 17:11:44)
    24. everything stored in the clustered data store is journaled and snapshots are created (tbachman, 17:13:32)
    25. This allows the configuration to be recreated when the cluster restarts (tbachman, 17:14:03)
    26. moiz says that we still have to perform a test of getting the data from disk or getting it from the leader (tbachman, 17:20:43)
    27. RAFT is being used, which is a consensus algorithm (tbachman, 17:25:56)
    28. instead of inventing a new algorithm, RAFT was selected (tbachman, 17:26:10)
    29. moiz presents logs of a leader and a follower (tbachman, 17:26:20)
    30. there’s a term in the LOG, which is an election term, indicating that at this point, this was the leader (tbachman, 17:26:36)
    31. description of RAFT algorithm ensues :) (tbachman, 17:27:32)
    32. moiz notes that configurable persistence is something that will be addressed (tbachman, 17:32:17)
    33. moiz says that each shard has its own in-memory data store (tbachman, 17:32:31)
    34. so the clustered data store uses the in-memory data store (tbachman, 17:32:56)
    35. the in-memory data store is the state of the shard (tbachman, 17:34:03)
    36. the part that takes care of the persistence is a layer around the in-memory data store which ensures it persists and recovers from disk (using akka) (tbachman, 17:35:26)
    37. To recover, it starts from a snapshot and replays the journal; if no snapshot is available, it starts from the beginning of the journal (see the recovery sketch after this list) (tbachman, 17:36:17)
    38. snapshots are made every 20k journal entries, and this interval is globally configurable (tbachman, 17:36:29)
    39. regXboi asks if there’s a way for snapshotting on a time-scheduled basis (tbachman, 17:37:11)
    40. moiz says it’s not done today, but akka persistence supports this (tbachman, 17:37:25)
    41. regXboi says it would be in addition, not an alternative (tbachman, 17:37:43)
    42. raghu67 notes the journal entries are also persisted, and the snapshot is only there to speed up recovery (tbachman, 17:38:06)
    43. regXboi says that anything that can be persisted can be corrupted (tbachman, 17:38:18)
    44. rovarga says a snapshot is a database checkpoint (tbachman, 17:38:36)
    45. those concerned could do a full data reconciliation of the shard (tbachman, 17:38:57)
    46. alagalah asks if there’s anything like checksums for journal entries (tbachman, 17:40:08)
    47. moiz says we don’t currently implement this (tbachman, 17:40:18)
    48. raghu67 says it’s a LevelDB implementation, so there may be something (tbachman, 17:40:32)
    49. How do we detect journal / snapshot entry corruption? (alagalah, 17:41:00)
    50. rovarga says that sharding strategy is not just how you carve up your data, but also reconciliation guarantees (tbachman, 17:42:55)
    51. also how many replicas you keep, how paranoid you are, whether you persist at all, etc. (rovarga, 17:43:46)
    52. rovarga says with checkpoints you can do some sanity checks, but these are paranoia/performance trade-offs (tbachman, 17:45:15)
    53. discussion also covered whether, due to variability in transaction size in the journal, SIZE should be a snapshotting criterion, i.e. rather than every 20k entries, snapshot if the journal reaches xxxx MB (alagalah, 17:46:24)
    54. persistence has sub-topics: snapshotting, journaling (tbachman, 17:46:47)
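
    A routing sketch of the registry described in items 6-15: a hypothetical (service, route) -> instance table that is filled locally and via gossip, with the local-call short-circuit from item 15. The names (RpcRoutingTable, resolve, isLocal) are illustrative assumptions, not the actual MD-SAL remote RPC code.

        import java.util.Map;
        import java.util.Optional;
        import java.util.concurrent.ConcurrentHashMap;

        // Hypothetical illustration of the routed-RPC registry: each instance
        // announces which (service, route) pairs it can serve, e.g.
        // ("addFlow", "switch1") -> "instance-1", and the table is spread
        // across the cluster by gossip (eventually consistent).
        public class RpcRoutingTable {
            private final Map<String, String> routes = new ConcurrentHashMap<>();
            private final String localInstance;

            public RpcRoutingTable(String localInstance) {
                this.localInstance = localInstance;
            }

            private static String key(String service, String route) {
                return service + "@" + route;
            }

            // Called for local provider registrations and for registrations
            // learned from other instances via gossip.
            public void register(String service, String route, String instance) {
                routes.put(key(service, route), instance);
            }

            // Where should this RPC be executed?
            public Optional<String> resolve(String service, String route) {
                return Optional.ofNullable(routes.get(key(service, route)));
            }

            // If the provider is local, skip the remoting path entirely (item 15).
            public boolean isLocal(String service, String route) {
                return localInstance.equals(routes.get(key(service, route)));
            }

            public static void main(String[] args) {
                RpcRoutingTable table = new RpcRoutingTable("instance-1");
                table.register("addFlow", "switch1", "instance-1"); // local provider
                table.register("addFlow", "switch2", "instance-2"); // learned via gossip

                System.out.println(table.isLocal("addFlow", "switch1"));       // true -> call directly
                System.out.println(table.resolve("addFlow", "switch2").get()); // instance-2 -> route remotely
            }
        }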
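
    A sharding sketch of the namespace-based strategy from items 20-21, assuming a hypothetical NamespaceShardStrategy class; the real controller derives the namespace from the instance identifier and exposes its own strategy API, so this only shows the mapping idea.

        import java.util.Map;

        // Hypothetical sketch of namespace-based sharding: the module namespace
        // at the root of an instance identifier selects the shard, and anything
        // unmatched falls into a default shard. Names are illustrative only.
        public class NamespaceShardStrategy {
            private static final String DEFAULT_SHARD = "default";
            private final Map<String, String> namespaceToShard;

            public NamespaceShardStrategy(Map<String, String> namespaceToShard) {
                this.namespaceToShard = namespaceToShard;
            }

            // Users supplying their own strategy would replace this mapping logic.
            public String shardFor(String moduleNamespace) {
                return namespaceToShard.getOrDefault(moduleNamespace, DEFAULT_SHARD);
            }

            public static void main(String[] args) {
                NamespaceShardStrategy strategy = new NamespaceShardStrategy(Map.of(
                        "urn:opendaylight:inventory", "inventory",
                        "urn:opendaylight:topology", "topology"));

                System.out.println(strategy.shardFor("urn:opendaylight:inventory")); // inventory
                System.out.println(strategy.shardFor("urn:example:unknown"));        // default
            }
        }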
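
    A recovery sketch of the journal/snapshot behaviour from items 24-25 and 36-38: writes are journaled, a snapshot is cut every 20k entries, and recovery replays the journal on top of the latest snapshot. All types here are hypothetical; the actual shard persistence is built on akka persistence rather than hand-rolled like this.

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Hypothetical sketch of journal + snapshot recovery for a shard: every
        // write is journaled first, a snapshot is cut every SNAPSHOT_INTERVAL
        // entries, and recovery replays the journal on top of the latest snapshot
        // (or from the beginning if no snapshot exists).
        public class ShardRecoverySketch {
            record JournalEntry(long index, String key, String value) {}
            record Snapshot(long lastIncludedIndex, Map<String, String> state) {}

            static final int SNAPSHOT_INTERVAL = 20_000; // "every 20k journal entries"

            private final Map<String, String> state = new HashMap<>();   // the in-memory data store
            private final List<JournalEntry> journal = new ArrayList<>();
            private Snapshot snapshot;
            private long lastIndex;

            void write(String key, String value) {
                journal.add(new JournalEntry(++lastIndex, key, value)); // persist to the journal first
                state.put(key, value);                                  // then apply to the in-memory store
                if (journal.size() >= SNAPSHOT_INTERVAL) {
                    snapshot = new Snapshot(lastIndex, new HashMap<>(state));
                    journal.clear(); // entries up to the snapshot are no longer needed for recovery
                }
            }

            // Start from the latest snapshot (or empty state), then replay the journal.
            Map<String, String> recover() {
                Map<String, String> recovered =
                        snapshot == null ? new HashMap<>() : new HashMap<>(snapshot.state());
                long from = snapshot == null ? 0 : snapshot.lastIncludedIndex();
                for (JournalEntry entry : journal) {
                    if (entry.index() > from) {
                        recovered.put(entry.key(), entry.value());
                    }
                }
                return recovered;
            }

            public static void main(String[] args) {
                ShardRecoverySketch shard = new ShardRecoverySketch();
                shard.write("flow-1", "drop");
                shard.write("flow-2", "forward");
                System.out.println(shard.recover()); // {flow-1=drop, flow-2=forward}
            }
        }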

  3. replication (tbachman, 17:47:37)
    1. https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit <= google doc capturing some elements from the meeting (tbachman, 17:49:23)
    2. regXboi says that since every shard has a single leader, there is no multi-master scenario (tbachman, 17:51:06)
    3. moiz says that with the MD-SAL data store, we need to have a single master (tbachman, 17:51:29)
    4. regXboi says there are some interesting usage scenarios that a single-master can’t touch (tbachman, 17:51:44)
    5. multi-master wasn’t addressed for clustering for helium, as it’s more complex (tbachman, 17:52:07)
    6. regXboi says that once you achieve geographic diversity, you don’t care what the sharding strategy is, and multi-master becomes an issue (tbachman, 17:53:35)
    7. moiz says there are use cases for multi-master, but it’s harder to do (tbachman, 17:53:51)
    8. moiz asks what are the types of things applications will want to do (e.g. cluster aware apps?) (tbachman, 17:54:18)
    9. moiz says that multi master and transactions is questionable (tbachman, 17:55:34)
    10. moiz says the current in-memory data store has a single committer, so we may have to move away from this implementation of the in-memory data store to support multi-master (tbachman, 17:56:43)
    11. to improve performance, shard leaders have to be colocated with the applications that use them (tbachman, 18:10:45)
    12. this raises the question of needing notifications when shard leaders change (tbachman, 18:11:12)
    13. rovarga says that tuning requires going to the next layer of writing additional code to optimize (tbachman, 18:13:13)
    14. as an example, using openflowplugin with switches, the applications may care about where the shard leader is and move the leader if needed (tbachman, 18:14:41)
    15. each switch in the network can have its own shard, which is colocated with the master instance that manages that switch (tbachman, 18:18:14)
    16. another approach is to have a shard contain the switches that the instance owns (tbachman, 18:19:47)
    17. regXboi asks how do you define that an openflow switch is colocated with a controller instance (tbachman, 18:21:15)
    18. this is logical control colocation, not physical (tbachman, 18:21:40)
    19. moiz continues with a 3-node replication example (tbachman, 18:22:35)
    20. the replicas go through an election process (tbachman, 18:22:44)
    21. as soon as a shard comes up, it’s a follower, and it waits for some period of time to be contacted by a leader (tbachman, 18:23:12)
    22. after the timeout it becomes a candidate and seeks votes (tbachman, 18:23:31)
    23. once the candidate receives a majority of votes, it becomes a leader and sends heartbeats to all the other nodes (tbachman, 18:23:53)
    24. the number of nodes is defined in configuration, so the node knows how many votes are needed for a majority (tbachman, 18:24:15)
    25. this means you can’t have a 3-node cluster with only one node coming up and have it become the leader (see the election sketch after this list) (tbachman, 18:25:37)
    26. all the transaction requests are forwarded to the leader (tbachman, 18:25:50)
    27. An example case is provided with a single leader and two followers (tbachman, 18:26:14)
    28. When a commit happens, the first thing the leader does is write to the journal (tbachman, 18:27:14)
    29. at the same time, replicas are sent to the followers (tbachman, 18:27:39)
    30. the followers then write them to their local journals (tbachman, 18:28:00)
    31. each follower reports back after writing to its journal, and the leader waits for the commit index to indicate that the followers have completed (see the commit-flow sketch after this list) (tbachman, 18:28:45)
    32. at this point, the leader’s in-memory data store can be updated (tbachman, 18:29:05)
    33. the current in-memory data store requires that a current transaction is completed before another transaction can be submitted (only one transaction in can-commit and pre-commit states at a time) (tbachman, 18:30:27)
    34. rovarga says that within a shard, you want serializable consistency (tbachman, 18:32:21)
    35. each follower creates a new local transaction to commit the replica to its local in-memory data store (tbachman, 18:33:35)
    36. well... causal consistency is probably good enough, but IMDS does serializable simply because extracting/analyzing potential parallelism has comparable performance cost as just doing the actual work (rovarga, 18:34:26)
    37. you need to have an event when becoming a leader to be able to catch up (tbachman, 18:35:54)
    38. there is no consistency guarantee across shard boundaries (tbachman, 18:45:28)
    39. If you want notifications from two different sub-trees, they would have to be in the same shard (tbachman, 18:46:41)
    40. within a shard, the data change notifications provide a serialized view of the changes (tbachman, 18:47:11)
    41. ghall voices concern about application writers having to know this level of complexity, and wonders if this can be managed using layering (tbachman, 18:52:46)
    42. raghu67 notes that we can support this, but each layer becomes domain specific (tbachman, 18:56:02)
    43. the group is trying to determine the notifications, etc. required of just the data store itself (tbachman, 18:56:23)
    44. and layers can be placed above this that can simplify things for application developers (tbachman, 18:56:42)
    45. jmedved says that remote notifications might be needed (tbachman, 18:57:46)
    46. raghu67 says we could use subscription, and deliver the notifications based on where it was registered (tbachman, 19:02:25)
    47. we’re missing a notification when a follower becomes a leader (bug/add) (tbachman, 19:04:58)
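
    An election sketch of the RAFT behaviour walked through in items 20-25: a replica starts as a follower, times out into a candidate, and becomes leader only with a majority of votes out of the configured cluster size. This is a stripped-down illustration, not the controller’s actual RAFT implementation.

        // Hypothetical sketch of the RAFT election transitions: every replica
        // starts as a follower; if no heartbeat arrives before the election
        // timeout it becomes a candidate, and it becomes leader only after a
        // majority of the configured cluster size votes for it.
        public class RaftElectionSketch {
            enum Role { FOLLOWER, CANDIDATE, LEADER }

            private final int clusterSize;      // known from configuration (item 24)
            private Role role = Role.FOLLOWER;  // every replica starts as a follower
            private long term;                  // the election term seen in the logs
            private int votes;

            RaftElectionSketch(int clusterSize) {
                this.clusterSize = clusterSize;
            }

            // No heartbeat arrived before the election timeout fired.
            void onElectionTimeout() {
                role = Role.CANDIDATE;
                term++;
                votes = 1; // vote for ourselves, then request votes from the peers
            }

            void onVoteGranted() {
                if (role != Role.CANDIDATE) {
                    return;
                }
                votes++;
                // A lone node in a 3-node cluster can never reach 2 votes, so it
                // cannot elect itself leader (item 25).
                if (votes > clusterSize / 2) {
                    role = Role.LEADER; // start sending heartbeats to the other nodes
                }
            }

            void onHeartbeat(long leaderTerm) {
                if (leaderTerm >= term) {
                    term = leaderTerm;
                    role = Role.FOLLOWER; // a live leader keeps everyone else a follower
                }
            }

            public static void main(String[] args) {
                RaftElectionSketch node = new RaftElectionSketch(3);
                node.onElectionTimeout(); // CANDIDATE, one vote (its own)
                node.onVoteGranted();     // 2 of 3 votes -> LEADER
                System.out.println(node.role);
            }
        }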
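
    A commit-flow sketch of items 26-33, leader side only: journal the entry, replicate to the followers, and apply to the in-memory store once enough acknowledgements arrive (shown here as a RAFT majority). Persistence, messaging, and the data tree are stubbed out, and all names are assumptions.

        import java.util.HashMap;
        import java.util.Map;

        // Hypothetical leader-side sketch of the commit flow: journal the entry,
        // replicate it to the followers, and apply it to the in-memory store once
        // a majority (leader included) has acknowledged.
        public class LeaderCommitSketch {
            private final int clusterSize;
            private final Map<Long, Integer> acks = new HashMap<>(); // log index -> ack count
            private long commitIndex;
            private long lastLogIndex;

            LeaderCommitSketch(int clusterSize) {
                this.clusterSize = clusterSize;
            }

            // A transaction submit arrives (possibly forwarded from a follower, item 26).
            long submit(String serializedModification) {
                long index = ++lastLogIndex;
                appendToJournal(index, serializedModification);      // 1. write to the leader's journal
                acks.put(index, 1);                                  //    the leader counts as one ack
                replicateToFollowers(index, serializedModification); // 2. send replicas to the followers
                return index;
            }

            // 3. A follower reports that it has written the entry to its own journal.
            void onFollowerAck(long index) {
                if (acks.merge(index, 1, Integer::sum) > clusterSize / 2 && index > commitIndex) {
                    commitIndex = index;
                    applyToInMemoryStore(index);                     // 4. update the leader's in-memory store
                }
            }

            private void appendToJournal(long index, String entry) { /* persistence layer */ }
            private void replicateToFollowers(long index, String entry) { /* messaging layer */ }
            private void applyToInMemoryStore(long index) { /* commit to the shard's data tree */ }

            public static void main(String[] args) {
                LeaderCommitSketch leader = new LeaderCommitSketch(3);
                long index = leader.submit("put /flows/flow-1");
                leader.onFollowerAck(index);            // leader + one follower = majority of 3
                System.out.println(leader.commitIndex); // 1
            }
        }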

  4. notifications (tbachman, 20:09:22)
    1. moiz asks whether we need to identify ourselves when we register as a consumer (tbachman, 20:09:53)
    2. and whether this is an API enhancement (tbachman, 20:10:20)
    3. rovarga says you can just register a QName, which means “give me all notifications” (tbachman, 20:10:37)
    4. you can also have something more flexible, at which point the question becomes are we going to declare it or just define a filter (tbachman, 20:10:59)
    5. moiz asks how you would register to get notifications only when they’re local, versus for all notifications (tbachman, 20:11:37)
    6. registerChangeListener(scope, identifier, listener) is what we have currently (tbachman, 20:12:48)
    7. do we enhance this API, to be (scope, only me, identifier, listener)? (tbachman, 20:13:17)
    8. ttkacik says you could have two data brokers: local, and super shard (tbachman, 20:14:08)
    9. the case that we’re trying to support with this is where the listener could be anywhere in the cluster (tbachman, 20:15:23)
    10. rovarga proposes an interface DataChangeListener that has an onDataChanged() method, and other interfaces that extend it, such as a NotifyOneOnly interface that adds a getIdentity() method (see the interface sketch after this list) (tbachman, 20:26:18)
    11. regXboi +1 to that idea (regXboi, 20:27:16)
    12. this makes applications cluster-aware (tbachman, 20:27:24)
    13. this also allows code that works on one node to also work on multiple nodes (tbachman, 20:28:06)
    14. for cluster-aware applications, we block unless we’re the leader (tbachman, 20:36:22)
    15. ghall asks if the model has to be a DAG. Answer is yes (tbachman, 20:59:23)
    16. ghall asks if there’s a way to write code once, and regardless of the sharding scheme of the leafrefs, the notifications will get back to the listener (tbachman, 21:00:18)
    17. rovarga says we are honoring all explicitly stated references (tbachman, 21:01:24)
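
    An interface sketch of the proposal in items 6-10: keep the onDataChanged() contract and let cluster-aware applications opt in to extra delivery semantics through marker sub-interfaces. DataChangeEvent, NotifyOneOnly, and ExampleBroker are assumed names, not the real MD-SAL API.

        // Hypothetical sketch of the proposed listener interfaces: keep the existing
        // onDataChanged() contract and let cluster-aware applications opt in to extra
        // delivery semantics through marker sub-interfaces. All names are illustrative.
        interface DataChangeEvent {
            String path(); // identifier of the changed subtree
        }

        interface DataChangeListener {
            void onDataChanged(DataChangeEvent change);
        }

        // A listener implementing this asks the broker to deliver each change to only
        // one registrant with the given identity anywhere in the cluster.
        interface NotifyOneOnly extends DataChangeListener {
            String getIdentity();
        }

        // Registration keeps its current shape, so single-node code keeps working on a
        // cluster (item 13); the broker inspects the listener to decide how to deliver.
        class ExampleBroker {
            void registerChangeListener(String scope, String identifier, DataChangeListener listener) {
                if (listener instanceof NotifyOneOnly one) {
                    System.out.println("deliver once per identity: " + one.getIdentity());
                } else {
                    System.out.println("deliver to all registrants for " + identifier);
                }
            }

            public static void main(String[] args) {
                ExampleBroker broker = new ExampleBroker();
                broker.registerChangeListener("SUBTREE", "/inventory", change -> { });
            }
        }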

  5. logistics and planning (tbachman, 21:08:26)
    1. tpantelis asked if we need a sync up call (tbachman, 21:08:36)
    2. jmedved agrees we need one (tbachman, 21:08:42)
    3. discussion on resources — HP, Brocade, Cisco all looking to contribute resources (tbachman, 21:09:00)
    4. plan is to leverage the monday 8am PST MD-SAL meetings for covering clustering (tbachman, 21:11:20)
    5. https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan Proposed Lithium simultaneous release plan (tbachman, 21:25:40)
    6. hackers/design meetings are monday morning 8am PST (tbachman, 21:25:49)
    7. possible resources include Cisco:3, HP: 1, Noiro:1/4, Brocade: 2, Ericsson: ?, so 7+? (tbachman, 21:26:05)
    8. webex meetings (tbachman, 21:26:20)
    9. IRC channel — team may set one up (tbachman, 21:26:43)
    10. will put design on the wiki (tbachman, 21:26:59)
    11. Possibly review this on the TWS calls (tbachman, 21:28:15)

  6. lithium requirements (tbachman, 21:29:32)
    1. hardening and performance is #1 (tbachman, 21:33:37)
    2. details for hardening and performance: use streaming of NormalizedNode; configurable persistence; don’t serialize/stream NormalizedNode when message is local (tbachman, 21:35:58)
    3. test bed requirements: 1 setup for single-node integration tests, 5 5-node clusters for testers, 1 5-node cluster for 1-day longevity tests, 1 5-node cluster for 1-week longevity tests, and 1 5-node cluster for 1-month longevity tests (tbachman, 21:46:29)
    4. other items: programmatic sharding and team config (tbachman, 21:47:07)
    5. other items: updates to data change deliveries (“me only”, “all listeners”) (tbachman, 21:49:25)
    6. other items: notifications (tbachman, 21:49:32)
    7. other items: finer grained sharding (tbachman, 21:50:08)
    8. other items: data broker for clustered data store (tbachman, 21:50:16)
    9. performance numbers: for GBP Unified Communications, 240 flows/second, 100k endpoints (tbachman, 21:54:23)
    10. performance numbers for GBP NFV: 10M endpoints (tbachman, 21:57:36)
    11. ACTION: alagalah/tbachman to test use case numbers in data store, and report memory usage (tbachman, 22:02:21)
    12. ACTION: clustering group to ask community for performance characteristics they’re looking for (tbachman, 22:03:06)
    13. maybe include reference configurations on the wiki (tbachman, 22:05:49)
    14. group may schedule some hackathons for clustering (tbachman, 22:06:21)
    15. other items: enhance RAFT implementation for openflowplugin (tbachman, 22:14:03)
    16. ACTION: moiz and tpantelis will create bugs for known issues (tbachman, 22:18:06)
    17. ACTION: jmedved to look into hackathons (tbachman, 22:21:11)
    18. ACTION: alagalah to help set up IRC channel (tbachman, 22:21:22)
    19. ACTION: alagalah to work on setting up TWS call for clustering (tbachman, 22:21:59)
    20. ACTION: moiz to update design on wiki (tbachman, 22:22:52)
    21. ACTION: jmedved to contact phrobb to set up webex for meetings (tbachman, 22:23:54)


Meeting ended at 22:27:46 UTC (full logs).

Action items

  1. alagalah/tbachman to test use case numbers in data store, and report memory usage
  2. clustering group to ask community for performance characteristics they’re looking for
  3. moiz and tpantelis will create bugs for known issues
  4. jmedved to look into hackathons
  5. alagalah to help set up IRC channel
  6. alagalah to work on setting up TWS call for clustering
  7. moiz to update design on wiki
  8. jmedved to contact phrobb to set up webex for meetings


Action items, by person

  1. alagalah
    1. alagalah/tbachman to test use case numbers in data store, and report memory usage
    2. alagalah to help set up IRC channel
    3. alagalah to work on setting up TWS call for clustering
  2. jmedved
    1. jmedved to look into hackathons
    2. jmedved to contact phrobb to set up webex for meetings
  3. tbachman
    1. alagalah/tbachman to test use case numbers in data store, and report memory usage
  4. UNASSIGNED
    1. clustering group to ask community for performance characteristics they’re looking for
    2. moiz and tpantelis will create bugs for known issues
    3. moiz to update design on wiki


People present (lines said)

  1. tbachman (210)
  2. regXboi (23)
  3. odl_meetbot (17)
  4. alagalah (7)
  5. gzhao (2)
  6. rovarga (2)
  7. harman_ (2)
  8. raghu67 (1)
  9. jmedved (0)


Generated by MeetBot 0.1.4.