=====================================
#opendaylight-meeting: odl-clustering
=====================================

Meeting started by tbachman at 16:37:12 UTC. The full logs are available at
http://meetings.opendaylight.org/opendaylight-meeting/2014/odl_clustering/opendaylight-meeting-odl_clustering.2014-10-01-16.37.log.html .

Meeting summary
---------------
* agenda (tbachman, 16:37:53)
* technical aspects: strengths, weaknesses; where we are going (deployments/requirements, stable helium, lithium, long-term) (tbachman, 16:39:56)
* team-project aspects: coding, integration, testing (application, longevity) (tbachman, 16:40:18)
* alagalah asks why clustering is a feature as opposed to a core component of the controller (tbachman, 16:40:52)
* jmedved says that, from a historical perspective, the SAL is there for applications and can hide where applications are located; clustering would be solved somewhere in the SAL at some later point (tbachman, 16:42:13)
* after presentations on what was done with APIC and akka, those seemed to be good concepts which should be adopted and could be used inside the MD-SAL (tbachman, 16:44:21)
* so akka provides the messaging for the message bus (tbachman, 16:44:47)
* moiz notes we should probably start with requirements (tbachman, 16:46:26)
* LINK: https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit (tbachman, 16:46:36)
* requirements (tbachman, 16:47:58)
* GBP draft scaling requirements: #link https://wiki.opendaylight.org/view/Group_Policy:Scaling (alagalah, 16:48:23)
* LINK: https://wiki.opendaylight.org/view/Group_Policy:Scaling <= wiki page describing requirements for GBP (tbachman, 16:49:07)
* md-sal is a framework for building applications (tbachman, 16:50:47)
* it has RPCs, data, and notifications (tbachman, 16:50:54)
* rovarga points out that data is really data + notifications (tbachman, 16:51:17)
* for clustering, when an RPC is made and the target instance is on a different controller, the call has to be routed to the appropriate controller (tbachman, 16:52:10)
* the clustering architecture maintains a registry of all the service calls (tbachman, 16:52:45)
* for example: on instance 1, service foo is available; on instance 2, service foo is available (tbachman, 16:53:12)
* gossip is used as an eventually consistent protocol to distribute this registry (tbachman, 16:53:27)
* as a concrete example, the service could be the openflowplugin, which has an add-flow method (tbachman, 16:54:31)
* where switch 1 is on instance 1, and switch 2 is on instance 2 (tbachman, 16:54:49)
* so the registry would show that instance 1 can provide addflow for switch 1 (tbachman, 16:55:13)
* is there a mechanism to ask what services are provided? (tbachman, 16:55:41)
* we don't have that but could provide it (tbachman, 16:55:47)
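
A minimal sketch of the routed-RPC registry idea in the bullets above, using hypothetical names (RpcRegistrySketch, RouteKey) rather than actual MD-SAL classes; in the real system the announcements would be distributed between cluster members by gossip::

  import java.util.Map;
  import java.util.Set;
  import java.util.concurrent.ConcurrentHashMap;

  // Sketch only: each member announces which RPCs it can serve for which
  // route (e.g. "addFlow" for switch 1), and callers look up the owning
  // instance before invoking the RPC remotely.
  public final class RpcRegistrySketch {

      // A route is identified by the RPC name plus the routing context,
      // e.g. ("addFlow", "openflow:1").
      record RouteKey(String rpcName, String routePath) {}

      // route -> name of the controller instance that can serve it
      private final Map<RouteKey, String> routes = new ConcurrentHashMap<>();

      // Called when a local provider registers; the announcement would then
      // be spread to the other members.
      public void announce(String rpcName, String routePath, String instance) {
          routes.put(new RouteKey(rpcName, routePath), instance);
      }

      // Where should an invocation of this RPC be routed?
      public String ownerOf(String rpcName, String routePath) {
          return routes.get(new RouteKey(rpcName, routePath));
      }

      // The "what services are provided?" query raised in the meeting.
      public Set<RouteKey> knownRoutes() {
          return Set.copyOf(routes.keySet());
      }

      public static void main(String[] args) {
          RpcRegistrySketch registry = new RpcRegistrySketch();
          registry.announce("addFlow", "openflow:1", "member-1");
          registry.announce("addFlow", "openflow:2", "member-2");
          // addFlow for switch 1 is served by member-1, as in the example above
          System.out.println(registry.ownerOf("addFlow", "openflow:1"));
      }
  }
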
* there is an optimization where, if the RPC is local, it doesn't hit this path at all (i.e. it just makes the local call directly without any translation) (tbachman, 16:56:41)
* from an architectural perspective, clustering is just a layer of view (tbachman, 16:58:42)
* when designing an application, you typically use the binding-aware part, but internally whatever is stored in the data store is in the normalized format (tbachman, 17:03:35)
* the binding-aware data broker talks to the DOM data broker, which works on the DOM store (tbachman, 17:04:09)
* the DOM store has two implementations: an in-memory store and a clustered data store (tbachman, 17:04:55)
* the current implementation uses the instance identifier, which contains the namespace of the module, to implement a strategy that creates shards based on the namespace (tbachman, 17:06:29)
* users would write code for their own sharding strategy (tbachman, 17:07:29)
* the operational and configuration data stores are already separated (tbachman, 17:08:10)
* folks commented that the operational data could be kept in memory and the configuration data could be persisted (tbachman, 17:11:44)
* everything stored in the clustered data store is journaled, and snapshots are created (tbachman, 17:13:32)
* this allows a restarted cluster to have its configuration recreated (tbachman, 17:14:03)
* moiz says that we still have to perform a test of getting the data from disk versus getting it from the leader (tbachman, 17:20:43)
* RAFT, a consensus algorithm, is being used (tbachman, 17:25:56)
* instead of inventing a new algorithm, RAFT was selected (tbachman, 17:26:10)
* moiz presents logs of a leader and a follower (tbachman, 17:26:20)
* there's a term in the log, which is an election term, indicating that at this point this node was the leader (tbachman, 17:26:36)
* description of the RAFT algorithm ensues :) (tbachman, 17:27:32)
* moiz notes that configurable persistence is something that will be addressed (tbachman, 17:32:17)
* moiz says that each shard has its own in-memory data store (tbachman, 17:32:31)
* so the clustered data store uses the in-memory data store (tbachman, 17:32:56)
* the in-memory data store is the state of the shard (tbachman, 17:34:03)
* the part that takes care of persistence is a layer around the in-memory data store which ensures it persists to and recovers from disk (using akka) (tbachman, 17:35:26)
* to recover, it starts from a snapshot and replays the journal; if no snapshot is available, it starts from the beginning of the journal (tbachman, 17:36:17)
* snapshots are made every 20k journal entries, and this is globally configurable (tbachman, 17:36:29)
* regXboi asks if there's a way to snapshot on a time-scheduled basis (tbachman, 17:37:11)
* moiz says it's not done today, but akka persistence supports this (tbachman, 17:37:25)
* regXboi says it would be an addition, not an alternative (tbachman, 17:37:43)
* raghu67 notes the journal entries are also persisted, and the snapshot is only there to speed up recovery (tbachman, 17:38:06)
* regXboi says that anything that can be persisted can be corrupted (tbachman, 17:38:18)
* rovarga says a snapshot is a database checkpoint (tbachman, 17:38:36)
* those concerned could do a full data reconciliation of the shard (tbachman, 17:38:57)
* alagalah asks if there's anything like checksums for journal entries (tbachman, 17:40:08)
* moiz says we don't currently implement this (tbachman, 17:40:18)
* raghu67 says it's a LevelDB implementation, so there may be something (tbachman, 17:40:32)
* how do we detect journal / snapshot entry corruption? (alagalah, 17:41:00)
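
Per-entry checksums are not implemented today (per the bullets above); a hedged sketch of what a CRC32-framed journal entry could look like, with hypothetical framing that is not the actual LevelDB/akka-persistence format::

  import java.nio.ByteBuffer;
  import java.nio.charset.StandardCharsets;
  import java.util.zip.CRC32;

  // Sketch only: store a CRC32 next to each serialized entry and verify it
  // on recovery so corrupted entries can be detected.
  public final class JournalChecksumSketch {

      // Frame layout: [crc32][payload length][payload bytes]
      static byte[] frame(byte[] payload) {
          CRC32 crc = new CRC32();
          crc.update(payload);
          ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + Integer.BYTES + payload.length);
          buf.putLong(crc.getValue());
          buf.putInt(payload.length);
          buf.put(payload);
          return buf.array();
      }

      // Verify on recovery; returns the payload or throws if corrupted.
      static byte[] unframe(byte[] framed) {
          ByteBuffer buf = ByteBuffer.wrap(framed);
          long storedCrc = buf.getLong();
          byte[] payload = new byte[buf.getInt()];
          buf.get(payload);
          CRC32 crc = new CRC32();
          crc.update(payload);
          if (crc.getValue() != storedCrc) {
              throw new IllegalStateException("journal entry failed checksum");
          }
          return payload;
      }

      public static void main(String[] args) {
          byte[] entry = "write /inventory/node/openflow:1".getBytes(StandardCharsets.UTF_8);
          byte[] onDisk = frame(entry);
          System.out.println(new String(unframe(onDisk), StandardCharsets.UTF_8));
      }
  }
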
* rovarga says that a sharding strategy is not just how you carve up your data, but also what reconciliation guarantees you provide (tbachman, 17:42:55)
* also how many replicas you keep, how paranoid you are, whether you persist at all, etc. (rovarga, 17:43:46)
* rovarga says with checkpoints you can do some sanity checking, but this is a paranoia/performance trade-off (tbachman, 17:45:15)
* because of variability in transaction size in the journal, there was also discussion of whether size should be a snapshotting criterion, i.e. rather than 20k entries, snapshot when the journal reaches xxxx MB (alagalah, 17:46:24)
* persistence has sub-topics: snapshotting, journaling (tbachman, 17:46:47)
* replication (tbachman, 17:47:37)
* LINK: https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit <= google doc capturing some elements from the meeting (tbachman, 17:49:23)
* regXboi says that since every shard has a single leader, there is no multi-master scenario (tbachman, 17:51:06)
* moiz says that with the MD-SAL data store, we need to have a single master (tbachman, 17:51:29)
* regXboi says there are some interesting usage scenarios that a single master can't touch (tbachman, 17:51:44)
* multi-master wasn't addressed for clustering in helium, as it's more complex (tbachman, 17:52:07)
* regXboi says that once you achieve geographic diversity, you don't care what the sharding strategy is, and multi-master becomes an issue (tbachman, 17:53:35)
* moiz says there are use cases for multi-master, but it's harder to do (tbachman, 17:53:51)
* moiz asks what types of things applications will want to do (e.g. cluster-aware apps?) (tbachman, 17:54:18)
* moiz says that combining multi-master with transactions is questionable (tbachman, 17:55:34)
* moiz says the current in-memory data store has a single committer, so we may have to move away from this implementation of the in-memory data store to support multi-master (tbachman, 17:56:43)
* to improve performance, the shards have to be co-located with the shard leaders (tbachman, 18:10:45)
* this raises the question of needing notifications when shard leaders change (tbachman, 18:11:12)
* rovarga says that tuning requires going to the next layer and writing additional code to optimize (tbachman, 18:13:13)
* as an example, using the openflowplugin with switches, the applications may care about where the shard leader is and move the leader if needed (tbachman, 18:14:41)
* each switch in the network can have its own shard, which is colocated with the master instance that manages that switch (tbachman, 18:18:14)
* another approach is to have a shard contain the switches that the instance owns (tbachman, 18:19:47)
* regXboi asks how you define that an openflow switch is colocated with a controller instance (tbachman, 18:21:15)
* this is logical control colocation, not physical (tbachman, 18:21:40)
* moiz continues with a 3-node replication example (tbachman, 18:22:35)
* the replicas go through an election process (tbachman, 18:22:44)
* as soon as a shard comes up, it's a follower, and it waits to be contacted by a leader for some period of time (tbachman, 18:23:12)
* after the timeout it becomes a candidate and seeks votes (tbachman, 18:23:31)
* once the candidate receives a majority of votes, it becomes the leader and sends heartbeats to all the other nodes (tbachman, 18:23:53)
* the number of nodes is defined in configuration, so a node knows how many votes are needed for a majority (tbachman, 18:24:15)
* this means you can't have a 3-node cluster with only one node coming up and have it become the leader (tbachman, 18:25:37)
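
A minimal sketch of the election behaviour just described, using hypothetical names rather than the actual ODL RAFT actor code: a replica starts as a follower, becomes a candidate after its election timeout, and becomes leader only with a majority of the configured cluster size::

  import java.util.concurrent.ThreadLocalRandom;

  public final class RaftElectionSketch {

      enum Role { FOLLOWER, CANDIDATE, LEADER }

      private final int clusterSize;    // known from configuration
      private Role role = Role.FOLLOWER;
      private long currentTerm = 0;
      private int votesReceived = 0;

      RaftElectionSketch(int clusterSize) {
          this.clusterSize = clusterSize;
      }

      // Randomized timeout so that members don't all become candidates at once.
      long nextElectionTimeoutMillis() {
          return ThreadLocalRandom.current().nextLong(150, 300);
      }

      // No heartbeat from a leader arrived in time: start an election.
      void onElectionTimeout() {
          role = Role.CANDIDATE;
          currentTerm++;
          votesReceived = 1;            // votes for itself
          // ...would now send a vote request for currentTerm to all other members
      }

      // A vote was granted by another member.
      void onVoteGranted() {
          votesReceived++;
          if (role == Role.CANDIDATE && votesReceived > clusterSize / 2) {
              role = Role.LEADER;       // now starts sending heartbeats
          }
      }

      public static void main(String[] args) {
          RaftElectionSketch node = new RaftElectionSketch(3);
          node.onElectionTimeout();     // its own vote alone is not a majority of 3
          node.onVoteGranted();         // second vote: 2 of 3 -> leader
          System.out.println(node.role + " in term " + node.currentTerm);
      }
  }
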
* all the transaction requests are forwarded to the leader (tbachman, 18:25:50)
* an example case is provided with a single leader and two followers (tbachman, 18:26:14)
* when a commit happens, the first thing the leader does is write to the journal (tbachman, 18:27:14)
* at the same time, replicas are sent to the followers (tbachman, 18:27:39)
* the followers then write them to their local journals (tbachman, 18:28:00)
* each follower reports back after writing to its journal, and the leader waits for the commit index to indicate that the followers have completed (tbachman, 18:28:45)
* at this point, the leader's in-memory data store can be updated (tbachman, 18:29:05)
* the current in-memory data store requires that the current transaction is completed before another transaction can be submitted (only one transaction in the can-commit and pre-commit states at a time) (tbachman, 18:30:27)
* rovarga says that within a shard, you want serializable consistency (tbachman, 18:32:21)
* each follower creates a new local transaction to commit the replica to its local in-memory data store (tbachman, 18:33:35)
* well... causal consistency is probably good enough, but the IMDS does serializable simply because extracting/analyzing potential parallelism has a performance cost comparable to just doing the actual work (rovarga, 18:34:26)
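
A sketch of the commit sequence described above, assuming hypothetical Follower and journal types rather than the real shard actors: the leader journals the entry, replicates it, waits for the follower acknowledgements, and only then applies the change to its in-memory state::

  import java.util.List;
  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.CopyOnWriteArrayList;

  public final class ShardCommitSketch {

      interface Follower {
          // Follower persists the replica to its own journal and acknowledges.
          CompletableFuture<Void> replicate(String entry);
      }

      private final List<String> journal = new CopyOnWriteArrayList<>();
      private final ConcurrentHashMap<String, String> inMemoryStore = new ConcurrentHashMap<>();
      private final List<Follower> followers;

      ShardCommitSketch(List<Follower> followers) {
          this.followers = followers;
      }

      CompletableFuture<Void> commit(String key, String value) {
          String entry = key + "=" + value;
          journal.add(entry);                       // 1. leader writes its journal
          List<CompletableFuture<Void>> acks = followers.stream()
                  .map(f -> f.replicate(entry))     // 2. replicas sent to the followers
                  .toList();
          // 3. wait until followers have journalled the entry (commit index advances);
          //    a real implementation only needs a majority, this sketch waits for all
          return CompletableFuture.allOf(acks.toArray(CompletableFuture[]::new))
                  // 4. apply the change to the leader's in-memory data store
                  .thenRun(() -> inMemoryStore.put(key, value));
      }

      public static void main(String[] args) {
          Follower f1 = entry -> CompletableFuture.completedFuture(null);
          Follower f2 = entry -> CompletableFuture.completedFuture(null);
          ShardCommitSketch leader = new ShardCommitSketch(List.of(f1, f2));
          leader.commit("openflow:1/flow/1", "drop").join();
          System.out.println(leader.inMemoryStore);
      }
  }
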
* you need to have an event when becoming a leader to be able to catch up (tbachman, 18:35:54)
* there is no consistency guarantee across shard boundaries (tbachman, 18:45:28)
* if you want notifications from two different sub-trees, they would have to be in the same shard (tbachman, 18:46:41)
* within a shard, the data change notifications provide a serialized view of the changes (tbachman, 18:47:11)
* ghall voices concern about application writers having to know this level of complexity, and wonders if this can be managed using layering (tbachman, 18:52:46)
* raghu67 notes that we can support this, but each layer becomes domain specific (tbachman, 18:56:02)
* the group is trying to determine the notifications, etc. required of just the data store itself (tbachman, 18:56:23)
* layers can be placed above this to simplify things for application developers (tbachman, 18:56:42)
* jmedved says that remote notifications might be needed (tbachman, 18:57:46)
* raghu67 says we could use subscription, and deliver the notifications based on where they were registered (tbachman, 19:02:25)
* we're missing a notification when a follower becomes a leader (bug/add) (tbachman, 19:04:58)
* notifications (tbachman, 20:09:22)
* moiz asks, when we register ourselves to the consumer, whether we need to identify ourselves (tbachman, 20:09:53)
* and whether this is an API enhancement (tbachman, 20:10:20)
* rovarga says you can just do a QName, and that says "give me all notifications" (tbachman, 20:10:37)
* you can also have something more flexible, at which point the question becomes whether we declare it or just define a filter (tbachman, 20:10:59)
* moiz asks how you would request notifications only when they're local, versus all notifications (tbachman, 20:11:37)
* registerChangeListener(scope, identifier, listener) is what we have currently (tbachman, 20:12:48)
* do we enhance this API to be (scope, only me, identifier, listener)? (tbachman, 20:13:17)
* ttkacik says you could have two data brokers: local, and super shard (tbachman, 20:14:08)
* the case that we're trying to support with this is where the listener could be anywhere in the cluster (tbachman, 20:15:23)
* rovarga proposes an interface DataChangeListener that has an onDataChanged() method, with other interfaces that extend it, such as a NotifyOneOnly interface that adds a getIdentity() method (tbachman, 20:26:18)
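
A sketch of the listener shape rovarga proposed; the interface names mirror the proposal but are illustrative, not an existing MD-SAL API::

  public final class ListenerProposalSketch {

      // Placeholder for whatever change event the data broker delivers.
      record DataChangeEvent(String path) {}

      // Base contract: every listener just reacts to data changes.
      interface DataChangeListener {
          void onDataChanged(DataChangeEvent change);
      }

      // Extension for cluster-aware listeners that want the notification
      // delivered to only one member: the identity lets the infrastructure
      // pick a single recipient instead of notifying every registration.
      interface NotifyOneOnly extends DataChangeListener {
          String getIdentity();
      }

      public static void main(String[] args) {
          // A plain listener works unchanged on a single node or a cluster.
          DataChangeListener logAll = change ->
                  System.out.println("changed: " + change.path());

          // A cluster-aware listener opts in to "only notify one of us".
          NotifyOneOnly onlyMe = new NotifyOneOnly() {
              @Override public String getIdentity() { return "member-1/topology-manager"; }
              @Override public void onDataChanged(DataChangeEvent change) {
                  System.out.println(getIdentity() + " handling " + change.path());
              }
          };

          DataChangeEvent event = new DataChangeEvent("/inventory/node/openflow:1");
          logAll.onDataChanged(event);
          onlyMe.onDataChanged(event);
      }
  }
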
* regXboi: +1 to that idea (regXboi, 20:27:16)
* this makes applications cluster aware (tbachman, 20:27:24)
* this also allows code that works on one node to also work on multiple nodes (tbachman, 20:28:06)
* for cluster-aware applications, we block unless we're the leader (tbachman, 20:36:22)
* ghall asks if the model has to be a DAG; the answer is yes (tbachman, 20:59:23)
* ghall asks if there's a way to write code once so that, regardless of the sharding scheme of the leafrefs, the notifications get back to the listener (tbachman, 21:00:18)
* rovarga says we are honoring all explicitly stated references (tbachman, 21:01:24)
* logistics and planning (tbachman, 21:08:26)
* tpantelis asked if we need a sync-up call (tbachman, 21:08:36)
* jmedved agrees we need one (tbachman, 21:08:42)
* discussion on resources: HP, Brocade, and Cisco are all looking to contribute resources (tbachman, 21:09:00)
* plan is to leverage the Monday 8am PST MD-SAL meetings for covering clustering (tbachman, 21:11:20)
* LINK: https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan <= proposed Lithium simultaneous release plan (tbachman, 21:25:40)
* hackers/design meetings are Monday mornings at 8am PST (tbachman, 21:25:49)
* possible resources include Cisco: 3, HP: 1, Noiro: 1/4, Brocade: 2, Ericsson: ?, so 7+? (tbachman, 21:26:05)
* webex meetings (tbachman, 21:26:20)
* IRC channel: the team may set one up (tbachman, 21:26:43)
* will put the design on the wiki (tbachman, 21:26:59)
* possibly review this on the TWS calls (tbachman, 21:28:15)
* lithium requirements (tbachman, 21:29:32)
* hardening and performance is #1 (tbachman, 21:33:37)
* details for hardening and performance: use streaming of NormalizedNode; configurable persistence; don't serialize/stream NormalizedNode when the message is local (tbachman, 21:35:58)
* test bed requirements: 1 setup for 1-node integration tests, 5 5-node clusters for testers, 1 5-node cluster for 1-day longevity tests, 1 5-node cluster for 1-week longevity tests, and 1 5-node cluster for 1-month longevity tests (tbachman, 21:46:29)
* other items: programmatic sharding and team config (tbachman, 21:47:07)
* other items: updates to data change deliveries ("me only", "all listeners") (tbachman, 21:49:25)
* other items: notifications (tbachman, 21:49:32)
* other items: finer-grained sharding (tbachman, 21:50:08)
* other items: data broker for the clustered data store (tbachman, 21:50:16)
* performance numbers for GBP Unified Communications: 240 flows/second, 100k endpoints (tbachman, 21:54:23)
* performance numbers for GBP NFV: 10M endpoints (tbachman, 21:57:36)
* ACTION: alagalah/tbachman to test use case numbers in the data store, and report memory usage (tbachman, 22:02:21)
* ACTION: clustering group to ask the community for performance characteristics they're looking for (tbachman, 22:03:06)
* maybe include reference configurations on the wiki (tbachman, 22:05:49)
* group may schedule some hackathons for clustering (tbachman, 22:06:21)
* other items: enhance the RAFT implementation for the openflowplugin (tbachman, 22:14:03)
* ACTION: moiz and tpantelis will create bugs for known issues (tbachman, 22:18:06)
* ACTION: jmedved to look into hackathons (tbachman, 22:21:11)
* ACTION: alagalah to help set up an IRC channel (tbachman, 22:21:22)
* ACTION: alagalah to work on setting up a TWS call for clustering (tbachman, 22:21:59)
* ACTION: moiz to update the design on the wiki (tbachman, 22:22:52)
* ACTION: jmedved to contact phrobb to set up webex for meetings (tbachman, 22:23:54)

Meeting ended at 22:27:46 UTC.

Action items, by person
-----------------------
* alagalah
  * alagalah/tbachman to test use case numbers in the data store, and report memory usage
  * alagalah to help set up an IRC channel
  * alagalah to work on setting up a TWS call for clustering
* jmedved
  * jmedved to look into hackathons
  * jmedved to contact phrobb to set up webex for meetings
* tbachman
  * alagalah/tbachman to test use case numbers in the data store, and report memory usage
* **UNASSIGNED**
  * clustering group to ask the community for performance characteristics they're looking for
  * moiz and tpantelis will create bugs for known issues
  * moiz to update the design on the wiki

People present (lines said)
---------------------------
* tbachman (210)
* regXboi (23)
* odl_meetbot (17)
* alagalah (7)
* gzhao (2)
* rovarga (2)
* harman_ (2)
* raghu67 (1)
* jmedved (0)

Generated by `MeetBot`_ 0.1.4