16:37:12 #startmeeting odl-clustering
16:37:12 Meeting started Wed Oct 1 16:37:12 2014 UTC. The chair is tbachman. Information about MeetBot at http://ci.openstack.org/meetbot.html.
16:37:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:37:12 The meeting name has been set to 'odl_clustering'
16:37:17 #chair alagalah
16:37:17 Current chairs: alagalah tbachman
16:37:39 #chair rovarga
16:37:39 Warning: Nick not in channel: rovarga
16:37:39 Current chairs: alagalah rovarga tbachman
16:37:53 #topic agenda
16:37:54 #chair jmedved
16:37:54 Current chairs: alagalah jmedved rovarga tbachman
16:38:04 WebEx Meeting Link https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit
16:38:12 #chair raghu67
16:38:12 Current chairs: alagalah jmedved raghu67 rovarga tbachman
16:38:16 raghu67: thx!
16:39:56 #info technical aspects: strengths, weaknesses; where we are going (deployments/requirements, stable helium, lithium, long-term)
16:40:18 #info team-project aspects: coding, integration, testing (application, longevity)
16:40:52 #info alagalah asks why clustering is a feature as opposed to a core component of the controller
16:42:13 #info jmedved says from a historical perspective, the SAL is there for applications and can hide the locality of where the applications are; clustering would be solved somewhere in the SAL at some point later
16:44:21 #info after presentations on what was done with APIC and akka, those seemed to be good concepts which should be adopted, and these could be used inside the MD-SAL
16:44:47 #info so akka provides the messaging for the message bus
16:44:59 others — please provide any corrections as you see fit :)
16:46:22 is there a webex link for this meeting?
16:46:26 #info moiz notes we should probably start with requirements
16:46:36 https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit
16:46:37 [09:38am]
16:46:38 oops
16:46:48 wrong thing
16:47:06 webex link
16:47:07 https://cisco.webex.com/cisco/j.php?MTID=m378bc189d3a937e254208d9d3f46b5d6
16:47:28 harman_: thx!
16:47:44 harman_: thanks
16:47:58 #topic requirements
16:48:23 #info GBP draft scaling requirements: #link https://wiki.opendaylight.org/view/Group_Policy:Scaling
16:48:26 #link https://wiki.opendaylight.org/view/Group_Policy:Scaling
16:48:51 #undo
16:48:51 Removing item from minutes:
16:49:07 #link https://wiki.opendaylight.org/view/Group_Policy:Scaling <= wiki page describing requirements for GBP
16:50:47 #info md-sal is a framework for building applications
16:50:54 #info it has RPC, data, and notifications
16:51:17 #info rovarga points out that data is really data + notifications
16:52:10 #info for clustering, when making an RPC and the instance is on a different controller, that call has to be routed to the appropriate controller
16:52:45 #info the clustering arch maintains a registry of all the service calls
16:53:12 #info for example, instance 1, service foo is available, instance 2, service foo is available
16:53:27 #info gossip is used as an eventually consistent protocol to distribute this registry
16:54:31 #info as a concrete example, if the service is the openflowplugin, and there is a method, add flow
16:54:49 #info where switch 1 is on instance 1, and switch 2 is on instance 2
16:55:13 #info so the registry would show it can provide addflow for switch 1 on instance 1
16:55:41 #info is there a mechanism to ask what services are provided?
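The routed-RPC registry described above (service foo available on instance 1 and instance 2, addflow for switch 1 routed to instance 1) can be pictured as a map from an (RPC name, route identifier) pair to a controller instance. The sketch below is purely illustrative: the class and method names are hypothetical, not the actual MD-SAL RpcRegistry code, and the gossip replication of the table between members is omitted.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the clustered RPC routing table discussed above.
// Key: RPC name plus route identifier (e.g. "addFlow" on "openflow:1");
// value: the controller instance that can service the call. In the real
// system this registry is distributed between members via gossip, which
// is not modelled here.
public class RoutedRpcRegistry {

    public record RouteKey(String rpcName, String routeId) {}

    private final Map<RouteKey, String> routes = new ConcurrentHashMap<>();

    // Announce that instanceName can provide rpcName for routeId.
    public void register(String rpcName, String routeId, String instanceName) {
        routes.put(new RouteKey(rpcName, routeId), instanceName);
    }

    // Look up which instance should receive the call; empty means
    // "no provider known yet" (the registry is only eventually consistent).
    public Optional<String> findProvider(String rpcName, String routeId) {
        return Optional.ofNullable(routes.get(new RouteKey(rpcName, routeId)));
    }

    // Relates to the question raised in the minutes: list what is registered.
    public Set<RouteKey> knownServices() {
        return routes.keySet();
    }

    public static void main(String[] args) {
        RoutedRpcRegistry registry = new RoutedRpcRegistry();
        registry.register("addFlow", "openflow:1", "member-1");
        registry.register("addFlow", "openflow:2", "member-2");
        // A call for switch 1 is routed to member-1; a purely local call
        // would bypass this lookup entirely, per the optimization noted below.
        System.out.println(registry.findProvider("addFlow", "openflow:1")); // Optional[member-1]
    }
}
```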
16:55:47 #info we don’t have that but could provide it
16:56:41 #info there is an optimization where if the RPC is local, it doesn’t hit this path at all (i.e. just makes the local call directly w/o any translation)
16:58:42 #info from an architectural perspective, clustering is just a layer of view
17:03:35 #info when designing an application, you typically use the binding-aware part, but internally, whatever’s stored in the data store is in the normalized format
17:03:51 The binding-aware data broker talks to the DOM data broker, which works on the DOM store
17:04:09 #info The binding-aware data broker talks to the DOM data broker, which works on the DOM store
17:04:29 #info the data store has two implementations: an in-memory store and a clustered data store
17:04:38 #undo
17:04:38 Removing item from minutes:
17:04:55 #info the DOM store has two implementations: an in-memory store and a clustered data store
17:06:29 #info the current implementation uses the instance identifier, which contains the namespace of the module, to implement a strategy to create shards based on the namespace
17:07:29 #info users would write code for their sharding strategy
17:08:10 #info the operational and configuration data stores are already separated
17:11:44 #info folks commented that the operational data could be kept in-memory and the configuration data could be persisted
17:13:32 #info everything stored in the clustered data store is journaled and snapshots are created
17:14:03 #info This allows cluster restarts to have their configuration recreated
17:20:43 #info moiz says that we still have to perform a test of getting the data from disk or getting it from the leader
17:21:38 are others having issues with the webex audio?
17:22:41 tbachman: webex audio quality is pretty iffy
17:23:24 regXboi: ACK — we have a single mic I think
17:23:32 ok
17:23:42 so you get to be very good scribe then :)
17:24:19 :)
17:24:20 lol
17:24:33 some of this is tricky to capture, but will try :)
17:24:47 * tbachman cracks fingers, picks up “quick-pen”
17:25:16 I'm hearing leader and follower and I'm not 100% sure what we are talking about
17:25:56 #info RAFT is being used, which is a consensus algorithm
17:26:10 #info instead of inventing a new algorithm, RAFT was selected
17:26:20 #info moiz presents logs of a leader and a follower
17:26:36 #info there’s a term in the LOG, which is an election term, indicating that at this point, this was the leader
17:27:32 #info description of RAFT algorithm ensues :)
17:27:54 tbachman: is the question how to recover a failed node?
17:28:57 #info regXboi I think they’re explaining how RAFT is implemented, but alagalah is asking if we can create a batched list of transactions, rather than looking at them one by one
17:29:04 #undo
17:29:04 Removing item from minutes:
17:29:04 lol
17:29:14 why are we doing transactions at all in this case?
17:29:33 the recovering node should discover the leader (somehow) and pull the state
17:29:34 regXboi: are you on the webex audio? Can you ask?
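Earlier in this segment the minutes note that shards are currently created per module namespace, derived from the instance identifier, and that users would write code for their own sharding strategy. A hypothetical sketch of what such a pluggable strategy could look like follows; the interface and class names are illustrative only and are not the controller's actual types.

```java
import java.util.List;

// Hypothetical sketch of a pluggable sharding strategy as described in the
// minutes: given the path of the data being written, decide which shard it
// belongs to. The real MD-SAL instance identifier types are not used; the
// path is modelled as a simple list of path elements whose first element
// carries the module namespace.
interface ShardingStrategy {
    String shardFor(List<String> instanceIdentifierPath);
}

// The current behaviour described above: one shard per module namespace,
// taken from the first element of the instance identifier.
class ModuleNamespaceShardingStrategy implements ShardingStrategy {
    @Override
    public String shardFor(List<String> path) {
        if (path.isEmpty()) {
            return "default";
        }
        // e.g. everything under the "inventory" module lands in one shard
        return path.get(0);
    }
}

public class ShardingDemo {
    public static void main(String[] args) {
        ShardingStrategy strategy = new ModuleNamespaceShardingStrategy();
        // A finer-grained strategy (e.g. one shard per switch, as discussed
        // later in the meeting) could implement the same interface instead.
        System.out.println(strategy.shardFor(List.of("inventory", "nodes", "node=openflow:1")));
    }
}
```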
17:29:43 I can proxy if you need it
17:29:44 I'm in another conversation audo
17:29:47 ah
17:29:49 er audio
17:30:31 regXboi: I think they’re talking about looking at the journal transactions to synchronize the state
17:31:01 * tbachman isn’t sure if that helps
17:31:20 I'm hearing asynchronicity, which is good (I hope)
17:32:17 #info moiz notes that configurable persistence is something that will be addressed
17:32:31 #info moiz says that each shard has its own in-memory data store
17:32:56 #info so the clustered data store uses the in-memory data store
17:33:15 so that begs how it is recovered
17:34:03 #info the in-memory data store is the state of the shard
17:35:26 #info the part that takes care of the persistence is a layer around the in-memory data store which ensures it persists and recovers from disk (using akka)
17:36:17 #info To recover, it starts from a snapshot and replays the journal, and if no snapshot is available, it starts from the beginning of the journal
17:36:29 #info snapshots are made every 20k journal entries, and this is globally configurable
17:37:11 #info regXboi asks if there’s a way for snapshotting on a time-scheduled basis
17:37:25 #info moiz says it’s not done today, but akka persistence supports this
17:37:43 #info regXboi says it would be in addition, not an alternative
17:38:06 #info raghu67 notes the journal entries are also persisted, and the snapshot is only there to speed up recovery
17:38:18 #info regXboi says that anything that can be persisted can be corrupted
17:38:36 #info rovarga says a snapshot is a database checkpoint
17:38:57 #info those concerned could do a full data reconciliation of the shard
17:40:08 #info alagalah asks if there’s anything like checksums for journal entries
17:40:18 #info moiz says we don’t currently implement this
17:40:32 #info raghu67 says it’s a LevelDB implementation, so there may be something
17:41:00 #info How do we detect journal / snapshot entry corruption?
17:42:55 #info rovarga says that sharding strategy is not just how you carve up your data, but also reconciliation guarantees
17:43:37 * tbachman takes all comers for corrections :)
17:43:46 #info also how many replicas you keep, how paranoid you are, whether you persist at all, etc.
17:45:15 #info rovarga says with checkpoints you can do some sanity checks, but these are paranoia/performance trade-offs
17:46:24 #info discussion was also around whether, due to variability in transaction size in the journal, SIZE should be a snapshotting criterion, i.e. rather than 20k entries, snapshot when the journal reaches xxxx MB
17:46:47 #info persistence has sub-topics: snapshotting, journaling
17:47:37 #topic replication
17:47:55 replication how?
17:48:12 are we talking about between nodes in a cluster?
17:48:19 are we talking about between clusters?
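Returning to the persistence discussion above (snapshot every 20k journal entries, a possible size-based criterion, and the open question of detecting journal corruption), the following is a small, purely illustrative sketch of such a snapshot policy plus a per-entry CRC32 checksum. It is not the actual akka-persistence or controller code; the class and method names are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative sketch only: a snapshot trigger that fires either on entry
// count (the 20k default mentioned in the minutes) or on accumulated journal
// size (the size-based criterion discussed), plus a CRC32 checksum that could
// be stored with each journal entry to detect corruption on recovery.
public class SnapshotPolicy {

    private final long maxEntries;
    private final long maxBytes;
    private long entriesSinceSnapshot;
    private long bytesSinceSnapshot;

    public SnapshotPolicy(long maxEntries, long maxBytes) {
        this.maxEntries = maxEntries;
        this.maxBytes = maxBytes;
    }

    // Record a journal entry and report whether a snapshot should be taken now.
    public boolean onJournalEntry(byte[] payload) {
        entriesSinceSnapshot++;
        bytesSinceSnapshot += payload.length;
        if (entriesSinceSnapshot >= maxEntries || bytesSinceSnapshot >= maxBytes) {
            entriesSinceSnapshot = 0;
            bytesSinceSnapshot = 0;
            return true; // caller snapshots the in-memory shard state here
        }
        return false;
    }

    // Checksum that could be persisted alongside the entry and re-verified
    // when the journal is replayed during recovery.
    public static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    public static void main(String[] args) {
        SnapshotPolicy policy = new SnapshotPolicy(20_000, 64L * 1024 * 1024);
        byte[] entry = "write /inventory/nodes/node=openflow:1".getBytes(StandardCharsets.UTF_8);
        System.out.println("snapshot now? " + policy.onJournalEntry(entry));
        System.out.println("entry crc32: " + checksum(entry));
    }
}
```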
17:48:26 that's a rather open topic :)
17:49:23 #link https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit <= google doc capturing some elements from the meeting
17:51:06 #info regXboi says that since every shard has a single leader, there is no multi-master scenario
17:51:29 #info moiz says that with the MD-SAL data store, we need to have a single master
17:51:44 #info regXboi says there are some interesting usage scenarios that a single-master can’t touch
17:52:07 #info multi-master wasn’t addressed for clustering for helium, as it’s more complex
17:53:35 #info regXboi says that once you achieve geographic diversity, you don’t care what the sharding strategy is, and multi-master becomes an issue
17:53:51 #info moiz says there are use cases for multi-master, but it’s harder to do
17:54:18 #info moiz asks what are the types of things applications will want to do (e.g. cluster-aware apps?)
17:55:34 #info moiz says that combining multi-master and transactions is questionable
17:55:50 #info we may have to move away from an in-memory data store to support that
17:56:06 #undo
17:56:06 Removing item from minutes:
17:56:43 #info moiz says the current in-memory data store has a single committer, so we may have to move away from this implementation of the in-memory data store to support multi-master
18:07:48 I'm not sure I heard that - can somebody transcribe?
18:08:20 regXboi: sorry — am back
18:09:01 I heard some things that worry me
18:09:14 "careful placing of services" and "careful placing of leaders"?
18:09:21 regXboi: elaborate? (or throw a question online)
18:10:45 #info to improve performance, the shards have to be co-located with the shard leaders
18:11:12 #info this begs the question of needing notifications when shard leaders change
18:13:13 #info rovarga says that tuning requires going to the next layer of writing additional code to optimize
18:14:41 #info as an example, using openflowplugin with switches, the applications may care about where the shard leader is and move the leader if needed
18:18:14 #info each switch in the network can have its own shard, which is colocated with the master instance that manages that switch
18:19:47 #info another approach is to have a shard contain the switches that the instance owns
18:21:15 #info regXboi asks how do you define that an openflow switch is colocated with a controller instance
18:21:40 #info this is logical control colocation, not physical
18:22:35 #info moiz continues with a 3-node replication example
18:22:44 #info the replicas go through an election process
18:22:56 #info as soon as a shard comes up, it
18:22:57 #undo
18:22:57 Removing item from minutes:
18:23:12 #info as soon as a shard comes up, it’s a follower, and waits to be contacted by a leader for some period of time
18:23:31 #info after the timeout it becomes a candidate and seeks votes
18:23:53 #info once the candidate receives a majority of votes, it becomes a leader and sends heartbeats to all the other nodes
18:24:15 #info the number of nodes is defined in configuration, so the node knows how many votes are needed for a majority
18:25:37 #info this means you can’t have a 3-node cluster with only one node coming up and have it become the leader
18:25:50 #info all the transaction requests are forwarded to the leader
18:26:14 #info An example case is provided with a single leader and two followers
18:27:14 #info When a commit happens, the first thing the leader does is write to the journal
18:27:39 #info at the same time, replicas are sent to the followers
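The election behaviour summarized above (a replica starts as a follower, times out into a candidate, and needs a majority of the configured cluster size to become leader, after which it sends heartbeats) can be sketched roughly as below. This is an illustrative toy, not the controller's actual RAFT implementation: message exchange, log replication, and heartbeat scheduling are all omitted.

```java
// Illustrative sketch of the RAFT election states described in the minutes.
public class RaftElectionSketch {

    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private final int clusterSize;     // known from configuration, per the minutes
    private Role role = Role.FOLLOWER; // every shard replica starts as a follower
    private long currentTerm = 0;
    private int votesReceived = 0;

    RaftElectionSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    // Called when the election timeout expires without hearing from a leader.
    void onElectionTimeout() {
        role = Role.CANDIDATE;
        currentTerm++;            // the "term" visible in the leader/follower logs
        votesReceived = 1;        // votes for itself, then requests votes from peers
    }

    // Called for each vote granted by a peer for the current term.
    void onVoteGranted() {
        if (role != Role.CANDIDATE) {
            return;
        }
        votesReceived++;
        // Majority of the configured cluster size, so a single node of a
        // 3-node cluster can never elect itself, as noted in the minutes.
        if (votesReceived > clusterSize / 2) {
            role = Role.LEADER;   // real code would now start sending heartbeats
        }
    }

    public static void main(String[] args) {
        RaftElectionSketch node = new RaftElectionSketch(3);
        node.onElectionTimeout();   // no leader heard: become candidate for term 1
        node.onVoteGranted();       // one peer grants its vote: 2 of 3 is a majority
        System.out.println(node.role + " in term " + node.currentTerm);  // LEADER in term 1
    }
}
```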
18:28:00 #info the followers then write them to their local journals
18:28:45 #info each follower reports back after writing to its journal, and the leader waits for the commit index to indicate that the followers have completed
18:29:05 #info at this point, the leader’s in-memory data store can be updated
18:30:27 #info the current in-memory data store requires that a current transaction is completed before another transaction can be submitted (only one transaction in can-commit and pre-commit states at a time)
18:32:21 #info rovarga says that within a shard, you want serializable consistency
18:33:35 #info each follower creates a new local transaction to commit the replica to its local in-memory data store
18:34:26 #info well... causal consistency is probably good enough, but IMDS does serializable simply because extracting/analyzing potential parallelism has a comparable performance cost to just doing the actual work
18:34:41 rovarga: thx! :)
18:35:54 #info you need to have an event when becoming a leader to be able to catch up
18:45:28 #info there is no consistency guarantee across shard boundaries
18:46:41 #info If you want notifications from two different sub-trees, they would have to be in the same shard
18:47:11 #info within a shard, the data change notifications provide a serialized view of the changes
18:52:46 #info ghall voices concern about application writers having to know this level of complexity, and wonders if this can be managed using layering
18:55:03 tbachman: I need to wander to another meeting
18:55:09 will check the minutes later
18:55:27 regXboi: ack
18:55:33 will try to capture as best I can
18:56:02 #info raghu67 notes that we can support this, but each layer becomes domain specific
18:56:23 #info the group is trying to determine the notifications, etc. required of just the data store itself
18:56:42 #info and layers can be placed above this that can simplify things for application developers
18:57:46 #info jmedved says that remote notifications might be needed
18:59:46 * tbachman notes that capturing some of these concepts is a bit challenging :)
19:02:25 #info raghu67 says we could use subscription, and deliver the notifications based on where it was registered
19:03:19 all — we’ll be taking a break
19:03:30 will be back in 45 minutes
19:03:39 about 12:45pm PST
19:03:51 correction, 12:50
19:04:58 #info we’re missing a notification when a follower becomes a leader (bug/add)
20:08:20 we’re back folks
20:09:22 #topic notifications
20:09:53 #info moiz asks when we register ourselves to the consumer, do we need to identify ourselves
20:10:20 #info and whether this is an API enhancement
20:10:37 #info rovarga says you can just do QName, and that says give all notifications
20:10:59 #info you can also have something more flexible, at which point the question becomes are we going to declare it or just define a filter
20:11:37 #info moiz asks how you would get notifications only when they’re local, versus for all notifications
20:12:48 #info registerChangeListener(scope, identifier, listener) is what we have currently
20:13:17 #info do we enhance this API, to be (scope, only me, identifier, listener)?
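To make the API question just raised concrete, here is a rough sketch of the current registration shape and the proposed "only me" variant. The type names are simplified stand-ins for illustration only, not the real MD-SAL broker interfaces; the discussion that follows proposes an alternative based on marker interfaces.

```java
// Illustrative sketch of the listener registration shapes being debated in
// the minutes; names are hypothetical stand-ins, not the actual MD-SAL API.
public class ListenerApiSketch {

    enum Scope { BASE, ONE, SUBTREE }

    interface DataChangeListener {
        void onDataChanged(String changedPath);
    }

    // What exists today, per the minutes: registerChangeListener(scope, identifier, listener).
    interface CurrentBroker {
        AutoCloseable registerChangeListener(Scope scope, String identifier, DataChangeListener listener);
    }

    // The enhancement being asked about: an extra "only me" flag so a
    // listener can say it wants only locally generated changes rather than
    // changes from anywhere in the cluster.
    interface EnhancedBroker {
        AutoCloseable registerChangeListener(Scope scope, boolean onlyMe, String identifier,
                DataChangeListener listener);
    }

    public static void main(String[] args) {
        // A trivial listener; in the real data store the callback would carry
        // the before/after data, not just a path string.
        DataChangeListener listener = path -> System.out.println("changed: " + path);
        listener.onDataChanged("/inventory/nodes/node=openflow:1");
    }
}
```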
20:14:08 #info ttkacik says you could have two data brokers: local, and super shard
20:15:23 #info the case that we’re trying to support with this is where the listener could be anywhere in the cluster
20:26:18 #info rovarga proposes an interface DataChangeListener that has an onDataChanged() method, and having other interfaces that extend this, such as a NotifyOneOnly interface that adds a getIdentity() method
20:27:16 #info regXboi +1 to that idea
20:27:24 #info this makes applications cluster-aware
20:28:06 #info this also allows code that works on one node to also work on multiple nodes
20:36:22 #info for cluster-aware applications, we block unless we’re the leader
20:59:23 #info ghall asks if the model has to be a DAG. Answer is yes
21:00:18 #info ghall asks if there’s a way to write code once, and regardless of the sharding scheme of the leafrefs, the notifications will get back to the listener
21:01:24 #info rovarga says we are honoring all explicitly stated references
21:08:26 #topic logistics and planning
21:08:36 #info tpantelis asked if we need a sync-up call
21:08:42 #info jmedved agrees we need one
21:09:00 #info discussion on resources — HP, Brocade, Cisco all looking to contribute resources
21:11:20 #info plan is to leverage the Monday 8am PST MD-SAL meetings for covering clustering
21:18:14 #link https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan Lithium simultaneous release plan
21:24:22 #info possible resources include Cisco:3, HP: 1, Noiro:1/4, Brocade: 2, Ericsson: ?, so 7+? total
21:24:47 tbachman: that's the proposed release plan
21:24:53 #info hackers/design meetings are Monday morning 8am PST
21:24:56 regXboi: ACK
21:25:05 we have to start somewhere :)
21:25:17 #undo
21:25:17 Removing item from minutes:
21:25:19 #undo
21:25:19 Removing item from minutes:
21:25:22 #undo
21:25:22 Removing item from minutes:
21:25:40 #link https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan Proposed Lithium simultaneous release plan
21:25:49 #info hackers/design meetings are Monday morning 8am PST
21:26:05 #info possible resources include Cisco:3, HP: 1, Noiro:1/4, Brocade: 2, Ericsson: ?, so 7+?
21:26:20 #info webex meetings
21:26:43 #info IRC channel — team may set one up
21:26:59 #info will put design on the wiki
21:28:15 #info Possibly review this on the TWS calls
21:29:32 #topic lithium requirements
21:33:37 #info hardening and performance is #1
21:35:58 #info details for hardening and performance: use streaming of NormalizedNode; configurable persistence; don’t serialize/stream NormalizedNode when message is local
21:46:29 #info test bed requirements: 1 setup for 1-node integration tests, 5 5-node clusters for testers, 1 5-node cluster for 1-day longevity tests, 1 5-node cluster for 1-week longevity tests, and 1 5-node cluster for 1-month longevity tests
21:47:07 #info other items: programmatic sharding and team config
21:48:34 #info other items: notifications
21:48:41 #undo
21:48:41 Removing item from minutes:
21:49:25 #info other items: updates to data change deliveries (“me only”, “all listeners”)
21:49:32 #info other items: notifications
21:50:08 #info other items: finer grained sharding
21:50:16 #info other items: data broker for clustered data store
21:54:23 #info performance numbers: for GBP Unified Communications, 240 flows/second, 100k endpoints,
21:57:36 #info performance numbers for GBP NFV: 10M endpoints
22:02:21 #action alagalah/tbachman to test use case numbers in data store, and report memory usage
22:03:06 #action clustering group to ask community for performance characteristics they’re looking for
22:05:49 #info maybe include reference configurations on the wiki
22:06:21 #info group may schedule some hackathons for clustering
22:14:03 #info other items: enhance RAFT implementation for openflowplugin
22:18:06 #action moiz and tpantelis will create bugs for known issues
22:21:11 #action jmedved to look into hackathons
22:21:22 #action alagalah to help set up IRC channel
22:21:59 #action alagalah to work on setting up TWS call for clustering
22:22:52 #action moiz to update design on wiki
22:23:54 #action jmedved to contact phrobb to set up webex for meetings
22:27:46 #endmeeting