16:37:12 <tbachman> #startmeeting odl-clustering
16:37:12 <odl_meetbot> Meeting started Wed Oct 1 16:37:12 2014 UTC. The chair is tbachman. Information about MeetBot at http://ci.openstack.org/meetbot.html.
16:37:12 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:37:12 <odl_meetbot> The meeting name has been set to 'odl_clustering'
16:37:17 <tbachman> #chair alagalah
16:37:17 <odl_meetbot> Current chairs: alagalah tbachman
16:37:39 <alagalah> #chair rovarga
16:37:39 <odl_meetbot> Warning: Nick not in channel: rovarga
16:37:39 <odl_meetbot> Current chairs: alagalah rovarga tbachman
16:37:53 <tbachman> #topic agenda
16:37:54 <alagalah> #chair jmedved
16:37:54 <odl_meetbot> Current chairs: alagalah jmedved rovarga tbachman
16:38:04 <raghu67> WebEx Meeting Link https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit
16:38:12 <alagalah> #chair raghu67
16:38:12 <odl_meetbot> Current chairs: alagalah jmedved raghu67 rovarga tbachman
16:38:16 <tbachman> raghu67: thx!
16:39:56 <tbachman> #info technical aspects: strengths, weaknesses; where we are going (deployments/requirements, stable helium, lithium, long-term)
16:40:18 <tbachman> #info team-project aspects: coding, integration, testing (application, longevity)
16:40:52 <tbachman> #info alagalah asks why clustering is a feature as opposed to a core component of the controller
16:42:13 <tbachman> #info jmedved says from a historical perspective, the SAL is there for applications and can hide the locality of where the applications are; clustering would be solved somewhere in the SAL at some later point
16:44:21 <tbachman> #info after presentations on what was done with APIC and akka, those seemed to be good concepts which should be adopted, and these could be used inside the MD-SAL
16:44:47 <tbachman> #info so akka provides the messaging for the message bus
16:44:59 <tbachman> others — please provide any corrections as you see fit :)
16:46:22 <gzhao> is there a webex link for this meeting?
16:46:26 <tbachman> #info moiz notes we should probably start with requirements
16:46:36 <tbachman> https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit
16:46:37 <tbachman> [09:38am]
16:46:38 <tbachman> oops
16:46:48 <tbachman> wrong thing
16:47:06 <harman_> webex link
16:47:07 <harman_> https://cisco.webex.com/cisco/j.php?MTID=m378bc189d3a937e254208d9d3f46b5d6
16:47:28 <tbachman> harman_: thx!
16:47:44 <gzhao> harman_: thanks
16:47:58 <tbachman> #topic requirements
16:48:23 <alagalah> #info GBP draft scaling requirements: #link https://wiki.opendaylight.org/view/Group_Policy:Scaling
16:48:26 <alagalah> #link https://wiki.opendaylight.org/view/Group_Policy:Scaling
16:48:51 <tbachman> #undo
16:48:51 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Link object at 0x258c390>
16:49:07 <tbachman> #link https://wiki.opendaylight.org/view/Group_Policy:Scaling <= wiki page describing requirements for GBP
16:50:47 <tbachman> #info md-sal is a framework for building applications
16:50:54 <tbachman> #info it has RPC, data, and notifications
16:51:17 <tbachman> #info rovarga points out that data is really data + notifications
16:52:10 <tbachman> #info for clustering, when making an RPC and the instance is on a different controller, that call has to be routed to the appropriate controller
16:52:45 <tbachman> #info the clustering arch maintains a registry of all the service calls
16:53:12 <tbachman> #info for example, instance 1, service foo is available; instance 2, service foo is available
16:53:27 <tbachman> #info gossip is used as an eventually consistent protocol to distribute this registry
16:54:31 <tbachman> #info as a concrete example, if the service is the openflowplugin, and there is a method, add flow
16:54:49 <tbachman> #info where switch 1 is on instance 1, and switch 2 is on instance 2
16:55:13 <tbachman> #info so the registry would show it can provide addflow for switch 1 on instance 1
16:55:41 <tbachman> #info is there a mechanism to ask what services are provided?
16:55:47 <tbachman> #info we don’t have that but could provide it
16:56:41 <tbachman> #info there is an optimization where if the RPC is local, it doesn’t hit this path at all (i.e. just makes the local call directly w/o any translation)
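To illustrate the routed-RPC registry described in the minutes above, here is a minimal, self-contained Java sketch. It is not the actual MD-SAL API; the class and method names (RpcRegistry, routeFor, isLocal) are illustrative assumptions. It models a registry mapping a (service, route) pair, e.g. ("addFlow", switch 1), to the cluster instance that can serve it, plus the local-call fast path mentioned at 16:56:41, with gossip-learned entries merged in an eventually consistent way.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative model of a routed-RPC registry; not the real MD-SAL classes. */
public class RpcRegistry {
    /** Key: "rpcName@route", e.g. "addFlow@openflow:1"; value: owning cluster instance. */
    private final Map<String, String> routes = new ConcurrentHashMap<>();
    private final String localInstance;

    public RpcRegistry(String localInstance) {
        this.localInstance = localInstance;
    }

    /** Called when a service announces it can serve an RPC for a given route. */
    public void register(String rpcName, String route, String instance) {
        routes.put(rpcName + "@" + route, instance);
    }

    /** Entries learned from other nodes via the gossip protocol would be merged here. */
    public void mergeFromGossip(Map<String, String> remoteEntries) {
        routes.putAll(remoteEntries); // eventually consistent: last writer wins
    }

    /** Looks up which instance should execute the RPC; empty if unknown. */
    public Optional<String> routeFor(String rpcName, String route) {
        return Optional.ofNullable(routes.get(rpcName + "@" + route));
    }

    /** Local optimization: invoke directly if the registered instance is this node. */
    public boolean isLocal(String rpcName, String route) {
        return localInstance.equals(routes.get(rpcName + "@" + route));
    }

    public static void main(String[] args) {
        RpcRegistry registry = new RpcRegistry("instance-1");
        registry.register("addFlow", "openflow:1", "instance-1");
        registry.register("addFlow", "openflow:2", "instance-2");
        System.out.println(registry.routeFor("addFlow", "openflow:2")); // Optional[instance-2]
        System.out.println(registry.isLocal("addFlow", "openflow:1"));  // true -> call directly
    }
}
```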
16:58:42 <tbachman> #info from an architectural perspective, clustering is just a layer of view
17:03:35 <tbachman> #info when designing an application, you typically use the binding-aware part, but internally, whatever’s stored in the data store is in the normalized format
17:03:51 <tbachman> The binding-aware data broker talks to the DOM data broker, which works on the DOM store
17:04:09 <tbachman> #info The binding-aware data broker talks to the DOM data broker, which works on the DOM store
17:04:29 <tbachman> #info the data store has two implementations: an in-memory store and a clustered data store
17:04:38 <tbachman> #undo
17:04:38 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x282fb10>
17:04:55 <tbachman> #info the DOM store has two implementations: an in-memory store and a clustered data store
17:06:29 <tbachman> #info the current implementation uses the instance identifier, which contains the namespace of the module, to implement a strategy that creates shards based on the namespace
17:07:29 <tbachman> #info users would write code for their sharding strategy
17:08:10 <tbachman> #info the operational and configuration data stores are already separated
17:11:44 <tbachman> #info folks commented that the operational data could be kept in-memory and the configuration data could be persisted
17:13:32 <tbachman> #info everything stored in the clustered data store is journaled and snapshots are created
17:14:03 <tbachman> #info This allows cluster restarts to have their configuration recreated
17:20:43 <tbachman> #info moiz says that we still have to perform a test of getting the data from disk versus getting it from the leader
17:21:38 <regXboi> are others having issues with the webex audio?
17:22:41 <regXboi> tbachman: webex audio quality is pretty iffy
17:23:24 <tbachman> regXboi: ACK — we have a single mic I think
17:23:32 <regXboi> ok
17:23:42 <regXboi> so you get to be very good scribe then :)
17:24:19 <tbachman> :)
17:24:20 <tbachman> lol
17:24:33 <tbachman> some of this is tricky to capture, but will try :)
17:24:47 * tbachman cracks fingers, picks up “quick-pen”
17:25:16 <regXboi> I'm hearing leader and follower and I'm not 100% sure what we are talking about
17:25:56 <tbachman> #info RAFT is being used, which is a consensus algorithm
17:26:10 <tbachman> #info instead of inventing a new algorithm, RAFT was selected
17:26:20 <tbachman> #info moiz presents logs of a leader and a follower
17:26:36 <tbachman> #info there’s a term in the log, which is an election term, indicating that at this point, this was the leader
17:27:32 <tbachman> #info description of RAFT algorithm ensues :)
17:27:54 <regXboi> tbachman: is the question how to recover a failed node?
17:28:57 <tbachman> #info regXboi I think they’re explaining how RAFT is implemented, but alagalah is asking if we can create a batched list of transactions, rather than looking at them one by one
17:29:04 <tbachman> #undo
17:29:04 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2593ad0>
17:29:04 <tbachman> lol
17:29:14 <regXboi> why are we doing transactions at all in this case?
17:29:33 <regXboi> the recovering node should discover the leader (somehow) and pull the state
17:29:34 <tbachman> regXboi: are you on the webex audio? Can you ask?
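A minimal sketch of the namespace-based sharding strategy described at 17:06:29 and 17:07:29, assuming a simplified string form of the instance identifier. The ShardStrategy name mirrors the concept being discussed, but the signature and the ModuleShardStrategy class below are illustrative, not the actual controller interface.

```java
import java.util.Map;

/** Illustrative namespace-based sharding; not the actual controller interface. */
public class ShardStrategyExample {

    /** Picks a shard name for a data-tree path (simplified instance identifier). */
    interface ShardStrategy {
        String findShard(String instanceIdentifier);
    }

    /** Maps the module namespace in the first path element to a shard, else "default". */
    static class ModuleShardStrategy implements ShardStrategy {
        private final Map<String, String> moduleToShard;

        ModuleShardStrategy(Map<String, String> moduleToShard) {
            this.moduleToShard = moduleToShard;
        }

        @Override
        public String findShard(String instanceIdentifier) {
            // e.g. "/opendaylight-inventory:nodes/node/openflow:1" -> module "opendaylight-inventory"
            String first = instanceIdentifier.replaceFirst("^/", "").split("/")[0];
            String module = first.contains(":") ? first.substring(0, first.indexOf(':')) : first;
            return moduleToShard.getOrDefault(module, "default");
        }
    }

    public static void main(String[] args) {
        ShardStrategy strategy = new ModuleShardStrategy(
                Map.of("opendaylight-inventory", "inventory", "network-topology", "topology"));
        System.out.println(strategy.findShard("/opendaylight-inventory:nodes/node/openflow:1")); // inventory
        System.out.println(strategy.findShard("/some-other-module:data"));                        // default
    }
}
```

Users wanting a different partitioning (per switch, per subtree, etc.) would supply their own implementation of the strategy, which is the "users would write code for their sharding strategy" point above.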
17:29:43 <tbachman> I can proxy if you need it
17:29:44 <regXboi> I'm in another conversation audo
17:29:47 <tbachman> ah
17:29:49 <regXboi> er audio
17:30:31 <tbachman> regXboi: I think they’re talking about looking at the journal transactions to synchronize the state
17:31:01 * tbachman isn’t sure if that helps
17:31:20 <regXboi> I'm hearing asynchronicity, which is good (I hope)
17:32:17 <tbachman> #info moiz notes that configurable persistence is something that will be addressed
17:32:31 <tbachman> #info moiz says that each shard has its own in-memory data store
17:32:56 <tbachman> #info so the clustered data store uses the in-memory data store
17:33:15 <regXboi> so that begs how it is recovered
17:34:03 <tbachman> #info the in-memory data store is the state of the shard
17:35:26 <tbachman> #info the part that takes care of the persistence is a layer around the in-memory data store which ensures it persists and recovers from disk (using akka)
17:36:17 <tbachman> #info To recover, it starts from a snapshot and replays the journal, and if no snapshot is available, it starts from the beginning of the journal
17:36:29 <tbachman> #info snapshots are made every 20k journal entries, and this threshold is globally configurable
17:37:11 <tbachman> #info regXboi asks if there’s a way for snapshotting on a time-scheduled basis
17:37:25 <tbachman> #info moiz says it’s not done today, but akka persistence supports this
17:37:43 <tbachman> #info regXboi says it would be in addition, not an alternative
17:38:06 <tbachman> #info raghu67 notes the journal entries are also persisted, and the snapshot is only there to speed up recovery
17:38:18 <tbachman> #info regXboi says that anything that can be persisted can be corrupted
17:38:36 <tbachman> #info rovarga says a snapshot is a database checkpoint
17:38:57 <tbachman> #info those concerned could do a full data reconciliation of the shard
17:40:08 <tbachman> #info alagalah asks if there’s anything like checksums for journal entries
17:40:18 <tbachman> #info moiz says we don’t currently implement this
17:40:32 <tbachman> #info raghu67 says it’s a LevelDB implementation, so there may be something
17:41:00 <alagalah> #info How do we detect journal / snapshot entry corruption?
17:42:55 <tbachman> #info rovarga says that sharding strategy is not just how you carve up your data, but also reconciliation guarantees
17:43:37 * tbachman takes all comers for corrections :)
17:43:46 <rovarga> #info also how many replicas you keep, how paranoid you are, whether you persist at all, etc.
17:45:15 <tbachman> #info rovarga says with checkpoints you can do some sanity checking, but these are paranoia/performance trade-offs
17:46:24 <alagalah> #info discussion was also around, due to variability in transaction size in the journal, whether SIZE should be a snapshotting criterion, i.e. rather than 20k entries, if it reaches xxxx MB, snapshot.
17:46:47 <tbachman> #info persistence has sub-topics: snapshotting, journaling
17:47:37 <tbachman> #topic replication
17:47:55 <regXboi> replication how?
17:48:12 <regXboi> are we talking about between nodes in a cluster?
17:48:19 <regXboi> are we talking about between clusters?
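The snapshot and corruption discussion above (the 20k-entry threshold, the proposed size- and time-based criteria, and checksums on journal entries) could be modeled roughly as in the sketch below. This is only the policy logic, with assumed names such as SnapshotPolicy; it is not the akka-persistence configuration the implementation actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch of snapshot-trigger criteria and a per-entry checksum for corruption detection. */
public class SnapshotPolicy {
    private final long maxEntries;     // e.g. 20_000 journal entries (globally configurable today)
    private final long maxBytes;       // proposed: also snapshot when the journal grows past a size limit
    private final long maxIntervalMs;  // proposed: also snapshot on a time schedule

    private long entriesSinceSnapshot;
    private long bytesSinceSnapshot;
    private long lastSnapshotMs = System.currentTimeMillis();

    public SnapshotPolicy(long maxEntries, long maxBytes, long maxIntervalMs) {
        this.maxEntries = maxEntries;
        this.maxBytes = maxBytes;
        this.maxIntervalMs = maxIntervalMs;
    }

    /** Records a journal entry; returns true when a snapshot should be taken. */
    public boolean onJournalEntry(byte[] payload) {
        entriesSinceSnapshot++;
        bytesSinceSnapshot += payload.length;
        long age = System.currentTimeMillis() - lastSnapshotMs;
        return entriesSinceSnapshot >= maxEntries
                || bytesSinceSnapshot >= maxBytes
                || age >= maxIntervalMs;
    }

    /** Resets the counters after a snapshot completes. */
    public void onSnapshotTaken() {
        entriesSinceSnapshot = 0;
        bytesSinceSnapshot = 0;
        lastSnapshotMs = System.currentTimeMillis();
    }

    /** CRC32 checksum stored with each entry, so corruption can be detected on replay. */
    public static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    public static void main(String[] args) {
        SnapshotPolicy policy = new SnapshotPolicy(20_000, 64 * 1024 * 1024, 3_600_000);
        byte[] entry = "flow-add openflow:1".getBytes(StandardCharsets.UTF_8);
        System.out.println("snapshot now? " + policy.onJournalEntry(entry));
        System.out.println("entry crc32: " + Long.toHexString(checksum(entry)));
    }
}
```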
17:48:26 <regXboi> that's a rather open topic :)
17:49:23 <tbachman> #link https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit <= google doc capturing some elements from the meeting
17:51:06 <tbachman> #info regXboi says that since every shard has a single leader, there is no multi-master scenario
17:51:29 <tbachman> #info moiz says that with the MD-SAL data store, we need to have a single master
17:51:44 <tbachman> #info regXboi says there are some interesting usage scenarios that a single master can’t touch
17:52:07 <tbachman> #info multi-master wasn’t addressed for clustering for helium, as it’s more complex
17:53:35 <tbachman> #info regXboi says that once you achieve geographic diversity, you don’t care what the sharding strategy is, and multi-master becomes an issue
17:53:51 <tbachman> #info moiz says there are use cases for multi-master, but it’s harder to do
17:54:18 <tbachman> #info moiz asks what are the types of things applications will want to do (e.g. cluster-aware apps?)
17:55:34 <tbachman> #info moiz says that multi-master combined with transactions is questionable
17:55:50 <tbachman> #info we may have to move away from an in-memory data store to support that
17:56:06 <tbachman> #undo
17:56:06 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2844390>
17:56:43 <tbachman> #info moiz says the current in-memory data store has a single committer, so we may have to move away from this implementation of the in-memory data store to support multi-master
18:07:48 <regXboi> I'm not sure I heard that - can somebody transcribe?
18:08:20 <tbachman> regXboi: sorry — am back
18:09:01 <regXboi> I heard some things that worry me
18:09:14 <regXboi> "careful placing of services" and "careful placing of leaders"?
18:09:21 <tbachman> regXboi: elaborate? (or throw a question online)
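As a rough illustration of the single-committer point moiz makes at 17:56:43, the sketch below funnels every commit through one single-threaded executor, so commits are strictly serialized on one node. Supporting multi-master would mean replacing this serialization point, which is why a different in-memory store implementation was suggested. All names here (SingleCommitterStore, etc.) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical model of an in-memory store with a single committer thread. */
public class SingleCommitterStore {
    private final Map<String, String> data = new HashMap<>();
    // All commits are applied on this one thread, so they are strictly serialized.
    private final ExecutorService committer = Executors.newSingleThreadExecutor();

    /** Submits a write; commits are applied one at a time in submission order. */
    public CompletableFuture<Void> commit(Map<String, String> writes) {
        return CompletableFuture.runAsync(() -> data.putAll(writes), committer);
    }

    /** Reads go through the same thread, giving a serialized view of the data. */
    public CompletableFuture<String> read(String key) {
        return CompletableFuture.supplyAsync(() -> data.get(key), committer);
    }

    public void shutdown() {
        committer.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SingleCommitterStore store = new SingleCommitterStore();
        store.commit(Map.of("flow/1", "drop")).join();
        System.out.println(store.read("flow/1").get()); // "drop"
        store.shutdown();
    }
}
```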
18:10:45 <tbachman> #info to improve performance, the applications have to be colocated with the shard leaders
18:11:12 <tbachman> #info this begs the question of needing notifications when shard leaders change
18:13:13 <tbachman> #info rovarga says that tuning requires going to the next layer of writing additional code to optimize
18:14:41 <tbachman> #info as an example, using the openflowplugin with switches, the applications may care about where the shard leader is and move the leader if needed
18:18:14 <tbachman> #info each switch in the network can have its own shard, which is colocated with the master instance that manages that switch
18:19:47 <tbachman> #info another approach is to have a shard contain the switches that the instance owns
18:21:15 <tbachman> #info regXboi asks how do you define that an openflow switch is colocated with a controller instance
18:21:40 <tbachman> #info this is logical control colocation, not physical
18:22:35 <tbachman> #info moiz continues with a 3-node replication example
18:22:44 <tbachman> #info the replicas go through an election process
18:22:56 <tbachman> #info as soon as a shard comes up, it
18:22:57 <tbachman> #undo
18:22:57 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2845610>
18:23:12 <tbachman> #info as soon as a shard comes up, it’s a follower, and waits to be contacted by a leader for some period of time
18:23:31 <tbachman> #info after the timeout it becomes a candidate and seeks votes
18:23:53 <tbachman> #info once the candidate receives a majority of votes, it becomes the leader and sends heartbeats to all the other nodes
18:24:15 <tbachman> #info the number of nodes is defined in configuration, so the node knows how many votes are needed for a majority
18:25:37 <tbachman> #info this means you can’t have a 3-node cluster with only one node coming up and have it become the leader
18:25:50 <tbachman> #info all the transaction requests are forwarded to the leader
18:26:14 <tbachman> #info An example case is provided with a single leader and two followers
18:27:14 <tbachman> #info When a commit happens, the first thing the leader does is write to the journal
18:27:39 <tbachman> #info at the same time, replicas are sent to the followers
18:28:00 <tbachman> #info the followers then write them to their local journals
18:28:45 <tbachman> #info each follower reports back after writing to its journal, and the leader waits for the commit index to indicate that the followers have completed
18:29:05 <tbachman> #info at this point, the leader’s in-memory data store can be updated
18:30:27 <tbachman> #info the current in-memory data store requires that the current transaction is completed before another transaction can be submitted (only one transaction in the can-commit and pre-commit states at a time)
18:32:21 <tbachman> #info rovarga says that within a shard, you want serializable consistency
18:33:35 <tbachman> #info each follower creates a new local transaction to commit the replica to its local in-memory data store
18:34:26 <rovarga> #info well... causal consistency is probably good enough, but IMDS does serializable simply because extracting/analyzing potential parallelism has a performance cost comparable to just doing the actual work
18:34:41 <tbachman> rovarga: thx! :)
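A compressed sketch of the leader-election and commit-quorum behavior walked through above: a replica starts as a follower, becomes a candidate after the election timeout, needs a majority of the configured cluster size to become leader, and a leader only applies an entry once a majority of replicas have journaled it. This mirrors textbook RAFT rather than the controller's actual implementation, and the class and method names are illustrative.

```java
import java.util.Map;

/** Illustrative RAFT-style roles and quorum math; not the controller's implementation. */
public class RaftShardSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private final int clusterSize;        // configured, so the majority is known up front
    private Role role = Role.FOLLOWER;    // every replica starts as a follower
    private long term = 0;
    private int votes = 0;

    RaftShardSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    int majority() {
        return clusterSize / 2 + 1;       // 2 of 3, 3 of 5, ...
    }

    /** No heartbeat arrived within the election timeout: become a candidate and seek votes. */
    void onElectionTimeout() {
        role = Role.CANDIDATE;
        term++;
        votes = 1;                        // vote for self, then request votes from peers
    }

    /** A peer granted its vote; with a majority the candidate becomes leader. */
    void onVoteGranted() {
        if (role == Role.CANDIDATE && ++votes >= majority()) {
            role = Role.LEADER;           // start sending heartbeats / replicating entries
        }
    }

    /** Leader-side check: an entry can be applied once a majority have journaled it. */
    boolean canCommit(long entryIndex, Map<String, Long> journaledIndexByReplica) {
        long acks = journaledIndexByReplica.values().stream()
                .filter(idx -> idx >= entryIndex).count();
        return acks + 1 >= majority();    // +1 for the leader's own journal write
    }

    public static void main(String[] args) {
        RaftShardSketch shard = new RaftShardSketch(3);
        shard.onElectionTimeout();
        System.out.println(shard.role);   // CANDIDATE: 1 vote is below the majority of 2, so no lone leader
        shard.onVoteGranted();
        System.out.println(shard.role);   // LEADER once a second node votes
        System.out.println(shard.canCommit(5, Map.of("follower-1", 5L, "follower-2", 3L))); // true
    }
}
```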
18:35:54 <tbachman> #info you need to have an event when becoming a leader to be able to catch up
18:45:28 <tbachman> #info there is no consistency guarantee across shard boundaries
18:46:41 <tbachman> #info If you want notifications from two different sub-trees, they would have to be in the same shard
18:47:11 <tbachman> #info within a shard, the data change notifications provide a serialized view of the changes
18:52:46 <tbachman> #info ghall voices concern about application writers having to know this level of complexity, and wonders if this can be managed using layering
18:55:03 <regXboi> tbachman: I need to wander to another meeting
18:55:09 <regXboi> will check the minutes later
18:55:27 <tbachman> regXboi: ack
18:55:33 <tbachman> will try to capture as best I can
18:56:02 <tbachman> #info raghu67 notes that we can support this, but each layer becomes domain specific
18:56:23 <tbachman> #info the group is trying to determine the notifications, etc. required of just the data store itself
18:56:42 <tbachman> #info and layers can be placed above this that can simplify things for application developers
18:57:46 <tbachman> #info jmedved says that remote notifications might be needed
18:59:46 * tbachman notes that capturing some of these concepts is a bit challenging :)
19:02:25 <tbachman> #info raghu67 says we could use subscription, and deliver the notifications based on where the subscription was registered
19:03:19 <tbachman> all — we’ll be taking a break
19:03:30 <tbachman> will be back in 45 minutes
19:03:39 <tbachman> about 12:45pm PST
19:03:51 <tbachman> correction, 12:50
19:04:58 <tbachman> #info we’re missing a notification when a follower becomes a leader (bug/add)
20:08:20 <tbachman> we’re back folks
20:09:22 <tbachman> #topic notifications
20:09:53 <tbachman> #info moiz asks when we register ourselves to the consumer, do we need to identify ourselves
20:10:20 <tbachman> #info and whether this is an API enhancement
20:10:37 <tbachman> #info rovarga says you can just do QName, and that says give all notifications
20:10:59 <tbachman> #info you can also have something more flexible, at which point the question becomes are we going to declare it or just define a filter
20:11:37 <tbachman> #info moiz asks how you would like to get notifications only when they’re local, and then for all notifications
20:12:48 <tbachman> #info registerChangeListener(scope, identifier, listener) is what we have currently
20:13:17 <tbachman> #info do we enhance this API to be (scope, only me, identifier, listener)?
20:14:08 <tbachman> #info ttkacik says you could have two data brokers: local, and super shard
20:15:23 <tbachman> #info the case that we’re trying to support with this is where the listener could be anywhere in the cluster
20:26:18 <tbachman> #info rovarga proposes an interface DataChangeListener that has an onDataChanged() method, and have other interfaces that extend this, such as a NotifyOneOnly interface that adds a getIdentity() method
20:27:16 <regXboi> #info regXboi +1 to that idea
20:27:24 <tbachman> #info this makes applications cluster-aware
20:28:06 <tbachman> #info this also allows code that works on one node to also work on multiple nodes
20:36:22 <tbachman> #info for cluster-aware applications, we block unless we’re the leader
20:59:23 <tbachman> #info ghall asks if the model has to be a DAG. Answer is yes
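The listener-registration change discussed above (20:12:48 through 20:28:06) could look roughly like the following Java sketch. The DataChangeListener/onDataChanged() and NotifyOneOnly/getIdentity() names come straight from the discussion, but the surrounding types (DataChangeEvent, the registration helper, and its parameters) are placeholders, not the actual MD-SAL signatures.

```java
/** Sketch of the listener shape proposed in the meeting; surrounding types are placeholders. */
public class ListenerApiSketch {

    /** Placeholder for the change payload delivered to listeners. */
    interface DataChangeEvent { }

    /** Base contract: every listener just reacts to data changes, cluster-unaware. */
    interface DataChangeListener {
        void onDataChanged(DataChangeEvent change);
    }

    /**
     * Extension for cluster-aware listeners: of all the nodes where this logical listener
     * is registered, only the one matching the identity should be notified ("me only").
     */
    interface NotifyOneOnly extends DataChangeListener {
        String getIdentity();
    }

    /** Placeholder registration call, mirroring registerChangeListener(scope, identifier, listener). */
    static void registerChangeListener(String scope, String identifier, DataChangeListener listener) {
        boolean oneOnly = listener instanceof NotifyOneOnly;
        System.out.println("registered " + identifier + " in scope " + scope
                + (oneOnly ? " (deliver to one listener only: "
                             + ((NotifyOneOnly) listener).getIdentity() + ")"
                           : " (deliver to all listeners)"));
    }

    public static void main(String[] args) {
        // Plain listener: notified wherever it is registered; works unchanged on one or many nodes.
        registerChangeListener("SUBTREE", "/inventory", change -> { });

        // Cluster-aware listener: only the instance matching its identity gets the callback.
        registerChangeListener("SUBTREE", "/inventory", new NotifyOneOnly() {
            @Override public String getIdentity() { return "instance-1"; }
            @Override public void onDataChanged(DataChangeEvent change) { }
        });
    }
}
```

The appeal of extending the base interface, rather than widening registerChangeListener() itself, is that existing single-node code keeps working unchanged while cluster-aware applications opt in explicitly, which matches the points captured at 20:27:24 and 20:28:06.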
21:00:18 <tbachman> #info ghall asks if there’s a way to write code once, and regardless of the sharding scheme of the leafrefs, the notifications will get back to the listener
21:01:24 <tbachman> #info rovarga says we are honoring all explicitly stated references
21:08:26 <tbachman> #topic logistics and planning
21:08:36 <tbachman> #info tpantelis asked if we need a sync-up call
21:08:42 <tbachman> #info jmedved agrees we need one
21:09:00 <tbachman> #info discussion on resources — HP, Brocade, Cisco all looking to contribute resources
21:11:20 <tbachman> #info plan is to leverage the Monday 8am PST MD-SAL meetings for covering clustering
21:18:14 <tbachman> #link https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan Lithium simultaneous release plan
21:24:22 <tbachman> #info possible resources include Cisco: 3, HP: 1, Noiro: 1/4, Brocade: 2, Ericsson: ?, so 7+? total
21:24:47 <regXboi> tbachman: that's the proposed release plan
21:24:53 <tbachman> #info hackers/design meetings are Monday morning 8am PST
21:24:56 <tbachman> regXboi: ACK
21:25:05 <tbachman> we have to start somewhere :)
21:25:17 <tbachman> #undo
21:25:17 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2848750>
21:25:19 <tbachman> #undo
21:25:19 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2848590>
21:25:22 <tbachman> #undo
21:25:22 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Link object at 0x2848850>
21:25:40 <tbachman> #link https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan Proposed Lithium simultaneous release plan
21:25:49 <tbachman> #info hackers/design meetings are Monday morning 8am PST
21:26:05 <tbachman> #info possible resources include Cisco: 3, HP: 1, Noiro: 1/4, Brocade: 2, Ericsson: ?, so 7+?
21:26:20 <tbachman> #info webex meetings
21:26:43 <tbachman> #info IRC channel — team may set one up
21:26:59 <tbachman> #info will put design on the wiki
21:28:15 <tbachman> #info Possibly review this on the TWS calls
21:29:32 <tbachman> #topic lithium requirements
21:33:37 <tbachman> #info hardening and performance is #1
21:35:58 <tbachman> #info details for hardening and performance: use streaming of NormalizedNode; configurable persistence; don’t serialize/stream NormalizedNode when the message is local
21:46:29 <tbachman> #info test bed requirements: 1 setup for single-node integration tests, 5 5-node clusters for testers, 1 5-node cluster for 1-day longevity tests, 1 5-node cluster for 1-week longevity tests, and 1 5-node cluster for 1-month longevity tests
21:47:07 <tbachman> #info other items: programmatic sharding and team config
21:48:34 <tbachman> #info other items: notifications
21:48:41 <tbachman> #undo
21:48:41 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2838c90>
21:49:25 <tbachman> #info other items: updates to data change deliveries (“me only”, “all listeners”)
21:49:32 <tbachman> #info other items: notifications
21:50:08 <tbachman> #info other items: finer-grained sharding
21:50:16 <tbachman> #info other items: data broker for clustered data store
21:54:23 <tbachman> #info performance numbers for GBP Unified Communications: 240 flows/second, 100k endpoints
21:57:36 <tbachman> #info performance numbers for GBP NFV: 10M endpoints
22:02:21 <tbachman> #action alagalah/tbachman to test use case numbers in data store, and report memory usage
22:03:06 <tbachman> #action clustering group to ask community for performance characteristics they’re looking for
22:05:49 <tbachman> #info maybe include reference configurations on the wiki
22:06:21 <tbachman> #info group may schedule some hackathons for clustering
22:14:03 <tbachman> #info other items: enhance RAFT implementation for openflowplugin
22:18:06 <tbachman> #action moiz and tpantelis will create bugs for known issues
22:21:11 <tbachman> #action jmedved to look into hackathons
22:21:22 <tbachman> #action alagalah to help set up IRC channel
22:21:59 <tbachman> #action alagalah to work on setting up TWS call for clustering
22:22:52 <tbachman> #action moiz to update design on wiki
22:23:54 <tbachman> #action jmedved to contact phrobb to set up webex for meetings
22:27:46 <tbachman> #endmeeting