16:37:12 <tbachman> #startmeeting odl-clustering
16:37:12 <odl_meetbot> Meeting started Wed Oct 1 16:37:12 2014 UTC. The chair is tbachman. Information about MeetBot at http://ci.openstack.org/meetbot.html.
16:37:12 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:37:12 <odl_meetbot> The meeting name has been set to 'odl_clustering'
16:37:17 <tbachman> #chair alagalah
16:37:17 <odl_meetbot> Current chairs: alagalah tbachman
16:37:39 <alagalah> #chair rovarga
16:37:39 <odl_meetbot> Warning: Nick not in channel: rovarga
16:37:39 <odl_meetbot> Current chairs: alagalah rovarga tbachman
16:37:53 <tbachman> #topic agenda
16:37:54 <alagalah> #chair jmedved
16:37:54 <odl_meetbot> Current chairs: alagalah jmedved rovarga tbachman
16:38:04 <raghu67> WebEx Meeting Link https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit
16:38:12 <alagalah> #chair raghu67
16:38:12 <odl_meetbot> Current chairs: alagalah jmedved raghu67 rovarga tbachman
16:38:16 <tbachman> raghu67: thx!
16:39:56 <tbachman> #info technical aspects: strengths, weaknesses; where we are going (deployments/requirements, stable helium, lithium, long-term)
16:40:18 <tbachman> #info team-project aspects: coding, integration, testing (application, longevity)
16:40:52 <tbachman> #info alagalah asks why clustering is a feature as opposed to a core component of the controller
16:42:13 <tbachman> #info jmedved says from a historical perspective, the SAL is there for applications and can hide the locality of where the applications are; clustering would be solved somewhere in the SAL at some later point
16:44:21 <tbachman> #info after presentations on what was done with APIC and akka, those seemed to be good concepts which should be adopted, and these could be used inside the MD-SAL
16:44:47 <tbachman> #info so akka provides the messaging for the message bus
16:44:59 <tbachman> others — please provide any corrections as you see fit :)
16:46:22 <gzhao> is there a webex link for this meeting?
16:46:26 <tbachman> #info moiz notes we should probably start with requirements
16:46:36 <tbachman> https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit
16:46:37 <tbachman> [09:38am]
16:46:38 <tbachman> oops
16:46:48 <tbachman> wrong thing
16:47:06 <harman_> webex link
16:47:07 <harman_> https://cisco.webex.com/cisco/j.php?MTID=m378bc189d3a937e254208d9d3f46b5d6
16:47:28 <tbachman> harman_: thx!
16:47:44 <gzhao> harman_: thanks
16:47:58 <tbachman> #topic requirements
16:48:23 <alagalah> #info GBP draft scaling requirements: #link https://wiki.opendaylight.org/view/Group_Policy:Scaling
16:48:26 <alagalah> #link https://wiki.opendaylight.org/view/Group_Policy:Scaling
16:48:51 <tbachman> #undo
16:48:51 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Link object at 0x258c390>
16:49:07 <tbachman> #link https://wiki.opendaylight.org/view/Group_Policy:Scaling <= wiki page describing requirements for GBP
16:50:47 <tbachman> #info md-sal is a framework for building applications
16:50:54 <tbachman> #info it has RPC, data, and notifications
16:51:17 <tbachman> #info rovarga points out that data is really data + notifications
16:52:10 <tbachman> #info for clustering, when making an RPC and the instance is on a different controller, that call has to be routed to the appropriate controller
16:52:45 <tbachman> #info the clustering arch maintains a registry of all the service calls
16:53:12 <tbachman> #info for example, instance 1, service foo is available; instance 2, service foo is available
16:53:27 <tbachman> #info gossip is used as an eventually consistent protocol to distribute this registry
16:54:31 <tbachman> #info as a concrete example, if the service is the openflowplugin, and there is a method, add flow
16:54:49 <tbachman> #info where switch 1 is on instance 1, and switch 2 is on instance 2
16:55:13 <tbachman> #info so the registry would show it can provide addflow for switch 1 on instance 1
16:55:41 <tbachman> #info is there a mechanism to ask what services are provided?
16:55:47 <tbachman> #info we don’t have that but could provide it
16:56:41 <tbachman> #info there is an optimization where if the RPC is local, it doesn’t hit this path at all (i.e. just makes the local call directly w/o any translation)
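To illustrate the routed-RPC registry described in the minutes above, here is a minimal, self-contained Java sketch. It is not the actual MD-SAL API; the class and method names (RpcRegistry, routeFor, isLocal) are illustrative assumptions. It models a registry mapping a (service, route) pair, e.g. ("addFlow", switch 1), to the cluster instance that can serve it, plus the local-call fast path mentioned at 16:56:41, with gossip-learned entries merged in an eventually consistent way.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative model of a routed-RPC registry; not the real MD-SAL classes. */
public class RpcRegistry {
    /** Key: "rpcName@route", e.g. "addFlow@openflow:1"; value: owning cluster instance. */
    private final Map<String, String> routes = new ConcurrentHashMap<>();
    private final String localInstance;

    public RpcRegistry(String localInstance) {
        this.localInstance = localInstance;
    }

    /** Called when a service announces it can serve an RPC for a given route. */
    public void register(String rpcName, String route, String instance) {
        routes.put(rpcName + "@" + route, instance);
    }

    /** Entries learned from other nodes via the gossip protocol would be merged here. */
    public void mergeFromGossip(Map<String, String> remoteEntries) {
        routes.putAll(remoteEntries); // eventually consistent: last writer wins
    }

    /** Looks up which instance should execute the RPC; empty if unknown. */
    public Optional<String> routeFor(String rpcName, String route) {
        return Optional.ofNullable(routes.get(rpcName + "@" + route));
    }

    /** Local optimization: invoke directly if the registered instance is this node. */
    public boolean isLocal(String rpcName, String route) {
        return localInstance.equals(routes.get(rpcName + "@" + route));
    }

    public static void main(String[] args) {
        RpcRegistry registry = new RpcRegistry("instance-1");
        registry.register("addFlow", "openflow:1", "instance-1");
        registry.register("addFlow", "openflow:2", "instance-2");
        System.out.println(registry.routeFor("addFlow", "openflow:2")); // Optional[instance-2]
        System.out.println(registry.isLocal("addFlow", "openflow:1"));  // true -> call directly
    }
}
```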
16:58:42 <tbachman> #info from an architectural perspective, clustering is just a layer of view
17:03:35 <tbachman> #info when designing an application, you typically use the binding-aware part, but internally, whatever’s stored in the data store is in the normalized format
17:03:51 <tbachman> The binding-aware data broker talks to the DOM data broker, which works on the DOM store
17:04:09 <tbachman> #info The binding-aware data broker talks to the DOM data broker, which works on the DOM store
17:04:29 <tbachman> #info the data store has two implementations: an in-memory store and a clustered data store
17:04:38 <tbachman> #undo
17:04:38 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x282fb10>
17:04:55 <tbachman> #info the DOM store has two implementations: an in-memory store and a clustered data store
17:06:29 <tbachman> #info the current implementation uses the instance identifier, which contains the namespace of the module, to implement a strategy that creates shards based on the namespace
17:07:29 <tbachman> #info users would write code for their sharding strategy
17:08:10 <tbachman> #info the operational and configuration data stores are already separated
17:11:44 <tbachman> #info folks commented that the operational data could be kept in-memory and the configuration data could be persisted
17:13:32 <tbachman> #info everything stored in the clustered data store is journaled and snapshots are created
17:14:03 <tbachman> #info This allows cluster restarts to have their configuration recreated
17:20:43 <tbachman> #info moiz says that we still have to perform a test of getting the data from disk versus getting it from the leader
17:21:38 <regXboi> are others having issues with the webex audio?
17:22:41 <regXboi> tbachman: webex audio quality is pretty iffy
17:23:24 <tbachman> regXboi: ACK — we have a single mic I think
17:23:32 <regXboi> ok
17:23:42 <regXboi> so you get to be very good scribe then :)
17:24:19 <tbachman> :)
17:24:20 <tbachman> lol
17:24:33 <tbachman> some of this is tricky to capture, but will try :)
17:24:47 * tbachman cracks fingers, picks up “quick-pen”
17:25:16 <regXboi> I'm hearing leader and follower and I'm not 100% sure what we are talking about
17:25:56 <tbachman> #info RAFT is being used, which is a consensus algorithm
17:26:10 <tbachman> #info instead of inventing a new algorithm, RAFT was selected
17:26:20 <tbachman> #info moiz presents logs of a leader and a follower
17:26:36 <tbachman> #info there’s a term in the log, which is an election term, indicating that at this point, this was the leader
17:27:32 <tbachman> #info description of RAFT algorithm ensues :)
17:27:54 <regXboi> tbachman: is the question how to recover a failed node?
17:28:57 <tbachman> #info regXboi I think they’re explaining how RAFT is implemented, but alagalah is asking if we can create a batched list of transactions, rather than looking at them one by one
17:29:04 <tbachman> #undo
17:29:04 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2593ad0>
17:29:04 <tbachman> lol
17:29:14 <regXboi> why are we doing transactions at all in this case?
17:29:33 <regXboi> the recovering node should discover the leader (somehow) and pull the state
17:29:34 <tbachman> regXboi: are you on the webex audio? Can you ask?
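A minimal sketch of the namespace-based sharding strategy described at 17:06:29 and 17:07:29, assuming a simplified string form of the instance identifier. The ShardStrategy name mirrors the concept being discussed, but the signature and the ModuleShardStrategy class below are illustrative, not the actual controller interface.

```java
import java.util.Map;

/** Illustrative namespace-based sharding; not the actual controller interface. */
public class ShardStrategyExample {

    /** Picks a shard name for a data-tree path (simplified instance identifier). */
    interface ShardStrategy {
        String findShard(String instanceIdentifier);
    }

    /** Maps the module namespace in the first path element to a shard, else "default". */
    static class ModuleShardStrategy implements ShardStrategy {
        private final Map<String, String> moduleToShard;

        ModuleShardStrategy(Map<String, String> moduleToShard) {
            this.moduleToShard = moduleToShard;
        }

        @Override
        public String findShard(String instanceIdentifier) {
            // e.g. "/opendaylight-inventory:nodes/node/openflow:1" -> module "opendaylight-inventory"
            String first = instanceIdentifier.replaceFirst("^/", "").split("/")[0];
            String module = first.contains(":") ? first.substring(0, first.indexOf(':')) : first;
            return moduleToShard.getOrDefault(module, "default");
        }
    }

    public static void main(String[] args) {
        ShardStrategy strategy = new ModuleShardStrategy(
                Map.of("opendaylight-inventory", "inventory", "network-topology", "topology"));
        System.out.println(strategy.findShard("/opendaylight-inventory:nodes/node/openflow:1")); // inventory
        System.out.println(strategy.findShard("/some-other-module:data"));                        // default
    }
}
```

Users wanting a different partitioning (per switch, per subtree, etc.) would supply their own implementation of the strategy, which is the "users would write code for their sharding strategy" point above.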
17:29:43 <tbachman> I can proxy if you need it
17:29:44 <regXboi> I'm in another conversation audo
17:29:47 <tbachman> ah
17:29:49 <regXboi> er audio
17:30:31 <tbachman> regXboi: I think they’re talking about looking at the journal transactions to synchronize the state
17:31:01 * tbachman isn’t sure if that helps
17:31:20 <regXboi> I'm hearing asynchronicity, which is good (I hope)
17:32:17 <tbachman> #info moiz notes that configurable persistence is something that will be addressed
17:32:31 <tbachman> #info moiz says that each shard has its own in-memory data store
17:32:56 <tbachman> #info so the clustered data store uses the in-memory data store
17:33:15 <regXboi> so that begs how it is recovered
17:34:03 <tbachman> #info the in-memory data store is the state of the shard
17:35:26 <tbachman> #info the part that takes care of the persistence is a layer around the in-memory data store which ensures it persists and recovers from disk (using akka)
17:36:17 <tbachman> #info To recover, it starts from a snapshot and replays the journal, and if no snapshot is available, it starts from the beginning of the journal
17:36:29 <tbachman> #info snapshots are made every 20k journal entries, and this threshold is globally configurable
17:37:11 <tbachman> #info regXboi asks if there’s a way for snapshotting on a time-scheduled basis
17:37:25 <tbachman> #info moiz says it’s not done today, but akka persistence supports this
17:37:43 <tbachman> #info regXboi says it would be in addition, not an alternative
17:38:06 <tbachman> #info raghu67 notes the journal entries are also persisted, and the snapshot is only there to speed up recovery
17:38:18 <tbachman> #info regXboi says that anything that can be persisted can be corrupted
17:38:36 <tbachman> #info rovarga says a snapshot is a database checkpoint
17:38:57 <tbachman> #info those concerned could do a full data reconciliation of the shard
17:40:08 <tbachman> #info alagalah asks if there’s anything like checksums for journal entries
17:40:18 <tbachman> #info moiz says we don’t currently implement this
17:40:32 <tbachman> #info raghu67 says it’s a LevelDB implementation, so there may be something
17:41:00 <alagalah> #info How do we detect journal / snapshot entry corruption?
17:42:55 <tbachman> #info rovarga says that sharding strategy is not just how you carve up your data, but also reconciliation guarantees
17:43:37 * tbachman takes all comers for corrections :)
17:43:46 <rovarga> #info also how many replicas you keep, how paranoid you are, whether you persist at all, etc.
17:45:15 <tbachman> #info rovarga says with checkpoints you can do some sanity checking, but these are paranoia/performance trade-offs
17:46:24 <alagalah> #info discussion was also around, due to variability in transaction size in the journal, whether SIZE should be a snapshotting criterion, i.e. rather than 20k entries, if it reaches xxxx MB, snapshot.
17:46:47 <tbachman> #info persistence has sub-topics: snapshotting, journaling
17:47:37 <tbachman> #topic replication
17:47:55 <regXboi> replication how?
17:48:12 <regXboi> are we talking about between nodes in a cluster?
17:48:19 <regXboi> are we talking about between clusters?
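The snapshot and corruption discussion above (the 20k-entry threshold, the proposed size- and time-based criteria, and checksums on journal entries) could be modeled roughly as in the sketch below. This is only the policy logic, with assumed names such as SnapshotPolicy; it is not the akka-persistence configuration the implementation actually uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch of snapshot-trigger criteria and a per-entry checksum for corruption detection. */
public class SnapshotPolicy {
    private final long maxEntries;     // e.g. 20_000 journal entries (globally configurable today)
    private final long maxBytes;       // proposed: also snapshot when the journal grows past a size limit
    private final long maxIntervalMs;  // proposed: also snapshot on a time schedule

    private long entriesSinceSnapshot;
    private long bytesSinceSnapshot;
    private long lastSnapshotMs = System.currentTimeMillis();

    public SnapshotPolicy(long maxEntries, long maxBytes, long maxIntervalMs) {
        this.maxEntries = maxEntries;
        this.maxBytes = maxBytes;
        this.maxIntervalMs = maxIntervalMs;
    }

    /** Records a journal entry; returns true when a snapshot should be taken. */
    public boolean onJournalEntry(byte[] payload) {
        entriesSinceSnapshot++;
        bytesSinceSnapshot += payload.length;
        long age = System.currentTimeMillis() - lastSnapshotMs;
        return entriesSinceSnapshot >= maxEntries
                || bytesSinceSnapshot >= maxBytes
                || age >= maxIntervalMs;
    }

    /** Resets the counters after a snapshot completes. */
    public void onSnapshotTaken() {
        entriesSinceSnapshot = 0;
        bytesSinceSnapshot = 0;
        lastSnapshotMs = System.currentTimeMillis();
    }

    /** CRC32 checksum stored with each entry, so corruption can be detected on replay. */
    public static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    public static void main(String[] args) {
        SnapshotPolicy policy = new SnapshotPolicy(20_000, 64 * 1024 * 1024, 3_600_000);
        byte[] entry = "flow-add openflow:1".getBytes(StandardCharsets.UTF_8);
        System.out.println("snapshot now? " + policy.onJournalEntry(entry));
        System.out.println("entry crc32: " + Long.toHexString(checksum(entry)));
    }
}
```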
17:48:26 <regXboi> that's a rather open topic :)
17:49:23 <tbachman> #link https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit <= google doc capturing some elements from the meeting
17:51:06 <tbachman> #info regXboi says that since every shard has a single leader, there is no multi-master scenario
17:51:29 <tbachman> #info moiz says that with the MD-SAL data store, we need to have a single master
17:51:44 <tbachman> #info regXboi says there are some interesting usage scenarios that a single master can’t touch
17:52:07 <tbachman> #info multi-master wasn’t addressed for clustering for helium, as it’s more complex
17:53:35 <tbachman> #info regXboi says that once you achieve geographic diversity, you don’t care what the sharding strategy is, and multi-master becomes an issue
17:53:51 <tbachman> #info moiz says there are use cases for multi-master, but it’s harder to do
17:54:18 <tbachman> #info moiz asks what are the types of things applications will want to do (e.g. cluster-aware apps?)
17:55:34 <tbachman> #info moiz says that multi-master combined with transactions is questionable
17:55:50 <tbachman> #info we may have to move away from an in-memory data store to support that
17:56:06 <tbachman> #undo
17:56:06 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2844390>
17:56:43 <tbachman> #info moiz says the current in-memory data store has a single committer, so we may have to move away from this implementation of the in-memory data store to support multi-master
18:07:48 <regXboi> I'm not sure I heard that - can somebody transcribe?
18:08:20 <tbachman> regXboi: sorry — am back
18:09:01 <regXboi> I heard some things that worry me
18:09:14 <regXboi> "careful placing of services" and "careful placing of leaders"?
18:09:21 <tbachman> regXboi: elaborate? (or throw a question online)
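As a rough illustration of the single-committer point moiz makes at 17:56:43, the sketch below funnels every commit through one single-threaded executor, so commits are strictly serialized on one node. Supporting multi-master would mean replacing this serialization point, which is why a different in-memory store implementation was suggested. All names here (SingleCommitterStore, etc.) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical model of an in-memory store with a single committer thread. */
public class SingleCommitterStore {
    private final Map<String, String> data = new HashMap<>();
    // All commits are applied on this one thread, so they are strictly serialized.
    private final ExecutorService committer = Executors.newSingleThreadExecutor();

    /** Submits a write; commits are applied one at a time in submission order. */
    public CompletableFuture<Void> commit(Map<String, String> writes) {
        return CompletableFuture.runAsync(() -> data.putAll(writes), committer);
    }

    /** Reads go through the same thread, giving a serialized view of the data. */
    public CompletableFuture<String> read(String key) {
        return CompletableFuture.supplyAsync(() -> data.get(key), committer);
    }

    public void shutdown() {
        committer.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SingleCommitterStore store = new SingleCommitterStore();
        store.commit(Map.of("flow/1", "drop")).join();
        System.out.println(store.read("flow/1").get()); // "drop"
        store.shutdown();
    }
}
```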
18:10:45 <tbachman> #info to improve performance, the applications have to be colocated with the shard leaders
18:11:12 <tbachman> #info this begs the question of needing notifications when shard leaders change
18:13:13 <tbachman> #info rovarga says that tuning requires going to the next layer of writing additional code to optimize
18:14:41 <tbachman> #info as an example, using the openflowplugin with switches, the applications may care about where the shard leader is and move the leader if needed
18:18:14 <tbachman> #info each switch in the network can have its own shard, which is colocated with the master instance that manages that switch
18:19:47 <tbachman> #info another approach is to have a shard contain the switches that the instance owns
18:21:15 <tbachman> #info regXboi asks how do you define that an openflow switch is colocated with a controller instance
18:21:40 <tbachman> #info this is logical control colocation, not physical
18:22:35 <tbachman> #info moiz continues with a 3-node replication example
18:22:44 <tbachman> #info the replicas go through an election process
18:22:56 <tbachman> #info as soon as a shard comes up, it
18:22:57 <tbachman> #undo
18:22:57 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2845610>
18:23:12 <tbachman> #info as soon as a shard comes up, it’s a follower, and waits to be contacted by a leader for some period of time
18:23:31 <tbachman> #info after the timeout it becomes a candidate and seeks votes
18:23:53 <tbachman> #info once the candidate receives a majority of votes, it becomes the leader and sends heartbeats to all the other nodes
18:24:15 <tbachman> #info the number of nodes is defined in configuration, so the node knows how many votes are needed for a majority
18:25:37 <tbachman> #info this means you can’t have a 3-node cluster with only one node coming up and have it become the leader
18:25:50 <tbachman> #info all the transaction requests are forwarded to the leader
18:26:14 <tbachman> #info An example case is provided with a single leader and two followers
18:27:14 <tbachman> #info When a commit happens, the first thing the leader does is write to the journal
18:27:39 <tbachman> #info at the same time, replicas are sent to the followers
18:28:00 <tbachman> #info the followers then write them to their local journals
18:28:45 <tbachman> #info each follower reports back after writing to its journal, and the leader waits for the commit index to indicate that the followers have completed
18:29:05 <tbachman> #info at this point, the leader’s in-memory data store can be updated
18:30:27 <tbachman> #info the current in-memory data store requires that the current transaction is completed before another transaction can be submitted (only one transaction in the can-commit and pre-commit states at a time)
18:32:21 <tbachman> #info rovarga says that within a shard, you want serializable consistency
18:33:35 <tbachman> #info each follower creates a new local transaction to commit the replica to its local in-memory data store
18:34:26 <rovarga> #info well... causal consistency is probably good enough, but IMDS does serializable simply because extracting/analyzing potential parallelism has a performance cost comparable to just doing the actual work
18:34:41 <tbachman> rovarga: thx! :)
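A compressed sketch of the leader-election and commit-quorum behavior walked through above: a replica starts as a follower, becomes a candidate after the election timeout, needs a majority of the configured cluster size to become leader, and a leader only applies an entry once a majority of replicas have journaled it. This mirrors textbook RAFT rather than the controller's actual implementation, and the class and method names are illustrative.

```java
import java.util.Map;

/** Illustrative RAFT-style roles and quorum math; not the controller's implementation. */
public class RaftShardSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private final int clusterSize;        // configured, so the majority is known up front
    private Role role = Role.FOLLOWER;    // every replica starts as a follower
    private long term = 0;
    private int votes = 0;

    RaftShardSketch(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    int majority() {
        return clusterSize / 2 + 1;       // 2 of 3, 3 of 5, ...
    }

    /** No heartbeat arrived within the election timeout: become a candidate and seek votes. */
    void onElectionTimeout() {
        role = Role.CANDIDATE;
        term++;
        votes = 1;                        // vote for self, then request votes from peers
    }

    /** A peer granted its vote; with a majority the candidate becomes leader. */
    void onVoteGranted() {
        if (role == Role.CANDIDATE && ++votes >= majority()) {
            role = Role.LEADER;           // start sending heartbeats / replicating entries
        }
    }

    /** Leader-side check: an entry can be applied once a majority have journaled it. */
    boolean canCommit(long entryIndex, Map<String, Long> journaledIndexByReplica) {
        long acks = journaledIndexByReplica.values().stream()
                .filter(idx -> idx >= entryIndex).count();
        return acks + 1 >= majority();    // +1 for the leader's own journal write
    }

    public static void main(String[] args) {
        RaftShardSketch shard = new RaftShardSketch(3);
        shard.onElectionTimeout();
        System.out.println(shard.role);   // CANDIDATE: 1 vote is below the majority of 2, so no lone leader
        shard.onVoteGranted();
        System.out.println(shard.role);   // LEADER once a second node votes
        System.out.println(shard.canCommit(5, Map.of("follower-1", 5L, "follower-2", 3L))); // true
    }
}
```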
18:35:54 <tbachman> #info you need to have an event when becoming a leader to be able to catch up
18:45:28 <tbachman> #info there is no consistency guarantee across shard boundaries
18:46:41 <tbachman> #info If you want notifications from two different sub-trees, they would have to be in the same shard
18:47:11 <tbachman> #info within a shard, the data change notifications provide a serialized view of the changes
18:52:46 <tbachman> #info ghall voices concern about application writers having to know this level of complexity, and wonders if this can be managed using layering
18:55:03 <regXboi> tbachman: I need to wander to another meeting
18:55:09 <regXboi> will check the minutes later
18:55:27 <tbachman> regXboi: ack
18:55:33 <tbachman> will try to capture as best I can
18:56:02 <tbachman> #info raghu67 notes that we can support this, but each layer becomes domain specific
18:56:23 <tbachman> #info the group is trying to determine the notifications, etc. required of just the data store itself
18:56:42 <tbachman> #info and layers can be placed above this that can simplify things for application developers
18:57:46 <tbachman> #info jmedved says that remote notifications might be needed
18:59:46 * tbachman notes that capturing some of these concepts is a bit challenging :)
19:02:25 <tbachman> #info raghu67 says we could use subscription, and deliver the notifications based on where the subscription was registered
19:03:19 <tbachman> all — we’ll be taking a break
19:03:30 <tbachman> will be back in 45 minutes
19:03:39 <tbachman> about 12:45pm PST
19:03:51 <tbachman> correction, 12:50
19:04:58 <tbachman> #info we’re missing a notification when a follower becomes a leader (bug/add)
20:08:20 <tbachman> we’re back folks
20:09:22 <tbachman> #topic notifications
20:09:53 <tbachman> #info moiz asks when we register ourselves to the consumer, do we need to identify ourselves
20:10:20 <tbachman> #info and whether this is an API enhancement
20:10:37 <tbachman> #info rovarga says you can just do QName, and that says give all notifications
20:10:59 <tbachman> #info you can also have something more flexible, at which point the question becomes are we going to declare it or just define a filter
20:11:37 <tbachman> #info moiz asks how you would like to get notifications only when they’re local, and then for all notifications
20:12:48 <tbachman> #info registerChangeListener(scope, identifier, listener) is what we have currently
20:13:17 <tbachman> #info do we enhance this API to be (scope, only me, identifier, listener)?
20:14:08 <tbachman> #info ttkacik says you could have two data brokers: local, and super shard
20:15:23 <tbachman> #info the case that we’re trying to support with this is where the listener could be anywhere in the cluster
20:26:18 <tbachman> #info rovarga proposes an interface DataChangeListener that has an onDataChanged() method, and have other interfaces that extend this, such as a NotifyOneOnly interface that adds a getIdentity() method
20:27:16 <regXboi> #info regXboi +1 to that idea
20:27:24 <tbachman> #info this makes applications cluster-aware
20:28:06 <tbachman> #info this also allows code that works on one node to also work on multiple nodes
20:36:22 <tbachman> #info for cluster-aware applications, we block unless we’re the leader
20:59:23 <tbachman> #info ghall asks if the model has to be a DAG. Answer is yes
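The listener-registration change discussed above (20:12:48 through 20:28:06) could look roughly like the following Java sketch. The DataChangeListener/onDataChanged() and NotifyOneOnly/getIdentity() names come straight from the discussion, but the surrounding types (DataChangeEvent, the registration helper, and its parameters) are placeholders, not the actual MD-SAL signatures.

```java
/** Sketch of the listener shape proposed in the meeting; surrounding types are placeholders. */
public class ListenerApiSketch {

    /** Placeholder for the change payload delivered to listeners. */
    interface DataChangeEvent { }

    /** Base contract: every listener just reacts to data changes, cluster-unaware. */
    interface DataChangeListener {
        void onDataChanged(DataChangeEvent change);
    }

    /**
     * Extension for cluster-aware listeners: of all the nodes where this logical listener
     * is registered, only the one matching the identity should be notified ("me only").
     */
    interface NotifyOneOnly extends DataChangeListener {
        String getIdentity();
    }

    /** Placeholder registration call, mirroring registerChangeListener(scope, identifier, listener). */
    static void registerChangeListener(String scope, String identifier, DataChangeListener listener) {
        boolean oneOnly = listener instanceof NotifyOneOnly;
        System.out.println("registered " + identifier + " in scope " + scope
                + (oneOnly ? " (deliver to one listener only: "
                             + ((NotifyOneOnly) listener).getIdentity() + ")"
                           : " (deliver to all listeners)"));
    }

    public static void main(String[] args) {
        // Plain listener: notified wherever it is registered; works unchanged on one or many nodes.
        registerChangeListener("SUBTREE", "/inventory", change -> { });

        // Cluster-aware listener: only the instance matching its identity gets the callback.
        registerChangeListener("SUBTREE", "/inventory", new NotifyOneOnly() {
            @Override public String getIdentity() { return "instance-1"; }
            @Override public void onDataChanged(DataChangeEvent change) { }
        });
    }
}
```

The appeal of extending the base interface, rather than widening registerChangeListener() itself, is that existing single-node code keeps working unchanged while cluster-aware applications opt in explicitly, which matches the points captured at 20:27:24 and 20:28:06.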
21:00:18 <tbachman> #info ghall asks if there’s a way to write code once, and regardless of the sharding scheme of the leafrefs, the notifications will get back to the listener
21:01:24 <tbachman> #info rovarga says we are honoring all explicitly stated references
21:08:26 <tbachman> #topic logistics and planning
21:08:36 <tbachman> #info tpantelis asked if we need a sync-up call
21:08:42 <tbachman> #info jmedved agrees we need one
21:09:00 <tbachman> #info discussion on resources — HP, Brocade, Cisco all looking to contribute resources
21:11:20 <tbachman> #info plan is to leverage the Monday 8am PST MD-SAL meetings for covering clustering
21:18:14 <tbachman> #link https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan Lithium simultaneous release plan
21:24:22 <tbachman> #info possible resources include Cisco: 3, HP: 1, Noiro: 1/4, Brocade: 2, Ericsson: ?, so 7+? total
21:24:47 <regXboi> tbachman: that's the proposed release plan
21:24:53 <tbachman> #info hackers/design meetings are Monday morning 8am PST
21:24:56 <tbachman> regXboi: ACK
21:25:05 <tbachman> we have to start somewhere :)
21:25:17 <tbachman> #undo
21:25:17 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2848750>
21:25:19 <tbachman> #undo
21:25:19 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2848590>
21:25:22 <tbachman> #undo
21:25:22 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Link object at 0x2848850>
21:25:40 <tbachman> #link https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan Proposed Lithium simultaneous release plan
21:25:49 <tbachman> #info hackers/design meetings are Monday morning 8am PST
21:26:05 <tbachman> #info possible resources include Cisco: 3, HP: 1, Noiro: 1/4, Brocade: 2, Ericsson: ?, so 7+?
21:26:20 <tbachman> #info webex meetings
21:26:43 <tbachman> #info IRC channel — team may set one up
21:26:59 <tbachman> #info will put design on the wiki
21:28:15 <tbachman> #info Possibly review this on the TWS calls
21:29:32 <tbachman> #topic lithium requirements
21:33:37 <tbachman> #info hardening and performance is #1
21:35:58 <tbachman> #info details for hardening and performance: use streaming of NormalizedNode; configurable persistence; don’t serialize/stream NormalizedNode when the message is local
21:46:29 <tbachman> #info test bed requirements: 1 setup for single-node integration tests, 5 5-node clusters for testers, 1 5-node cluster for 1-day longevity tests, 1 5-node cluster for 1-week longevity tests, and 1 5-node cluster for 1-month longevity tests
21:47:07 <tbachman> #info other items: programmatic sharding and team config
21:48:34 <tbachman> #info other items: notifications
21:48:41 <tbachman> #undo
21:48:41 <odl_meetbot> Removing item from minutes: <MeetBot.ircmeeting.items.Info object at 0x2838c90>
21:49:25 <tbachman> #info other items: updates to data change deliveries (“me only”, “all listeners”)
21:49:32 <tbachman> #info other items: notifications
21:50:08 <tbachman> #info other items: finer-grained sharding
21:50:16 <tbachman> #info other items: data broker for clustered data store
21:54:23 <tbachman> #info performance numbers for GBP Unified Communications: 240 flows/second, 100k endpoints
21:57:36 <tbachman> #info performance numbers for GBP NFV: 10M endpoints
22:02:21 <tbachman> #action alagalah/tbachman to test use case numbers in data store, and report memory usage
22:03:06 <tbachman> #action clustering group to ask community for performance characteristics they’re looking for
22:05:49 <tbachman> #info maybe include reference configurations on the wiki
22:06:21 <tbachman> #info group may schedule some hackathons for clustering
22:14:03 <tbachman> #info other items: enhance RAFT implementation for openflowplugin
22:18:06 <tbachman> #action moiz and tpantelis will create bugs for known issues
22:21:11 <tbachman> #action jmedved to look into hackathons
22:21:22 <tbachman> #action alagalah to help set up IRC channel
22:21:59 <tbachman> #action alagalah to work on setting up TWS call for clustering
22:22:52 <tbachman> #action moiz to update design on wiki
22:23:54 <tbachman> #action jmedved to contact phrobb to set up webex for meetings
22:27:46 <tbachman> #endmeeting