17:02:55 <dlenrow> #startmeeting clustering and HA enhancements
17:02:55 <odl_meetbot> Meeting started Tue Sep 30 17:02:55 2014 UTC.  The chair is dlenrow. Information about MeetBot at http://ci.openstack.org/meetbot.html.
17:02:55 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:02:55 <odl_meetbot> The meeting name has been set to 'clustering_and_ha_enhancements'
17:13:30 <dlenrow> #info questions about mapping cluster instances to devices and division of work. Answer: that's out of scope for this discussion; this is about the distributed data store issues
17:13:33 <dfarrell07> Anyone taking notes atm?
17:13:38 <dfarrell07> ^^awesome
17:14:12 <dlenrow> #info resource links at end of deck point to documentation for those who want to learn more
17:21:32 <dlenrow> #info discussion about the need for and feasibility of a two-node cluster
17:22:37 <dlenrow> #info established that most customers want N < 3 for clustering. This is a market requirement
17:25:06 <dlenrow> #info discussion followed about different models and limitations of a two-node cluster. Established that survivability is less certain and failure modes are worse in a two-node Akka cluster. Two nodes can work with master/slave, but we want a single model for apps to address
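The point about two-node survivability comes down to majority quorum. A minimal sketch of that arithmetic (illustrative only, not ODL or Akka code):

```python
# Hypothetical illustration of majority-quorum sizing.
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

for n in (2, 3, 5):
    q = quorum(n)
    tolerated = n - q  # nodes that can fail while a majority can still form
    print(f"{n}-node cluster: quorum={q}, tolerates {tolerated} failure(s)")
# 2-node cluster: quorum=2, tolerates 0 failure(s)
# 3-node cluster: quorum=2, tolerates 1 failure(s)
# 5-node cluster: quorum=3, tolerates 2 failure(s)
```

With N=2, losing either node (or any partition) blocks quorum-based writes, which is why two nodes only work with a master/slave arrangement rather than the single consensus model described above.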
17:26:46 <dlenrow> #info AT&T comments that they will need a big, globally distributed controller. Must be multiple instances.
17:27:55 <dlenrow> #info discussion that HA/Clustering does not replace federation for scaling. Rob A. points out that AT&T may well need both HA/Clustering and federation to achieve its ends
17:30:20 <dlenrow> #info eventual consistency can be achieved trivially with a two-node cluster, but many ODL requirements will need strong consistency.
17:32:06 <dlenrow> #info detailed discussion of the behavior of different options/models under failure conditions.
17:33:43 <dlenrow> #info discussion of tradeoffs between the number of writers and readers and their effect on performance.
17:36:56 <dlenrow> #info Rob A. claims that in most cases a single controller node can handle an enormous network (thousands of switches). Claims stats, analytics, etc. should not be built into ODL (or should at least use different instances than the basic control-plane capabilities).
17:37:36 <dlenrow> #info Rex redirects the presentation back to the slides about requirements and suggests a separate break-out to continue discussing clustering principles and theory.
17:39:04 <dlenrow> #info slide 9: more detailed discussion of sharding and proposed enhancements for Lithium relative to Helium
17:40:27 <dlenrow> #info question regarding automatic re-balancing of shards: is this in the Lithium plan? Answer: not planned, but a good suggestion
17:42:38 <dlenrow> #info question: can we get away with not supporting transactions across shards? If we can live with this limitation, performance will improve.
17:43:11 <dlenrow> (sorry all really screwing up the #info prefix)
17:43:42 <dlenrow> #info statement made that the app designer designs the sharding strategy and can decide about this performance tradeoff.
17:44:35 <dlenrow> #info suggestion that cross-shard transactions can be disabled for folks who don't want to pay the performance price.
17:47:36 <dlenrow> #info Colin D.'s approach is to define shards in such a way that they don't require distributed transactions, building bounded domains of consistency with no attempt at consistency across them
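One way to picture the bounded-domains idea is a prefix-to-shard routing table: any transaction whose writes stay under a single prefix never spans shards. A minimal sketch, with hypothetical names and paths (not ODL APIs):

```python
# Hypothetical routing of data-tree paths to shards by prefix.
SHARD_BY_PREFIX = {
    "/inventory": "shard-inventory",
    "/topology": "shard-topology",
}

def shard_for(path: str) -> str:
    for prefix, shard in SHARD_BY_PREFIX.items():
        if path.startswith(prefix):
            return shard
    return "shard-default"

def is_single_shard_txn(paths: list[str]) -> bool:
    """True if every write lands in one shard, so no distributed transaction is needed."""
    return len({shard_for(p) for p in paths}) == 1

print(is_single_shard_txn(["/inventory/node/1", "/inventory/node/2"]))  # True
print(is_single_shard_txn(["/inventory/node/1", "/topology/link/7"]))   # False
```

An app designer who keeps each app's data under one prefix gets consistency within that domain and simply accepts that there is no consistency guarantee across domains.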
17:48:23 <dlenrow> #info question about what is meant by programmatic shard config.
17:49:03 <dlenrow> #info answer is that shard config is currently read only once at startup. To add an app/shard to a running controller we need a way to update config after startup.
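A minimal sketch of the gap being described, with purely hypothetical names: a registry whose contents are fixed at startup versus one that also accepts additions at runtime.

```python
# Hypothetical shard registry; illustrates startup-only vs. runtime-updatable configuration.
class ShardRegistry:
    def __init__(self, startup_config: dict[str, list[str]]):
        # Current behaviour as described: configuration is read once, here, at startup.
        self._shards = dict(startup_config)

    def add_shard(self, name: str, members: list[str]) -> None:
        # The requested capability: register a new app/shard on a running
        # controller without restarting it.
        if name in self._shards:
            raise ValueError(f"shard {name} already exists")
        self._shards[name] = members

registry = ShardRegistry({"shard-default": ["member-1", "member-2", "member-3"]})
registry.add_shard("shard-new-app", ["member-1", "member-2", "member-3"])
```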
17:52:08 <dlenrow> #info Colin: we need to get apps running with shards to see where we have bottlenecks and what we need to do for optimization.
17:56:55 <dlenrow> #info Jan asks should we set some performance/scaling requirements for Lithium?
17:57:20 <dlenrow> #info Project team agrees to this
17:57:34 <dlenrow> #info 10 minute break
18:17:47 <dlenrow> #info Slide 10 Autonomous Data Replication
18:19:36 <dlenrow> #info one of the links to resources is to the RAFT consensus paper for those who want to learn more
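For readers following that link, the core replication rule in RAFT can be sketched in a few lines (a simplified illustration, not the datastore's actual implementation): the leader appends an entry to its log and treats it as committed once a majority of the cluster has stored it.

```python
# Simplified, hypothetical RAFT-style commit rule for illustration only.
def committed(acks: int, cluster_size: int) -> bool:
    """An entry is committed once a majority of the cluster has stored it."""
    return acks >= cluster_size // 2 + 1

class Leader:
    def __init__(self, cluster_size: int):
        self.cluster_size = cluster_size
        self.log: list[tuple[str, int]] = []  # (entry, number of replicas holding it)

    def append(self, entry: str) -> int:
        self.log.append((entry, 1))  # the leader stores its own copy first
        return len(self.log) - 1

    def on_follower_ack(self, index: int) -> bool:
        entry, acks = self.log[index]
        self.log[index] = (entry, acks + 1)
        return committed(acks + 1, self.cluster_size)

leader = Leader(cluster_size=3)
idx = leader.append("write /inventory/node/1")
print(leader.on_follower_ack(idx))  # True: 2 of 3 replicas now hold the entry
```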
18:21:16 <dlenrow> #info slide 12 shows evolution of features from Helium to planned Lithium
18:23:56 <dlenrow> #info slide 13 shows how distributed execution is made transparent by Akka services
18:25:13 <dlenrow> #info question: can we use any actors external to the controller to help us identify a partition and to recover?
18:25:39 <dlenrow> #info answer: have looked at this idea; not yet clear when or how to depend on/support it
18:28:10 <dlenrow> #info keith burns talks on use cases related to GBP and performance requirements.
18:29:28 <dlenrow> #info Colin D. asks for more clarification of the configuration of apps: a big distributed app scaled out versus a single instance running in a single cluster.
18:29:37 <dlenrow> #info answer is both requirements exist
18:32:39 <dlenrow> #info a bunch of discussion establishes that the answer is very app-specific
18:33:53 <dlenrow> #info Colin asks where you want app instances to run and where you want related events to go.
18:36:37 <dlenrow> #info room says we want all of the options named. Colin states that if you want all of them, you will get crappy performance or crappy usability
18:47:41 <dlenrow> #info question (AT&T): what persists across a total restart?
18:48:19 <dlenrow> #info answer: there was some discussion during the break. Ideally we want this configurable per shard, and we may also want the consistency model and backstore configuration per shard
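A minimal sketch of what "configurable per shard" could look like; the option names and values here are purely hypothetical, not real ODL configuration keys.

```python
# Hypothetical per-shard settings covering persistence, consistency model, and backstore.
SHARD_OPTIONS = {
    "shard-inventory": {"persistent": True, "consistency": "strong", "backstore": "disk"},
    "shard-stats": {"persistent": False, "consistency": "eventual", "backstore": "memory"},
}

def survives_total_restart(shard: str) -> bool:
    """Only shards persisted to disk would come back after a total restart."""
    opts = SHARD_OPTIONS.get(shard, {})
    return bool(opts.get("persistent")) and opts.get("backstore") == "disk"

print(survives_total_restart("shard-inventory"))  # True
print(survives_total_restart("shard-stats"))      # False
```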
18:50:20 <dlenrow> #info clarification that the Helium implementation is a POC intended to get us discussing the next layer of questions/answers about what we need to build
18:52:30 <dlenrow> #info AT&T: what are the knobs we will be able to turn, and how will we provide feedback to the designers to make sure it meets our needs?
18:54:48 <dlenrow> #info answer: we need you to work with us to get this right. If you test with the latest code and give feedback, this will help us prioritize
18:59:30 <dlenrow> #info AT&T: what is the deadline for input to affect Lithium planning?
19:00:12 <dlenrow> #info answer: sooner is better. No hard deadline. 4-6 weeks likely window for impact on Lithium
19:04:21 <dlenrow> #info question: does the client need to know which nodes are up/down and worry about which node requests are directed to?
19:04:58 <dlenrow> #info answer: We need to supplement with load balancers and/or VRRP to deal with the changing physical address.
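A minimal sketch of the client side of that answer, assuming a single virtual address fronted by VRRP or a load balancer; the hostname and retry policy are hypothetical.

```python
# Hypothetical client that only ever talks to a virtual address; VRRP or a load
# balancer maps that address to whichever controller instance is currently alive.
import time
import urllib.error
import urllib.request

CONTROLLER_VIP = "http://controller.example.net:8181"  # hypothetical virtual address

def get_with_retry(path: str, attempts: int = 3, backoff_s: float = 1.0) -> bytes:
    last_err = None
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(CONTROLLER_VIP + path, timeout=5) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            last_err = err  # failover in progress; the virtual address will move
            time.sleep(backoff_s)
    raise RuntimeError(f"controller unreachable via virtual address: {last_err}")
```

The client never learns individual node addresses; which instance answers is decided entirely by the VRRP/load-balancer layer.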
19:07:35 <dlenrow> #info discussion of techniques to make instance addresses transparent.
19:09:48 <dlenrow> #info last slide has contact emails and links to background info.
19:10:07 <dlenrow> #endmeeting