#opendaylight-devforum6: clustering and HA enhancements

Meeting started by dlenrow at 17:02:55 UTC (full logs).

Meeting summary

    1. questions about mapping cluster instances to devices and division of work. Answer: that's out of scope for this discussion; this is about the distributed data store issues (dlenrow, 17:13:30)
    2. resource links at end of deck point to documentation for those who want to learn more (dlenrow, 17:14:12)
    3. discussion about need and feasibility of a two node cluster (dlenrow, 17:21:32)
    4. establish that most customers want N < 3 for clustering. This is a market requirement (dlenrow, 17:22:37)
    5. discussion followed about different models and limitations of a two-node cluster. Established that survivability is less certain and failure modes are worse in a two-node Akka cluster. Two nodes can work with master/slave, but we want a single model for apps to address (dlenrow, 17:25:06)
    6. statement made that the app designer designs the sharding strategy and can decide about this performance tradeoff. (dlenrow, 17:43:42)
    7. suggestion that cross-shard transactions can be disabled for folks who don't want to pay the performance price. (dlenrow, 17:44:35)
    8. Colin D.'s approach is to define shards in such a way that they don't require distributed transactions, and to build bounded domains of consistency with no attempt at consistency across them (dlenrow, 17:47:36)
    9. question about what is meant by programmatic shard config. (dlenrow, 17:48:23)
    10. answer is that shard config is currently read only once at startup. To add an app/shard to a running controller we need a way to update config after startup. (dlenrow, 17:49:03)
    11. Colin: we need to get apps running with shards to see where we have bottlenecks and what we need to do for optimization. (dlenrow, 17:52:08)
    12. Jan asks should we set some performance/scaling requirements for Lithium? (dlenrow, 17:56:55)
    13. Project team agrees to this (dlenrow, 17:57:20)
    14. 10 minute break (dlenrow, 17:57:34)
    15. Slide 10 Autonomous Data Replication (dlenrow, 18:17:47)
    16. one of the links to resources is to the RAFT consensus paper for those who want to learn more (dlenrow, 18:19:36)
    17. slide 12 shows evolution of features from Helium to planned Lithium (dlenrow, 18:21:16)
    18. slide 13 shows how distributed execution is made transparent by Akka services (dlenrow, 18:23:56)
    19. question: Can we use any actors external to the controller to help us identify a partition and to recover? (dlenrow, 18:25:13)
    20. answer: Have looked at this idea. Not yet clear when or how to depend on or support it (dlenrow, 18:25:39)
    21. Keith Burns talks on use cases related to GBP and performance requirements. (dlenrow, 18:28:10)
    22. Colin D. asks for more clarification of the config of apps: a big distributed app scaled out versus a single instance running in a single cluster. (dlenrow, 18:29:28)
    23. answer is both requirements exist (dlenrow, 18:29:37)
    24. bunch of discussion establishes that the answer is very app specific (dlenrow, 18:32:39)
    25. Colin asks where you want app instances to run and where you want related events to go. (dlenrow, 18:33:53)
    26. room says we want all of the options named. Colin states that if you want all, you will get crappy performance or crappy usability (dlenrow, 18:36:37)
    27. question (AT&T): what persists across a total restart? (dlenrow, 18:47:41)
    28. answer: Was some discussion during break. Ideally we want this configurable per shard, and may also want consistency model and backstore config per shard (dlenrow, 18:48:19)
    29. clarification that Helium stuff is POC and intended to get us to discussing the next layer of questions/answers about what we need to build (dlenrow, 18:50:20)
    30. AT&T: what are the knobs we will be able to turn, and how will we provide feedback to designers to make sure it meets needs? (dlenrow, 18:52:30)
    31. answer: We need you to work with us to get this right. If you do testing with latest code and give feedback this will help us prioritize (dlenrow, 18:54:48)
    32. AT&T: what is the deadline for input to affect Lithium planning? (dlenrow, 18:59:30)
    33. answer: sooner is better. No hard deadline. 4-6 weeks likely window for impact on Lithium (dlenrow, 19:00:12)
    34. question: does client need to know which nodes are up/down and worry about which node requests are directed to? (dlenrow, 19:04:21)
    35. answer: We need to supplement with load balancers and/or VRRP to deal with the changing physical address. (dlenrow, 19:04:58)
    36. discussion of techniques to make instance addresses transparent. (dlenrow, 19:07:35)
    37. last slide has contact emails and links to background info. (dlenrow, 19:09:48)
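The two-node cluster concerns in items 3-5, and the Raft consensus paper linked from the deck (item 16), both come down to majority-quorum arithmetic. As a minimal sketch (not OpenDaylight or Akka code), the tradeoff can be shown directly:

```python
# Why a two-node cluster has worse failure modes than a three-node one
# under majority-quorum consensus (e.g. Raft). Illustration only.

def majority(n: int) -> int:
    """Smallest number of nodes that forms a quorum of n."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """How many nodes can fail while a quorum still exists."""
    return n - majority(n)

for n in (2, 3, 5):
    print(f"{n} nodes: quorum={majority(n)}, "
          f"tolerates {tolerable_failures(n)} failure(s)")
# A two-node cluster needs both nodes for quorum, so losing either node
# (or partitioning them) halts consensus-based writes; three nodes
# tolerate one failure. This is why N < 3 forces a different model,
# such as the master/slave approach mentioned in item 5.
```

This is the arithmetic behind the observation that survivability is less certain with two nodes: quorum-based replication gains nothing from the second node's failure tolerance.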


Meeting ended at 19:10:07 UTC (full logs).

Action items

  1. (none)


People present (lines said)

  1. dlenrow (51)
  2. odl_meetbot (3)
  3. dfarrell07 (2)


Generated by MeetBot 0.1.4.