#opendaylight-devforum6: clustering and HA enhancements

Meeting started by dlenrow at 17:02:55 UTC (full logs).

Meeting summary

    1. questions about mapping cluster instances to devices and division of work. Answer: that's out of scope for this discussion; this is about the distributed data store issues (dlenrow, 17:13:30)
    2. resource links at end of deck point to documentation for those who want to learn more (dlenrow, 17:14:12)
    3. discussion about need and feasibility of a two node cluster (dlenrow, 17:21:32)
    4. establish that most customers want N < 3 for clustering. This is a market requirement (dlenrow, 17:22:37)
    5. discussion followed about different models and limitations of a two-node cluster. Established that survivability is less certain and failure modes are worse in a two-node Akka cluster. Two nodes can work with master/slave, but we want a single model for apps to address (dlenrow, 17:25:06)
    6. statement made that the app designer designs the sharding strategy and can decide about this performance tradeoff. (dlenrow, 17:43:42)
    7. suggestion that cross-shard transactions can be disabled for folks who don't want to pay the performance price. (dlenrow, 17:44:35)
    8. Colin D.'s approach is to define shards in such a way that they don't require distributed transactions, and to build bounded domains of consistency with no attempt at consistency across them (dlenrow, 17:47:36)
    9. question about what is meant by programmatic shard config. (dlenrow, 17:48:23)
    10. answer is that shard config is currently read only once at startup. To add an app/shard to a running controller we need a way to update config after startup. (dlenrow, 17:49:03)
    11. Colin: we need to get apps running with shards to see where we have bottlenecks and what we need to do for optimization. (dlenrow, 17:52:08)
    12. Jan asks should we set some performance/scaling requirements for Lithium? (dlenrow, 17:56:55)
    13. Project team agrees to this (dlenrow, 17:57:20)
    14. 10 minute break (dlenrow, 17:57:34)
    15. Slide 10 Autonomous Data Replication (dlenrow, 18:17:47)
    16. one of the links to resources is to the RAFT consensus paper for those who want to learn more (dlenrow, 18:19:36)
    17. slide 12 shows evolution of features from Helium to planned Lithium (dlenrow, 18:21:16)
    18. slide 13 shows how distributed execution is made transparent by Akka services (dlenrow, 18:23:56)
    19. question: Can we use any actors external to the controller to help us identify a partition and to recover? (dlenrow, 18:25:13)
    20. answer: Have looked at this idea. Not yet clear when or how to depend on or support it (dlenrow, 18:25:39)
    21. Keith Burns talks on use cases related to GBP and performance requirements. (dlenrow, 18:28:10)
    22. Colin D. asks for more clarification of the config of apps: a big distributed app scaled out versus a single instance running in a single cluster. (dlenrow, 18:29:28)
    23. answer is both requirements exist (dlenrow, 18:29:37)
    24. bunch of discussion establishes that the answer is very app specific (dlenrow, 18:32:39)
    25. Colin asks where you want app instances to run and where you want related events to go. (dlenrow, 18:33:53)
    26. room says we want all of the options named. Colin states that if you want all, you will get crappy performance or crappy usability (dlenrow, 18:36:37)
    27. question (AT&T): what persists across a total restart? (dlenrow, 18:47:41)
    28. answer: Was some discussion during break. Ideally we want this configurable per shard, and may also want consistency model and backstore config per shard (dlenrow, 18:48:19)
    29. clarification that Helium stuff is POC and intended to get us to discussing the next layer of questions/answers about what we need to build (dlenrow, 18:50:20)
    30. AT&T: what are the knobs we will be able to turn, and how will we provide feedback to designers to make sure it meets needs? (dlenrow, 18:52:30)
    31. answer: We need you to work with us to get this right. If you do testing with latest code and give feedback this will help us prioritize (dlenrow, 18:54:48)
    32. AT&T: what is the deadline for input to affect Lithium planning? (dlenrow, 18:59:30)
    33. answer: sooner is better. No hard deadline. 4-6 weeks likely window for impact on Lithium (dlenrow, 19:00:12)
    34. question: does client need to know which nodes are up/down and worry about which node requests are directed to? (dlenrow, 19:04:21)
    35. answer: We need to supplement with load balancers and/or VRRP to deal with the changing physical address. (dlenrow, 19:04:58)
    36. discussion of techniques to make instance addresses transparent. (dlenrow, 19:07:35)
    37. last slide has contact emails and links to background info. (dlenrow, 19:09:48)
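The two-node cluster concerns in items 3-5, and the Raft consensus paper linked from the deck (item 16), both come down to majority-quorum arithmetic. As a minimal sketch (not OpenDaylight or Akka code), the tradeoff can be shown directly:

```python
# Why a two-node cluster has worse failure modes than a three-node one
# under majority-quorum consensus (e.g. Raft). Illustration only.

def majority(n: int) -> int:
    """Smallest number of nodes that forms a quorum of n."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """How many nodes can fail while a quorum still exists."""
    return n - majority(n)

for n in (2, 3, 5):
    print(f"{n} nodes: quorum={majority(n)}, "
          f"tolerates {tolerable_failures(n)} failure(s)")
# A two-node cluster needs both nodes for quorum, so losing either node
# (or partitioning them) halts consensus-based writes; three nodes
# tolerate one failure. This is why N < 3 forces a different model,
# such as the master/slave approach mentioned in item 5.
```

This is the arithmetic behind the observation that survivability is less certain with two nodes: quorum-based replication gains nothing from the second node's failure tolerance.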


Meeting ended at 19:10:07 UTC (full logs).

Action items

  1. (none)


People present (lines said)

  1. dlenrow (51)
  2. odl_meetbot (3)
  3. dfarrell07 (2)


Generated by MeetBot 0.1.4.