=====================================
#opendaylight-meeting: odl-clustering
=====================================

Meeting started by tbachman at 16:37:12 UTC. The full logs are available at
http://meetings.opendaylight.org/opendaylight-meeting/2014/odl_clustering/opendaylight-meeting-odl_clustering.2014-10-01-16.37.log.html .

Meeting summary
---------------
* agenda (tbachman, 16:37:53)
* technical aspects: strengths, weaknesses; where we are going (deployments/requirements, stable helium, lithium, long-term) (tbachman, 16:39:56)
* team-project aspects: coding, integration, testing (application, longevity) (tbachman, 16:40:18)
* alagalah asks why clustering is a feature as opposed to a core component of the controller (tbachman, 16:40:52)
* jmedved says that, from a historical perspective, the SAL is there for applications and can hide where applications are located; clustering would be solved somewhere in the SAL at some later point (tbachman, 16:42:13)
* after presentations on what was done with APIC and akka, those seemed to be good concepts which should be adopted and could be used inside the MD-SAL (tbachman, 16:44:21)
* so akka provides the messaging for the message bus (tbachman, 16:44:47)
* moiz notes we should probably start with requirements (tbachman, 16:46:26)
* LINK: https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit (tbachman, 16:46:36)
* requirements (tbachman, 16:47:58)
* GBP draft scaling requirements: #link https://wiki.opendaylight.org/view/Group_Policy:Scaling (alagalah, 16:48:23)
* LINK: https://wiki.opendaylight.org/view/Group_Policy:Scaling <= wiki page describing requirements for GBP (tbachman, 16:49:07)
* md-sal is a framework for building applications (tbachman, 16:50:47)
* it has RPCs, data, and notifications (tbachman, 16:50:54)
* rovarga points out that data is really data + notifications (tbachman, 16:51:17)
* for clustering, when an RPC is made and the target instance is on a different controller, the call has to be routed to the appropriate controller (tbachman, 16:52:10)
* the clustering architecture maintains a registry of all the service calls (tbachman, 16:52:45)
* for example: on instance 1, service foo is available; on instance 2, service foo is available (tbachman, 16:53:12)
* gossip is used as an eventually consistent protocol to distribute this registry (tbachman, 16:53:27)
* as a concrete example, the service could be the openflowplugin, which has an add-flow method (tbachman, 16:54:31)
* where switch 1 is on instance 1, and switch 2 is on instance 2 (tbachman, 16:54:49)
* so the registry would show that instance 1 can provide addflow for switch 1 (tbachman, 16:55:13)
* is there a mechanism to ask what services are provided? (tbachman, 16:55:41)
* we don't have that but could provide it (tbachman, 16:55:47)
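
A minimal sketch of the routed-RPC registry idea in the bullets above, using hypothetical names (RpcRegistrySketch, RouteKey) rather than actual MD-SAL classes; in the real system the announcements would be distributed between cluster members by gossip::

  import java.util.Map;
  import java.util.Set;
  import java.util.concurrent.ConcurrentHashMap;

  // Sketch only: each member announces which RPCs it can serve for which
  // route (e.g. "addFlow" for switch 1), and callers look up the owning
  // instance before invoking the RPC remotely.
  public final class RpcRegistrySketch {

      // A route is identified by the RPC name plus the routing context,
      // e.g. ("addFlow", "openflow:1").
      record RouteKey(String rpcName, String routePath) {}

      // route -> name of the controller instance that can serve it
      private final Map<RouteKey, String> routes = new ConcurrentHashMap<>();

      // Called when a local provider registers; the announcement would then
      // be spread to the other members.
      public void announce(String rpcName, String routePath, String instance) {
          routes.put(new RouteKey(rpcName, routePath), instance);
      }

      // Where should an invocation of this RPC be routed?
      public String ownerOf(String rpcName, String routePath) {
          return routes.get(new RouteKey(rpcName, routePath));
      }

      // The "what services are provided?" query raised in the meeting.
      public Set<RouteKey> knownRoutes() {
          return Set.copyOf(routes.keySet());
      }

      public static void main(String[] args) {
          RpcRegistrySketch registry = new RpcRegistrySketch();
          registry.announce("addFlow", "openflow:1", "member-1");
          registry.announce("addFlow", "openflow:2", "member-2");
          // addFlow for switch 1 is served by member-1, as in the example above
          System.out.println(registry.ownerOf("addFlow", "openflow:1"));
      }
  }
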
* there is an optimization where, if the RPC is local, it doesn't hit this path at all (i.e. it just makes the local call directly without any translation) (tbachman, 16:56:41)
* from an architectural perspective, clustering is just a layer of view (tbachman, 16:58:42)
* when designing an application, you typically use the binding-aware part, but internally whatever is stored in the data store is in the normalized format (tbachman, 17:03:35)
* the binding-aware data broker talks to the DOM data broker, which works on the DOM store (tbachman, 17:04:09)
* the DOM store has two implementations: an in-memory store and a clustered data store (tbachman, 17:04:55)
* the current implementation uses the instance identifier, which contains the namespace of the module, to implement a strategy that creates shards based on the namespace (tbachman, 17:06:29)
* users would write code for their own sharding strategy (tbachman, 17:07:29)
* the operational and configuration data stores are already separated (tbachman, 17:08:10)
* folks commented that the operational data could be kept in memory and the configuration data could be persisted (tbachman, 17:11:44)
* everything stored in the clustered data store is journaled, and snapshots are created (tbachman, 17:13:32)
* this allows a restarted cluster to have its configuration recreated (tbachman, 17:14:03)
* moiz says that we still have to perform a test of getting the data from disk versus getting it from the leader (tbachman, 17:20:43)
* RAFT, a consensus algorithm, is being used (tbachman, 17:25:56)
* instead of inventing a new algorithm, RAFT was selected (tbachman, 17:26:10)
* moiz presents logs of a leader and a follower (tbachman, 17:26:20)
* there's a term in the log, which is an election term, indicating that at this point this node was the leader (tbachman, 17:26:36)
* description of the RAFT algorithm ensues :) (tbachman, 17:27:32)
* moiz notes that configurable persistence is something that will be addressed (tbachman, 17:32:17)
* moiz says that each shard has its own in-memory data store (tbachman, 17:32:31)
* so the clustered data store uses the in-memory data store (tbachman, 17:32:56)
* the in-memory data store is the state of the shard (tbachman, 17:34:03)
* the part that takes care of persistence is a layer around the in-memory data store which ensures it persists to and recovers from disk (using akka) (tbachman, 17:35:26)
* to recover, it starts from a snapshot and replays the journal; if no snapshot is available, it starts from the beginning of the journal (tbachman, 17:36:17)
* snapshots are made every 20k journal entries, and this is globally configurable (tbachman, 17:36:29)
* regXboi asks if there's a way to snapshot on a time-scheduled basis (tbachman, 17:37:11)
* moiz says it's not done today, but akka persistence supports this (tbachman, 17:37:25)
* regXboi says it would be an addition, not an alternative (tbachman, 17:37:43)
* raghu67 notes the journal entries are also persisted, and the snapshot is only there to speed up recovery (tbachman, 17:38:06)
* regXboi says that anything that can be persisted can be corrupted (tbachman, 17:38:18)
* rovarga says a snapshot is a database checkpoint (tbachman, 17:38:36)
* those concerned could do a full data reconciliation of the shard (tbachman, 17:38:57)
* alagalah asks if there's anything like checksums for journal entries (tbachman, 17:40:08)
* moiz says we don't currently implement this (tbachman, 17:40:18)
* raghu67 says it's a LevelDB implementation, so there may be something (tbachman, 17:40:32)
* how do we detect journal / snapshot entry corruption? (alagalah, 17:41:00)
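
Per-entry checksums are not implemented today (per the bullets above); a hedged sketch of what a CRC32-framed journal entry could look like, with hypothetical framing that is not the actual LevelDB/akka-persistence format::

  import java.nio.ByteBuffer;
  import java.nio.charset.StandardCharsets;
  import java.util.zip.CRC32;

  // Sketch only: store a CRC32 next to each serialized entry and verify it
  // on recovery so corrupted entries can be detected.
  public final class JournalChecksumSketch {

      // Frame layout: [crc32][payload length][payload bytes]
      static byte[] frame(byte[] payload) {
          CRC32 crc = new CRC32();
          crc.update(payload);
          ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + Integer.BYTES + payload.length);
          buf.putLong(crc.getValue());
          buf.putInt(payload.length);
          buf.put(payload);
          return buf.array();
      }

      // Verify on recovery; returns the payload or throws if corrupted.
      static byte[] unframe(byte[] framed) {
          ByteBuffer buf = ByteBuffer.wrap(framed);
          long storedCrc = buf.getLong();
          byte[] payload = new byte[buf.getInt()];
          buf.get(payload);
          CRC32 crc = new CRC32();
          crc.update(payload);
          if (crc.getValue() != storedCrc) {
              throw new IllegalStateException("journal entry failed checksum");
          }
          return payload;
      }

      public static void main(String[] args) {
          byte[] entry = "write /inventory/node/openflow:1".getBytes(StandardCharsets.UTF_8);
          byte[] onDisk = frame(entry);
          System.out.println(new String(unframe(onDisk), StandardCharsets.UTF_8));
      }
  }
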
* rovarga says that a sharding strategy is not just how you carve up your data, but also what reconciliation guarantees you provide (tbachman, 17:42:55)
* also how many replicas you keep, how paranoid you are, whether you persist at all, etc. (rovarga, 17:43:46)
* rovarga says with checkpoints you can do some sanity checking, but this is a paranoia/performance trade-off (tbachman, 17:45:15)
* because of variability in transaction size in the journal, there was also discussion of whether size should be a snapshotting criterion, i.e. rather than 20k entries, snapshot when the journal reaches xxxx MB (alagalah, 17:46:24)
* persistence has sub-topics: snapshotting, journaling (tbachman, 17:46:47)
* replication (tbachman, 17:47:37)
* LINK: https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit <= google doc capturing some elements from the meeting (tbachman, 17:49:23)
* regXboi says that since every shard has a single leader, there is no multi-master scenario (tbachman, 17:51:06)
* moiz says that with the MD-SAL data store, we need to have a single master (tbachman, 17:51:29)
* regXboi says there are some interesting usage scenarios that a single master can't touch (tbachman, 17:51:44)
* multi-master wasn't addressed for clustering in helium, as it's more complex (tbachman, 17:52:07)
* regXboi says that once you achieve geographic diversity, you don't care what the sharding strategy is, and multi-master becomes an issue (tbachman, 17:53:35)
* moiz says there are use cases for multi-master, but it's harder to do (tbachman, 17:53:51)
* moiz asks what types of things applications will want to do (e.g. cluster-aware apps?) (tbachman, 17:54:18)
* moiz says that combining multi-master with transactions is questionable (tbachman, 17:55:34)
* moiz says the current in-memory data store has a single committer, so we may have to move away from this implementation of the in-memory data store to support multi-master (tbachman, 17:56:43)
* to improve performance, the shards have to be co-located with the shard leaders (tbachman, 18:10:45)
* this raises the question of needing notifications when shard leaders change (tbachman, 18:11:12)
* rovarga says that tuning requires going to the next layer and writing additional code to optimize (tbachman, 18:13:13)
* as an example, using the openflowplugin with switches, the applications may care about where the shard leader is and move the leader if needed (tbachman, 18:14:41)
* each switch in the network can have its own shard, which is colocated with the master instance that manages that switch (tbachman, 18:18:14)
* another approach is to have a shard contain the switches that the instance owns (tbachman, 18:19:47)
* regXboi asks how you define that an openflow switch is colocated with a controller instance (tbachman, 18:21:15)
* this is logical control colocation, not physical (tbachman, 18:21:40)
* moiz continues with a 3-node replication example (tbachman, 18:22:35)
* the replicas go through an election process (tbachman, 18:22:44)
* as soon as a shard comes up, it's a follower, and it waits to be contacted by a leader for some period of time (tbachman, 18:23:12)
* after the timeout it becomes a candidate and seeks votes (tbachman, 18:23:31)
* once the candidate receives a majority of votes, it becomes the leader and sends heartbeats to all the other nodes (tbachman, 18:23:53)
* the number of nodes is defined in configuration, so a node knows how many votes are needed for a majority (tbachman, 18:24:15)
* this means you can't have a 3-node cluster with only one node coming up and have it become the leader (tbachman, 18:25:37)
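
A minimal sketch of the election behaviour just described, using hypothetical names rather than the actual ODL RAFT actor code: a replica starts as a follower, becomes a candidate after its election timeout, and becomes leader only with a majority of the configured cluster size::

  import java.util.concurrent.ThreadLocalRandom;

  public final class RaftElectionSketch {

      enum Role { FOLLOWER, CANDIDATE, LEADER }

      private final int clusterSize;    // known from configuration
      private Role role = Role.FOLLOWER;
      private long currentTerm = 0;
      private int votesReceived = 0;

      RaftElectionSketch(int clusterSize) {
          this.clusterSize = clusterSize;
      }

      // Randomized timeout so that members don't all become candidates at once.
      long nextElectionTimeoutMillis() {
          return ThreadLocalRandom.current().nextLong(150, 300);
      }

      // No heartbeat from a leader arrived in time: start an election.
      void onElectionTimeout() {
          role = Role.CANDIDATE;
          currentTerm++;
          votesReceived = 1;            // votes for itself
          // ...would now send a vote request for currentTerm to all other members
      }

      // A vote was granted by another member.
      void onVoteGranted() {
          votesReceived++;
          if (role == Role.CANDIDATE && votesReceived > clusterSize / 2) {
              role = Role.LEADER;       // now starts sending heartbeats
          }
      }

      public static void main(String[] args) {
          RaftElectionSketch node = new RaftElectionSketch(3);
          node.onElectionTimeout();     // its own vote alone is not a majority of 3
          node.onVoteGranted();         // second vote: 2 of 3 -> leader
          System.out.println(node.role + " in term " + node.currentTerm);
      }
  }
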
* all the transaction requests are forwarded to the leader (tbachman, 18:25:50)
* an example case is provided with a single leader and two followers (tbachman, 18:26:14)
* when a commit happens, the first thing the leader does is write to the journal (tbachman, 18:27:14)
* at the same time, replicas are sent to the followers (tbachman, 18:27:39)
* the followers then write them to their local journals (tbachman, 18:28:00)
* each follower reports back after writing to its journal, and the leader waits for the commit index to indicate that the followers have completed (tbachman, 18:28:45)
* at this point, the leader's in-memory data store can be updated (tbachman, 18:29:05)
* the current in-memory data store requires that the current transaction is completed before another transaction can be submitted (only one transaction in the can-commit and pre-commit states at a time) (tbachman, 18:30:27)
* rovarga says that within a shard, you want serializable consistency (tbachman, 18:32:21)
* each follower creates a new local transaction to commit the replica to its local in-memory data store (tbachman, 18:33:35)
* well... causal consistency is probably good enough, but the IMDS does serializable simply because extracting/analyzing potential parallelism has a performance cost comparable to just doing the actual work (rovarga, 18:34:26)
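
A sketch of the commit sequence described above, assuming hypothetical Follower and journal types rather than the real shard actors: the leader journals the entry, replicates it, waits for the follower acknowledgements, and only then applies the change to its in-memory state::

  import java.util.List;
  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.CopyOnWriteArrayList;

  public final class ShardCommitSketch {

      interface Follower {
          // Follower persists the replica to its own journal and acknowledges.
          CompletableFuture<Void> replicate(String entry);
      }

      private final List<String> journal = new CopyOnWriteArrayList<>();
      private final ConcurrentHashMap<String, String> inMemoryStore = new ConcurrentHashMap<>();
      private final List<Follower> followers;

      ShardCommitSketch(List<Follower> followers) {
          this.followers = followers;
      }

      CompletableFuture<Void> commit(String key, String value) {
          String entry = key + "=" + value;
          journal.add(entry);                       // 1. leader writes its journal
          List<CompletableFuture<Void>> acks = followers.stream()
                  .map(f -> f.replicate(entry))     // 2. replicas sent to the followers
                  .toList();
          // 3. wait until followers have journalled the entry (commit index advances);
          //    a real implementation only needs a majority, this sketch waits for all
          return CompletableFuture.allOf(acks.toArray(CompletableFuture[]::new))
                  // 4. apply the change to the leader's in-memory data store
                  .thenRun(() -> inMemoryStore.put(key, value));
      }

      public static void main(String[] args) {
          Follower f1 = entry -> CompletableFuture.completedFuture(null);
          Follower f2 = entry -> CompletableFuture.completedFuture(null);
          ShardCommitSketch leader = new ShardCommitSketch(List.of(f1, f2));
          leader.commit("openflow:1/flow/1", "drop").join();
          System.out.println(leader.inMemoryStore);
      }
  }
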
* you need to have an event when becoming a leader to be able to catch up (tbachman, 18:35:54)
* there is no consistency guarantee across shard boundaries (tbachman, 18:45:28)
* if you want notifications from two different sub-trees, they would have to be in the same shard (tbachman, 18:46:41)
* within a shard, the data change notifications provide a serialized view of the changes (tbachman, 18:47:11)
* ghall voices concern about application writers having to know this level of complexity, and wonders if this can be managed using layering (tbachman, 18:52:46)
* raghu67 notes that we can support this, but each layer becomes domain specific (tbachman, 18:56:02)
* the group is trying to determine the notifications, etc. required of just the data store itself (tbachman, 18:56:23)
* layers can be placed above this to simplify things for application developers (tbachman, 18:56:42)
* jmedved says that remote notifications might be needed (tbachman, 18:57:46)
* raghu67 says we could use subscription, and deliver the notifications based on where they were registered (tbachman, 19:02:25)
* we're missing a notification when a follower becomes a leader (bug/add) (tbachman, 19:04:58)
* notifications (tbachman, 20:09:22)
* moiz asks, when we register ourselves to the consumer, whether we need to identify ourselves (tbachman, 20:09:53)
* and whether this is an API enhancement (tbachman, 20:10:20)
* rovarga says you can just do a QName, and that says "give me all notifications" (tbachman, 20:10:37)
* you can also have something more flexible, at which point the question becomes whether we declare it or just define a filter (tbachman, 20:10:59)
* moiz asks how you would request notifications only when they're local, versus all notifications (tbachman, 20:11:37)
* registerChangeListener(scope, identifier, listener) is what we have currently (tbachman, 20:12:48)
* do we enhance this API to be (scope, only me, identifier, listener)? (tbachman, 20:13:17)
* ttkacik says you could have two data brokers: local, and super shard (tbachman, 20:14:08)
* the case that we're trying to support with this is where the listener could be anywhere in the cluster (tbachman, 20:15:23)
* rovarga proposes an interface DataChangeListener that has an onDataChanged() method, with other interfaces that extend it, such as a NotifyOneOnly interface that adds a getIdentity() method (tbachman, 20:26:18)
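
A sketch of the listener shape rovarga proposed; the interface names mirror the proposal but are illustrative, not an existing MD-SAL API::

  public final class ListenerProposalSketch {

      // Placeholder for whatever change event the data broker delivers.
      record DataChangeEvent(String path) {}

      // Base contract: every listener just reacts to data changes.
      interface DataChangeListener {
          void onDataChanged(DataChangeEvent change);
      }

      // Extension for cluster-aware listeners that want the notification
      // delivered to only one member: the identity lets the infrastructure
      // pick a single recipient instead of notifying every registration.
      interface NotifyOneOnly extends DataChangeListener {
          String getIdentity();
      }

      public static void main(String[] args) {
          // A plain listener works unchanged on a single node or a cluster.
          DataChangeListener logAll = change ->
                  System.out.println("changed: " + change.path());

          // A cluster-aware listener opts in to "only notify one of us".
          NotifyOneOnly onlyMe = new NotifyOneOnly() {
              @Override public String getIdentity() { return "member-1/topology-manager"; }
              @Override public void onDataChanged(DataChangeEvent change) {
                  System.out.println(getIdentity() + " handling " + change.path());
              }
          };

          DataChangeEvent event = new DataChangeEvent("/inventory/node/openflow:1");
          logAll.onDataChanged(event);
          onlyMe.onDataChanged(event);
      }
  }
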
* regXboi: +1 to that idea (regXboi, 20:27:16)
* this makes applications cluster aware (tbachman, 20:27:24)
* this also allows code that works on one node to also work on multiple nodes (tbachman, 20:28:06)
* for cluster-aware applications, we block unless we're the leader (tbachman, 20:36:22)
* ghall asks if the model has to be a DAG; the answer is yes (tbachman, 20:59:23)
* ghall asks if there's a way to write code once so that, regardless of the sharding scheme of the leafrefs, the notifications get back to the listener (tbachman, 21:00:18)
* rovarga says we are honoring all explicitly stated references (tbachman, 21:01:24)
* logistics and planning (tbachman, 21:08:26)
* tpantelis asked if we need a sync-up call (tbachman, 21:08:36)
* jmedved agrees we need one (tbachman, 21:08:42)
* discussion on resources: HP, Brocade, and Cisco are all looking to contribute resources (tbachman, 21:09:00)
* plan is to leverage the Monday 8am PST MD-SAL meetings for covering clustering (tbachman, 21:11:20)
* LINK: https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan <= proposed Lithium simultaneous release plan (tbachman, 21:25:40)
* hackers/design meetings are Monday mornings at 8am PST (tbachman, 21:25:49)
* possible resources include Cisco: 3, HP: 1, Noiro: 1/4, Brocade: 2, Ericsson: ?, so 7+? (tbachman, 21:26:05)
* webex meetings (tbachman, 21:26:20)
* IRC channel: the team may set one up (tbachman, 21:26:43)
* will put the design on the wiki (tbachman, 21:26:59)
* possibly review this on the TWS calls (tbachman, 21:28:15)
* lithium requirements (tbachman, 21:29:32)
* hardening and performance is #1 (tbachman, 21:33:37)
* details for hardening and performance: use streaming of NormalizedNode; configurable persistence; don't serialize/stream NormalizedNode when the message is local (tbachman, 21:35:58)
* test bed requirements: 1 setup for 1-node integration tests, 5 5-node clusters for testers, 1 5-node cluster for 1-day longevity tests, 1 5-node cluster for 1-week longevity tests, and 1 5-node cluster for 1-month longevity tests (tbachman, 21:46:29)
* other items: programmatic sharding and team config (tbachman, 21:47:07)
* other items: updates to data change deliveries ("me only", "all listeners") (tbachman, 21:49:25)
* other items: notifications (tbachman, 21:49:32)
* other items: finer-grained sharding (tbachman, 21:50:08)
* other items: data broker for the clustered data store (tbachman, 21:50:16)
* performance numbers for GBP Unified Communications: 240 flows/second, 100k endpoints (tbachman, 21:54:23)
* performance numbers for GBP NFV: 10M endpoints (tbachman, 21:57:36)
* ACTION: alagalah/tbachman to test use case numbers in the data store, and report memory usage (tbachman, 22:02:21)
* ACTION: clustering group to ask the community for performance characteristics they're looking for (tbachman, 22:03:06)
* maybe include reference configurations on the wiki (tbachman, 22:05:49)
* group may schedule some hackathons for clustering (tbachman, 22:06:21)
* other items: enhance the RAFT implementation for the openflowplugin (tbachman, 22:14:03)
* ACTION: moiz and tpantelis will create bugs for known issues (tbachman, 22:18:06)
* ACTION: jmedved to look into hackathons (tbachman, 22:21:11)
* ACTION: alagalah to help set up an IRC channel (tbachman, 22:21:22)
* ACTION: alagalah to work on setting up a TWS call for clustering (tbachman, 22:21:59)
* ACTION: moiz to update the design on the wiki (tbachman, 22:22:52)
* ACTION: jmedved to contact phrobb to set up webex for meetings (tbachman, 22:23:54)

Meeting ended at 22:27:46 UTC.

Action items, by person
-----------------------
* alagalah
  * alagalah/tbachman to test use case numbers in the data store, and report memory usage
  * alagalah to help set up an IRC channel
  * alagalah to work on setting up a TWS call for clustering
* jmedved
  * jmedved to look into hackathons
  * jmedved to contact phrobb to set up webex for meetings
* tbachman
  * alagalah/tbachman to test use case numbers in the data store, and report memory usage
* **UNASSIGNED**
  * clustering group to ask the community for performance characteristics they're looking for
  * moiz and tpantelis will create bugs for known issues
  * moiz to update the design on the wiki

People present (lines said)
---------------------------
* tbachman (210)
* regXboi (23)
* odl_meetbot (17)
* alagalah (7)
* gzhao (2)
* rovarga (2)
* harman_ (2)
* raghu67 (1)
* jmedved (0)

Generated by `MeetBot`_ 0.1.4