#opendaylight-meeting: odl-clustering
Meeting started by tbachman at 16:37:12 UTC
(full logs).
Meeting summary
- agenda (tbachman, 16:37:53)
- technical aspects: strengths, weaknesses; where
we are going (deployments/requirements, stable helium, lithium,
long-term) (tbachman,
16:39:56)
- team-project aspects: coding, integration,
testing (application, longevity) (tbachman, 16:40:18)
- alagalah asks why clustering is a feature as
opposed to a core component of the controller (tbachman,
16:40:52)
- jmedved says from a historical perspective,
the SAL is there for applications and can hide the locality of
where the applications are; clustering would be solved somewhere in
the SAL at some point later (tbachman, 16:42:13)
- after presentations on what was done with APIC
and akka, those seemed like good concepts that should be adopted
and could be used inside the MD-SAL (tbachman, 16:44:21)
- so akka provides the messaging for the message
bus (tbachman,
16:44:47)
- moiz notes we should probably start with
requirements (tbachman, 16:46:26)
- https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit
(tbachman,
16:46:36)
- requirements (tbachman, 16:47:58)
- GBP draft scaling requirements:
https://wiki.opendaylight.org/view/Group_Policy:Scaling (alagalah, 16:48:23)
- https://wiki.opendaylight.org/view/Group_Policy:Scaling
<= wiki page describing requirements for GBP (tbachman, 16:49:07)
- md-sal is a framework for building
applications (tbachman,
16:50:47)
- it has RPC, data, and notifications
(tbachman,
16:50:54)
- rovarga points out that data is really data +
notifications (tbachman,
16:51:17)
- for clustering, when making an RPC and the
instance is on a different controller, that call has to be routed to
the appropriate controller (tbachman,
16:52:10)
- the clustering arch maintains a registry of all
the service calls (tbachman,
16:52:45)
- for example: on instance 1, service foo is
available; on instance 2, service foo is available (tbachman, 16:53:12)
- gossip is used as an eventually consistent
protocol to distribute this registry (tbachman,
16:53:27)
- as a concrete example, if the service is the
openflowplugin, and there is a method, add flow (tbachman,
16:54:31)
- where switch 1 is on instance 1, and switch 2
is on instance 2 (tbachman,
16:54:49)
- so the registry would show it can provide
addflow for switch 1 on instance 1 (tbachman,
16:55:13)
- is there a mechanism to ask what services are
provided? (tbachman,
16:55:41)
- we don’t have that but could provide it
(tbachman,
16:55:47)
- there is an optimization where if the RPC is
local, it doesn’t hit this path at all (i.e. just makes the local
call directly w/o any translation) (tbachman,
16:56:41)
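
A minimal sketch of the routed-RPC registry described above: service/route registrations map to the owning instance, and a local call skips the routing path entirely. The names here (RpcRegistry, RpcInvoker, forwardTo) are hypothetical, not actual MD-SAL types; in the real design the registry contents are spread between instances via gossip.

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    // Registry of routed RPCs: (service, route id) -> owning cluster instance.
    class RpcRegistry {
        private final Map<String, String> routes = new ConcurrentHashMap<>();

        void register(String service, String routeId, String instance) {
            routes.put(service + "/" + routeId, instance);   // e.g. "addFlow/switch1" -> "instance1"
        }

        Optional<String> ownerOf(String service, String routeId) {
            return Optional.ofNullable(routes.get(service + "/" + routeId));
        }
    }

    class RpcInvoker {
        private final RpcRegistry registry;
        private final String selfInstance;

        RpcInvoker(RpcRegistry registry, String selfInstance) {
            this.registry = registry;
            this.selfInstance = selfInstance;
        }

        Object invoke(String service, String routeId, Object input) {
            String owner = registry.ownerOf(service, routeId)
                    .orElseThrow(() -> new IllegalStateException("no provider for " + service));
            if (owner.equals(selfInstance)) {
                return invokeLocally(service, input);         // local optimization: no routing, no translation
            }
            return forwardTo(owner, service, routeId, input); // routed to the owning controller instance
        }

        private Object invokeLocally(String service, Object input) { return null; /* dispatch to local impl */ }
        private Object forwardTo(String owner, String service, String routeId, Object input) { return null; /* remote call */ }
    }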
- from an architectural perspective, clustering
is just a layer of view (tbachman,
16:58:42)
- when designing an application, you typically
use the binding-aware part, but internally, whatever’s stored in the
data store is in the normalized format (tbachman, 17:03:35)
- The binding aware data broker talks to the dom
data broker, which works on the DOM store (tbachman,
17:04:09)
- the DOM store has two implementations: an
in-memory store and a clustered data store (tbachman,
17:04:55)
- the current implementation uses the instance
identifier, which contains the namespace of the module, to implement
a strategy to create shards based on the namespace (tbachman,
17:06:29)
- users would write code for their sharding
strategy (tbachman,
17:07:29)
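
A sketch of what such a user-supplied sharding strategy might look like, assuming a hypothetical ShardStrategy interface keyed on the module namespace taken from the instance identifier; the actual controller interface may differ.

    // Hypothetical namespace-based shard selection, as described above.
    interface ShardStrategy {
        String findShard(String moduleNamespace);
    }

    class ModuleShardStrategy implements ShardStrategy {
        @Override
        public String findShard(String moduleNamespace) {
            if (moduleNamespace.contains("inventory")) {
                return "inventory";                // e.g. inventory data lives in its own shard
            }
            if (moduleNamespace.contains("topology")) {
                return "topology";
            }
            return "default";                      // everything else falls into a default shard
        }
    }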
- the operational and configuration data stores
are already separated (tbachman, 17:08:10)
- folks commented that the operational data could
be kept in-memory and the configuration data could be
persisted (tbachman,
17:11:44)
- everything stored in the clustered data store is
journaled, and snapshots are created (tbachman, 17:13:32)
- This allows cluster restarts to have their
configuration recreated (tbachman,
17:14:03)
- moiz says that we still have to perform a test
of getting the data from disk or getting it from the leader
(tbachman,
17:20:43)
- RAFT is being used, which is a consensus
algorithm (tbachman,
17:25:56)
- instead of inventing a new algorithm, RAFT was
selected (tbachman,
17:26:10)
- moiz presents logs of a leader and a
follower (tbachman,
17:26:20)
- there’s a term in the log, which is an election
term, indicating that at this point this node was the leader
(tbachman, 17:26:36)
- description of RAFT algorithm ensues :)
(tbachman,
17:27:32)
- moiz notes that configurable persistence is
something that will be addressed (tbachman,
17:32:17)
- moiz says that each shard has its own
in-memory data store (tbachman, 17:32:31)
- so the clustered data store uses the in-memory
data store (tbachman,
17:32:56)
- the in-memory data store is the state of the
shard (tbachman,
17:34:03)
- the part that takes care of the persistence is
a layer around the in-memory data store which ensures it persists
and recovers from disk (using akka) (tbachman,
17:35:26)
- To recover, it starts from a snapshot and
replays the journal, and if no snapshot is available, it starts from
the beginning of the journal (tbachman,
17:36:17)
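
A plain-Java sketch of that recovery sequence (restore the latest snapshot, then replay only the newer journal entries); the real shard delegates this to akka persistence, and all types below are illustrative stand-ins.

    import java.util.List;

    class ShardRecovery {
        private final InMemoryStore state = new InMemoryStore();

        void recover(Snapshot latestSnapshot, List<JournalEntry> journal) {
            long replayFrom = 0;
            if (latestSnapshot != null) {
                state.restoreFrom(latestSnapshot);           // start from the last checkpoint...
                replayFrom = latestSnapshot.lastSequenceNr;  // ...and only replay newer entries
            }
            for (JournalEntry entry : journal) {
                if (entry.sequenceNr > replayFrom) {
                    state.apply(entry);                      // replay the journal on top of the snapshot
                }
            }
            // With no snapshot, replayFrom stays 0 and the whole journal is replayed.
        }
    }

    class InMemoryStore { void restoreFrom(Snapshot s) {} void apply(JournalEntry e) {} }
    class Snapshot { final long lastSequenceNr; Snapshot(long n) { this.lastSequenceNr = n; } }
    class JournalEntry { final long sequenceNr; JournalEntry(long n) { this.sequenceNr = n; } }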
- snapshots are made every 20k journal entries,
and this is globally configurable (tbachman, 17:36:29)
- regXboi asks if there’s a way for snapshotting
on a time-scheduled basis (tbachman,
17:37:11)
- moiz says it’s not done today, but akka
persistence supports this (tbachman,
17:37:25)
- regXboi says it would be in addition, not an
alternative (tbachman,
17:37:43)
- raghu67 notes the journal entries are also
persisted, and the snapshot is only there to speed up
recovery (tbachman,
17:38:06)
- regXboi says that anything that can be
persisted can be corrupted (tbachman,
17:38:18)
- rovarga says a snapshot is a database check
point (tbachman,
17:38:36)
- those concerned could do a full data
reconciliation of the shard (tbachman,
17:38:57)
- alagalah asks if there’s anything like
checksums for journal entries (tbachman,
17:40:08)
- moiz says we don’t currently implement
this (tbachman,
17:40:18)
- raghu67 says it’s a LevelDB implementation, so
there may be something (tbachman, 17:40:32)
- How do we detect journal/snapshot entry
corruption? (alagalah, 17:41:00)
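
A sketch of the checksum idea raised here, assuming a hypothetical JournalRecord wrapper that stores a CRC32 next to each serialized entry and verifies it before replay; this is not something the current implementation does.

    import java.util.zip.CRC32;

    class JournalRecord {
        final byte[] payload;   // serialized journal entry
        final long checksum;    // CRC recorded when the entry was written

        JournalRecord(byte[] payload) {
            this.payload = payload;
            this.checksum = crcOf(payload);
        }

        boolean isIntact() {
            return crcOf(payload) == checksum;   // verify before replaying the entry
        }

        private static long crcOf(byte[] bytes) {
            CRC32 crc = new CRC32();
            crc.update(bytes);
            return crc.getValue();
        }
    }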
- rovarga says that sharding strategy is not just
how you carve up your data, but also reconciliation
guarantees (tbachman,
17:42:55)
- also how many replicas you keep, how paranoid
you are, whether you persist at all, etc. (rovarga,
17:43:46)
- rovarga says with checkpoints you can do some
sanity checking, but this is a paranoia/performance
trade-off (tbachman, 17:45:15)
- discussion also covered whether, given the
variability of transaction sizes in the journal, SIZE should be a
snapshotting criterion, i.e. snapshot when the journal reaches
xxxx MB rather than after 20k entries (alagalah, 17:46:24)
- persistence has sub-topics: snapshotting,
journaling (tbachman,
17:46:47)
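
A sketch of how the snapshot trigger discussed above could combine the existing entry-count threshold (20k, globally configurable) with the proposed size- and time-based criteria; the SnapshotPolicy class and its fields are assumptions, not existing configuration.

    class SnapshotPolicy {
        private final long maxEntries;         // current behaviour: e.g. 20,000 journal entries
        private final long maxJournalBytes;    // proposed: snapshot when the journal reaches a given size
        private final long maxIntervalMillis;  // proposed: snapshot at least this often

        SnapshotPolicy(long maxEntries, long maxJournalBytes, long maxIntervalMillis) {
            this.maxEntries = maxEntries;
            this.maxJournalBytes = maxJournalBytes;
            this.maxIntervalMillis = maxIntervalMillis;
        }

        boolean shouldSnapshot(long entriesSinceSnapshot, long bytesSinceSnapshot, long millisSinceSnapshot) {
            return entriesSinceSnapshot >= maxEntries
                    || bytesSinceSnapshot >= maxJournalBytes
                    || millisSinceSnapshot >= maxIntervalMillis;
        }
    }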
- replication (tbachman, 17:47:37)
- https://docs.google.com/document/d/1mVQMQQgYTMSSeTby8I-fV-W3wdkwAr-cL6LT71v5Eko/edit
<= google doc capturing some elements from the meeting (tbachman,
17:49:23)
- regXboi says that since every shard has a
single leader, there is no multi-master scenario (tbachman,
17:51:06)
- moiz says that with the MD-SAL data store, we
need to have a single master (tbachman,
17:51:29)
- regXboi says there are some interesting usage
scenarios that a single-master can’t touch (tbachman,
17:51:44)
- multi-master wasn’t addressed for clustering
for helium, as it’s more complex (tbachman,
17:52:07)
- regXboi says that once you achieve geographic
diversity, you don’t care what the sharding strategy is, and
multi-master becomes an issue (tbachman,
17:53:35)
- moiz says there are use cases for multi-master,
but it’s harder to do (tbachman,
17:53:51)
- moiz asks what are the types of things
applications will want to do (e.g. cluster aware apps?) (tbachman,
17:54:18)
- moiz says that multi master and transactions is
questionable (tbachman,
17:55:34)
- moiz says the current in-memory data store has
a single committer, so we may have to move away from this
implementation of the in-memory data store to support
multi-master (tbachman, 17:56:43)
- to improve performance, the shard leaders have to be
co-located with the applications that use them (tbachman, 18:10:45)
- this raises the question of whether notifications
are needed when shard leaders change (tbachman, 18:11:12)
- rovarga says that tuning requires going to the
next layer of writing additional code to optimize (tbachman,
18:13:13)
- as an example, using openflowplugin with
switches, the applications may care about where the shard leader is
and move the leader if needed (tbachman,
18:14:41)
- each switch in the network can have its own
shard, which is colocated with the master instance that manages that
switch (tbachman, 18:18:14)
- another approach is to have a shard contain the
switches that the instance owns (tbachman,
18:19:47)
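
A hypothetical sketch contrasting the two options above: a shard per switch (so each shard can be colocated with the instance that masters that switch) versus a shard per owning instance that holds all of that instance's switches. Neither interface exists in the code base; they only illustrate the granularity choice.

    import java.util.Map;

    interface SwitchShardStrategy {
        String shardFor(String switchId);
    }

    // One shard per switch.
    class ShardPerSwitch implements SwitchShardStrategy {
        @Override
        public String shardFor(String switchId) {
            return "switch-" + switchId;
        }
    }

    // One shard per controller instance, holding all switches that instance owns.
    class ShardPerOwningInstance implements SwitchShardStrategy {
        private final Map<String, String> switchOwner;   // switchId -> owning instance

        ShardPerOwningInstance(Map<String, String> switchOwner) {
            this.switchOwner = switchOwner;
        }

        @Override
        public String shardFor(String switchId) {
            return "instance-" + switchOwner.get(switchId);
        }
    }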
- regXboi asks how you define that an openflow
switch is colocated with a controller instance (tbachman, 18:21:15)
- this is logical control colocation, not
physical (tbachman,
18:21:40)
- moiz continues with 3 node replication
example (tbachman,
18:22:35)
- the replicas go through an election
process (tbachman,
18:22:44)
- as soon as a shard comes up, it’s a follower,
and waits to be contacted by a leader for some period of time
(tbachman, 18:23:12)
- after the timeout it becomes a candidate and
seeks votes (tbachman,
18:23:31)
- once the candidate receives a majority of
votes, it becomes a leader and sends heartbeats to all the other
nodes (tbachman,
18:23:53)
- the number of nodes is defined in
configuration, so the node knows how many votes are needed for a
majority (tbachman,
18:24:15)
- this means you can’t have a 3-node cluster with
only one node coming up and have it become the leader (tbachman,
18:25:37)
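
A simplified sketch of that follower -> candidate -> leader transition, including the majority check derived from the configured cluster size; this is illustrative RAFT logic written in Java, not the controller's actual implementation.

    enum Role { FOLLOWER, CANDIDATE, LEADER }

    class RaftElection {
        private final int clusterSize;      // known from configuration; used to compute the majority
        private Role role = Role.FOLLOWER;  // every replica starts as a follower
        private long currentTerm = 0;
        private int votesReceived = 0;

        RaftElection(int clusterSize) {
            this.clusterSize = clusterSize;
        }

        // No heartbeat from a leader arrived within the election timeout.
        void onElectionTimeout() {
            role = Role.CANDIDATE;
            currentTerm++;
            votesReceived = 1;              // vote for ourselves, then request votes from the other nodes
        }

        // A peer granted us its vote.
        void onVoteGranted() {
            votesReceived++;
            if (role == Role.CANDIDATE && votesReceived > clusterSize / 2) {
                role = Role.LEADER;         // majority reached; start sending heartbeats to followers
            }
        }
        // In a 3-node cluster a single node can never collect the required 2 votes,
        // so a lone node stays a candidate and cannot become leader on its own.
    }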
- all the transaction requests are forwarded to
the leader (tbachman,
18:25:50)
- An example case is provided with a single
leader and two followers (tbachman,
18:26:14)
- When a commit happens, the first thing the
leader does is write to the journal (tbachman, 18:27:14)
- at the same time, replicas are sent to the
followers (tbachman,
18:27:39)
- the followers then write them to their local
journals (tbachman,
18:28:00)
- each follower reports back after writing to
its journal, and the leader waits for the commit index to indicate
that the followers have completed (tbachman, 18:28:45)
- at this point, the leader’s in-memory data
store can be updated (tbachman,
18:29:05)
- the current in-memory data store requires that
a current transaction is completed before another transaction can be
submitted (only one transaction in can-commit and pre-commit states
at a time) (tbachman,
18:30:27)
- rovarga says that within a shard, you want
serializable consistency (tbachman,
18:32:21)
- each follower creates a new local transaction
to commit the replica to their local in-memory data store
(tbachman,
18:33:35)
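
A sketch of the commit path just described: the leader journals the modification, replicates it to the followers, waits for the commit index to show the followers have journaled it, and only then applies it to its in-memory store, while followers apply it via their own local transactions. All names below are hypothetical.

    import java.util.List;

    class ShardLeader {
        private final Journal journal;
        private final List<Follower> followers;
        private final InMemoryDataStore store;

        ShardLeader(Journal journal, List<Follower> followers, InMemoryDataStore store) {
            this.journal = journal;
            this.followers = followers;
            this.store = store;
        }

        void commit(Modification modification) {
            long index = journal.append(modification);       // 1. leader writes its own journal
            for (Follower follower : followers) {
                follower.replicate(index, modification);     // 2. replicas sent to the followers, which journal them
            }
            waitUntilCommitIndexReaches(index);              // 3. wait until the followers have acknowledged
            store.apply(modification);                       // 4. only now update the leader's in-memory store
            // Followers apply the same modification through their own local
            // transactions once the commit index advances.
        }

        private void waitUntilCommitIndexReaches(long index) { /* block or register a callback */ }
    }

    interface Journal { long append(Modification m); }
    interface Follower { void replicate(long index, Modification m); }
    interface InMemoryDataStore { void apply(Modification m); }
    interface Modification {}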
- well... causal consistency is probably good
enough, but IMDS does serializable simply because
extracting/analyzing potential parallelism has a performance cost
comparable to just doing the actual work (rovarga, 18:34:26)
- you need to have an event when becoming a
leader to be able to catch up (tbachman,
18:35:54)
- there is no consistency guarantee across shard
boundaries (tbachman,
18:45:28)
- If you want notifications from two different
sub-trees, they would have to be in the same shard (tbachman,
18:46:41)
- within a shard, the data change notifications
provide a serialized view of the changes (tbachman,
18:47:11)
- ghall voices concern about application writers
having to know this level of complexity, and wonders if this can be
managed using layering (tbachman,
18:52:46)
- raghu67 notes that we can support this, but
each layer becomes domain specific (tbachman,
18:56:02)
- the group is trying to determine the
notifications, etc. required of just the data store itself
(tbachman,
18:56:23)
- and layers can be placed above this that can
simplify things for application developers (tbachman,
18:56:42)
- jmedved says that remote notifications might be
needed (tbachman,
18:57:46)
- raghu67 says we could use subscription, and
deliver the notifications based on where it was registered
(tbachman,
19:02:25)
- we’re missing a notification when a follower
becomes a leader (bug/add) (tbachman,
19:04:58)
- notifications (tbachman, 20:09:22)
- moiz asks whether, when we register ourselves as a
consumer, we need to identify ourselves (tbachman, 20:09:53)
- and whether this is an API enhancement
(tbachman,
20:10:20)
- rovarga says you can just use a QName, which
means “give me all notifications” (tbachman, 20:10:37)
- you can also have something more flexible, at
which point the question becomes are we going to declare it or just
define a filter (tbachman,
20:10:59)
- moiz asks how you would express getting
notifications only when they’re local, versus getting all
notifications (tbachman, 20:11:37)
- registerChangeListener(scope, identifier,
listener) is what we have currently (tbachman,
20:12:48)
- do we enhance this API, to be (scope, only me,
identifier, listener)? (tbachman,
20:13:17)
- ttkacik says you could have two data brokers:
local, and super shard (tbachman,
20:14:08)
- the case that we’re trying to support with this
is where the listener could be anywhere in the cluster (tbachman,
20:15:23)
- rovarga proposes an interface
DataChangeListener that has an onDataChanged() method, and other
interfaces that extend it, such as a NotifyOneOnly interface
that adds a getIdentity() method (tbachman, 20:26:18)
- regXboi +1 to that idea (regXboi,
20:27:16)
- this makes applications cluster-aware
(tbachman, 20:27:24)
- this also allows code that works on one node to
also work on multiple nodes (tbachman,
20:28:06)
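
A sketch of rovarga's proposal as summarized above: a base DataChangeListener plus optional extension interfaces such as NotifyOneOnly with getIdentity(). Only the shape comes from the discussion; the event, scope, and registration types are assumptions.

    // Base listener: what single-node application code already programs against.
    interface DataChangeListener {
        void onDataChanged(DataChangeEvent change);
    }

    // Optional extension: a listener that wants "notify exactly one of us" semantics;
    // the registry would use getIdentity() to pick a single recipient in the cluster.
    interface NotifyOneOnly extends DataChangeListener {
        Object getIdentity();
    }

    interface DataChangeEvent {}
    interface Scope {}   // placeholder for the scope argument of the existing API

    // Registration stays shaped like the existing registerChangeListener(scope, identifier, listener),
    // so the signature does not have to grow extra flags such as "only me".
    interface ChangeListenerService {
        AutoCloseable registerChangeListener(Scope scope, String identifier, DataChangeListener listener);
    }

Keeping the cluster awareness in optional extension interfaces is what lets code written against the base interface on one node keep working unchanged on multiple nodes.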
- for cluster-aware applications, we block unless
we’re the leader (tbachman,
20:36:22)
- ghall asks if the model has to be a DAG. Answer
is yes (tbachman,
20:59:23)
- ghall asks if there’s a way to write code once,
and regardless of the sharding scheme of the leafrefs, the
notifications will get back to the listener (tbachman,
21:00:18)
- rovarga says we are honoring all explicitly
stated references (tbachman,
21:01:24)
- logistics and planning (tbachman, 21:08:26)
- tpantelis asked if we need a sync up
call (tbachman,
21:08:36)
- jmedved agrees we need one (tbachman,
21:08:42)
- discussion on resources — HP, Brocade, Cisco
all looking to contribute resources (tbachman,
21:09:00)
- plan is to leverage the monday 8am PST MD-SAL
meetings for covering clustering (tbachman,
21:11:20)
- https://wiki.opendaylight.org/view/Simultaneous_Release:Lithium_Release_Plan
Proposed Lithium simultaneous release plan (tbachman,
21:25:40)
- hackers/design meetings are monday morning 8am
PST (tbachman,
21:25:49)
- possible resources include Cisco: 3, HP: 1,
Noiro: 1/4, Brocade: 2, Ericsson: ?, so 7+? (tbachman, 21:26:05)
- webex meetings (tbachman,
21:26:20)
- IRC channel — team may set one up (tbachman,
21:26:43)
- will put design on the wiki (tbachman,
21:26:59)
- Possibly review this on the TWS calls
(tbachman,
21:28:15)
- lithium requirements (tbachman, 21:29:32)
- hardening and performance is #1 (tbachman,
21:33:37)
- details for hardening and performance: use
streaming of NormalizedNode; configurable persistence; don’t
serialize/stream NormalizedNode when message is local (tbachman,
21:35:58)
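
A sketch of the “don’t serialize/stream NormalizedNode when the message is local” optimization listed above, with hypothetical sender and destination types; the point is simply to branch on locality before paying the streaming cost.

    class NormalizedNodeSender {
        void send(Object normalizedNode, Destination destination) {
            if (destination.isLocal()) {
                destination.deliverLocal(normalizedNode);     // hand over the object reference directly
            } else {
                byte[] bytes = streamToBytes(normalizedNode); // stream/serialize only for remote peers
                destination.deliverRemote(bytes);
            }
        }

        private byte[] streamToBytes(Object normalizedNode) {
            return new byte[0];   // stand-in for the NormalizedNode streaming writer
        }
    }

    interface Destination {
        boolean isLocal();
        void deliverLocal(Object node);
        void deliverRemote(byte[] bytes);
    }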
- test bed requirements: 1 setup for single-node
integration tests, 5 5-node clusters for testers, 1 5-node cluster
for 1-day longevity tests, 1 5-node cluster for 1-week longevity
tests, and 1 5-node cluster for 1-month longevity
tests (tbachman, 21:46:29)
- other items: programmatic sharding and team
config (tbachman,
21:47:07)
- other items: updates to data change deliveries
(“me only”, “all listeners”) (tbachman,
21:49:25)
- other items: notifications (tbachman,
21:49:32)
- other items: finer grained sharding
(tbachman,
21:50:08)
- other items: data broker for clustered data
store (tbachman,
21:50:16)
- performance numbers for GBP Unified
Communications: 240 flows/second, 100k endpoints (tbachman, 21:54:23)
- performance numbers for GBP NFV: 10M
endpoints (tbachman,
21:57:36)
- ACTION: alagalah/tbachman to test use case numbers in data
store, and report memory usage (tbachman,
22:02:21)
- ACTION: clustering
group to ask community for performance characteristics they’re
looking for (tbachman,
22:03:06)
- maybe include reference configurations on the
wiki (tbachman,
22:05:49)
- group may schedule some hackathons for
clustering (tbachman,
22:06:21)
- other items: enhance RAFT implementation for
openflowplugin (tbachman,
22:14:03)
- ACTION: moiz and
tpantelis will create bugs for known issues (tbachman,
22:18:06)
- ACTION: jmedved to
look into hackathons (tbachman,
22:21:11)
- ACTION: alagalah to
help set up IRC channel (tbachman,
22:21:22)
- ACTION: alagalah to
work on setting up TWS call for clustering (tbachman,
22:21:59)
- ACTION: moiz to
update design on wiki (tbachman,
22:22:52)
- ACTION: jmedved to
contact phrobb to set up webex for meetings (tbachman,
22:23:54)
Meeting ended at 22:27:46 UTC
(full logs).
Action items
- alagalah/tbachman to test use case numbers in data store, and report memory usage
- clustering group to ask community for performance characteristics they’re looking for
- moiz and tpantelis will create bugs for known issues
- jmedved to look into hackathons
- alagalah to help set up IRC channel
- alagalah to work on setting up TWS call for clustering
- moiz to update design on wiki
- jmedved to contact phrobb to set up webex for meetings
Action items, by person
- alagalah
- alagalah/tbachman to test use case numbers in data store, and report memory usage
- alagalah to help set up IRC channel
- alagalah to work on setting up TWS call for clustering
- jmedved
- jmedved to look into hackathons
- jmedved to contact phrobb to set up webex for meetings
- tbachman
- alagalah/tbachman to test use case numbers in data store, and report memory usage
- UNASSIGNED
- clustering group to ask community for performance characteristics they’re looking for
- moiz and tpantelis will create bugs for known issues
- moiz to update design on wiki
People present (lines said)
- tbachman (210)
- regXboi (23)
- odl_meetbot (17)
- alagalah (7)
- gzhao (2)
- rovarga (2)
- harman_ (2)
- raghu67 (1)
- jmedved (0)
Generated by MeetBot 0.1.4.