============================================
#opendaylight-clustering: clustering_hackers
============================================


Meeting started by tbachman at 16:02:12 UTC.  The full logs are
available at
http://meetings.opendaylight.org/opendaylight-clustering/2015/clustering_hackers/opendaylight-clustering-clustering_hackers.2015-02-03-16.02.log.html
.


Meeting summary
---------------

* agenda  (tbachman, 16:02:20)
  * LINK:
    https://meetings.opendaylight.org/opendaylight-clustering/2015/_clustering_hackers_/opendaylight-clustering-_clustering_hackers_.2015-01-20-17.25.html
    Last recorded meeting minutes  (tbachman, 16:02:46)
  * LINK:
    https://cisco.webex.com/ciscosales/e.php?MTID=mbd68cebe2f65f57084d63329e5e49e26
    (moizer, 16:02:49)

* ongoing work  (tbachman, 16:12:21)
  * pantelis says that harmon has the heartbeat development — feels that
    we need to think about that some more  (tbachman, 16:12:49)
  * moizer asks what we need to think about with heartbeating
    (tbachman, 16:13:04)
  * pantelis says that send/append re-evaluates the follower’s state
    (tbachman, 16:13:24)
  * pantelis says that’s the piece that sends the next snapshot, too
    (tbachman, 16:13:36)
  * pantelis wonders why it wasn’t sent in the reply, and is wary of
    removing the send/append entries  (tbachman, 16:13:53)
  * pantelis says that for a replicate, we send it out to all the
    followers, who then persist the data, but based on the leader’s
    commit, they don’t necessarily apply the log to the state machine
    until the leader gets consensus back and then commits  (tbachman,
    16:14:34)
  * That’s what causes the last log entry to be applied to the state
    machine on the followers  (tbachman, 16:14:51)
  * moizer says that this is how the algorithm states it  (tbachman,
    16:15:00)
  * pantelis says it takes 2 append entries to get this to the data
    store  (tbachman, 16:15:13)
  * moizer says we don’t want to lose the heartbeat. Previously, we had
    a heartbeat timeout of 500 and an election timeout of double that
    (tbachman, 16:15:37)
  * moizer says that when you have followers who are lagging behind,
    then the 5 second heartbeat will hurt you  (tbachman, 16:15:53)
  * moizer has made this factor-configurable, and saw some research that
    used a 20x timeout for elections  (tbachman, 16:16:18)
  * moizer says the variance is a random interval for the election
    timeout. If your election timeout is 1 second, then this is between
    1 and 1.2 seconds.  (tbachman, 16:17:14)
  * This minimizes clashes, and minimizes the interval until the various
    candidates wake up  (tbachman, 16:17:59)
  * moizer says we can have an optimization for when the append-entries
    is received, rather than waiting for the heartbeat  (tbachman,
    16:19:38)
  * pantelis says we have to redo the snapshot chunking then  (tbachman,
    16:19:58)
  * moizer says that we can do the same thing — when you get the reply,
    you can do the snapshot chunk  (tbachman, 16:20:19)
  * pantelis says what if we have a separate heartbeat actor that just
    sends the heartbeat message with no data  (tbachman, 16:21:21)
  * moizer says the heartbeat message has the current term of the leader
    and the follower index  (tbachman, 16:21:37)
  * It needs to get the append-entries from the current leader
    (tbachman, 16:21:51)
  * pantelis says the only time we send a new index is on a replicate
    (tbachman, 16:24:41)
  * moizer says when there’s a new follower, there are a lot of entries
    that need to be replicated  (tbachman, 16:24:55)
  * when the heartbeat reply is received from a new follower, then we
    have to send the follower a bunch of updates  (tbachman, 16:25:22)
  * If there’s a replicate and the message gets lost, there is no way to
    send the same message  (tbachman, 16:26:30)
  * pantelis says when you have 2 followers and you replicate, one
    follower responds first, we don’t need to reply to the second
    follower to get consensus; if there’s a commit immediately after
    that, then the previous entry gets sent twice to the same follower,
    which is okay, but just not as efficient  (tbachman, 16:27:20)
  * moizer produced a very small simulator where you can run with a real
    controller, which can help to track down replication issues
    (tbachman, 16:33:43)
  * you can connect mininet to the controller, and it tries to replicate
    the data; the simulator provides the acknowledgement  (tbachman,
    16:34:05)
  * moizer will try to check this in sometime today  (tbachman,
    16:34:21)
  * Vamsi asks why we’re doing gerrit 14658  (tbachman, 16:35:12)
  * moizer says we’re doing this b/c timeouts happen  (tbachman,
    16:35:36)
  * Vamsi asks what the cause is for the loss of the heartbeat message
    (tbachman, 16:36:16)
  * moizer says he doesn’t know; network delays can cause it to arrive
    late, or it can happen due to some sort of partitioning  (tbachman,
    16:36:40)
  * pantelis says spurious re-elections have two causes: one is that the
    shard actor is busy (e.g. processing a very large pre-commit, with
    akka only processing one message at a time), so no heartbeat goes
    out, causing a follower to re-elect; the other is garbage collection
    and thread context switching latencies  (tbachman, 16:38:03)
  * pantelis says that in bigger clusters (7, 9, 11, etc. nodes), you’ll
    have a lot more traffic coming out  (tbachman, 16:40:37)
  * rovarga says we need to think of 100’s of shards as the default
    scale factor; if we’re looking at say 2k switches in a DC, the
    heartbeat chatter may be prohibitive  (tbachman, 16:41:38)
  * rovarga says we should default the Java garbage collector to g1gc
    for clustering  (tbachman, 16:42:01)
  * moizer says we produce a lot of garbage, and g1gc has a specific
    amount of time it spends on GC; it allows the heap to grow
    (tbachman, 16:42:52)
  * rovarga says this is a triangle; one is the heap size; the
    occurrence of GC; and the average time the GC takes. You have to
    move within that triangle  (tbachman, 16:43:26)
  * rovarga says that with the current config, you may run okay for a
    certain amount of time, but eventually hit a wall  (tbachman,
    16:43:47)
  * moizer says he’s worried that it doesn’t actually collect
    (tbachman, 16:43:56)
  * moizer says he’s observed that you run out of memory faster
    (tbachman, 16:44:06)
  * rovarga says he was running the in-memory data store and was running
    BGP with 1M routes, and it almost ran out of heap; the trace showed
    ~3.9GB of heap used in oldgen, and a 10 second pause where
    collections were happening like crazy, which took this down to
    ~.5GB  (tbachman, 16:45:33)
  * pantelis asks about the shard logging; why are we using the akka
    logger, as it doesn’t preserve the line number (i.e. it outputs the
    line number of the actor instead)  (tbachman, 16:47:31)
  * moizer says the logging adapter is used just to make sure it’s
    asynchronous  (tbachman, 16:48:01)
  * pantelis says that karaf does that anyway (pax-logging to an OSGi
    service)  (tbachman, 16:48:15)
  * pantelis says that when you do a log.error, and you have formatting
    arguments and you also want to print the exception, with slf4j you
    have to do a string format b/c if you pass in e as the last
    argument, it won’t format it correctly, which the akka logger will
    (tbachman, 16:50:56)
  * rovarga says if you don’t mention it in the string formatting, it
    will pick it up as an exception formatting  (tbachman, 16:51:19)
  * LOGGER.warn("Foo {}", obj, ex);  (rovarga, 16:51:49)
  * pantelis asks if it’s okay to migrate to slf4j logging  (tbachman,
    16:52:13)
  * moizer says there could be a lot of changes there  (tbachman,
    16:52:23)
  * pantelis says he’s willing to make these changes  (tbachman,
    16:52:32)
  * moizer says there’s a way to have the shard identifier as well; once
    we move to slf4j, we lose that ability as well  (tbachman,
    16:53:01)
  * ACTION: pantelis to create a bug to address logging  (tbachman,
    16:55:21)
  * moizer is pushing a patch today for backpressure for creation of a
    transaction  (tbachman, 16:56:16)
  * moizer has seen a problem with statistics collection, and in a
    multi-node cluster, this takes a long time for the commit to go
    through; the statistics manager continues to try pushing these
    through, and it eventually times out  (tbachman, 16:57:04)
  * rovarga says openflow doesn’t have a single writer per data tree —
    will be addressed in new openflow design  (tbachman, 16:58:00)
  * moizer says there are 2 cases requiring backpressure; the BGP case,
    where there’s a single transaction with multiple data; the other is
    stats manager where there is a new transaction per data  (tbachman,
    16:59:17)
  * moizer wants to be able to apply a transaction to a state without
    consensus for operational data  (tbachman, 17:03:54)
  * moizer asks for thoughts on doing that  (tbachman, 17:04:05)
  * pantelis asks if that breaks RAFT  (tbachman, 17:04:11)
  * moizer says that for operational data, we already said we break RAFT
    by not being persistent  (tbachman, 17:04:31)
  * pantelis says that the assumption is that operational data can be
    recalculated  (tbachman, 17:04:51)
  * moizer says as soon as the leader gets the commit, they instantly
    try to replicate the data  (tbachman, 17:05:06)
  * rovarga says he’s not familiar enough with RAFT to know for sure
    yet; sounds a bit scary, but asks what it is that the applications
    expect, and what is involved in reproducing the operational data
    (tbachman, 17:05:53)
  * rovarga says that some applications might see a failover rather than
    a graceful migration, in which case the applications might reproduce
    the data somehow  (tbachman, 17:06:23)
  * moizer says we can have another flag for this  (tbachman, 17:06:58)
  * pantelis asks if we turn it on by default  (tbachman, 17:07:04)
  * moizer says no, and we have to see how this works first  (tbachman,
    17:07:14)
  * moizer says that our data store is more of a strong consistency data
    store  (tbachman, 17:07:50)
  * the operational data store has things that change very rapidly —
    this makes for an eventual consistency model, which allows for
    better performance  (tbachman, 17:08:11)
  * pantelis asks if we should be putting Time Series Data in the data
    store  (tbachman, 17:08:33)
  * Vamsi is deprioritizing the 2-node cluster in favor of stabilizing
    basic clustering  (tbachman, 17:11:06)
  * moizer asks if HP is planning to submit any patches  (tbachman,
    17:11:15)
  * Vamsi says they are looking at the order that they will start
    contributing  (tbachman, 17:11:28)
  * moizer says the best thing they can do is to report issues they find
    in bugzilla, to ensure that we don’t duplicate the work  (tbachman,
    17:11:45)
  * pantelis says there’s another meeting on Thursday that Mark had set
    up  (tbachman, 17:12:00)
  * moizer says that Dell wants to continue, but HP doesn’t  (tbachman,
    17:12:09)
  * LINK: https://bugs.opendaylight.org/show_bug.cgi?id=2667  (tbachman,
    17:12:55)
  * LINK: https://bugs.opendaylight.org/show_bug.cgi?id=2667 bug
    reported by GBP  (tbachman, 17:13:35)
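
Editor's illustration of the election timeout discussion (16:15 to 16:17):
the minutes describe an election timeout derived from the heartbeat
interval by a configurable factor, plus a random variance so that a
1 second timeout lands between 1 and 1.2 seconds. The following is a
minimal sketch of that arithmetic only. The class and method names are
hypothetical and not taken from the OpenDaylight code base, and the
heartbeat value of 500 is assumed to be in milliseconds.

```java
import java.time.Duration;
import java.util.Random;

// Hypothetical helper: names and signatures are illustrative assumptions,
// not the controller's actual implementation.
final class ElectionTimeoutSketch {

    // Base election timeout = heartbeat interval * configurable factor.
    static Duration electionTimeout(Duration heartbeatInterval, int factor) {
        return heartbeatInterval.multipliedBy(factor);
    }

    // Adds a random variance in [0, base * varianceFraction] so that
    // followers are unlikely to become candidates at the same instant.
    static Duration withVariance(Duration base, double varianceFraction, Random rng) {
        int maxJitterMillis = (int) (base.toMillis() * varianceFraction);
        int jitterMillis = rng.nextInt(maxJitterMillis + 1); // inclusive upper bound
        return base.plusMillis(jitterMillis);
    }

    public static void main(String[] args) {
        // 500 ms heartbeat with factor 2 (units assumed to be milliseconds).
        Duration base = electionTimeout(Duration.ofMillis(500), 2);
        // 20% variance matches the "between 1 and 1.2 seconds" example.
        Duration randomized = withVariance(base, 0.2, new Random());
        System.out.println("base=" + base.toMillis() + "ms randomized="
                + randomized.toMillis() + "ms");
    }
}
```

With a factor of 20, as in the research moizer mentions, the same 500 ms
heartbeat would give a 10 second base timeout before the variance is
added.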


Meeting ended at 17:37:20 UTC.


People present (lines said)
---------------------------

* tbachman (97)
* moizer (5)
* odl_meetbot (4)
* rovarga (2)


Generated by `MeetBot`_ 0.1.4