#opendaylight-meeting: kernel projects
Meeting started by rgoulding at 15:59:59 UTC
(full logs).
Meeting summary
- agenda bashing (rgoulding, 16:00:03)
- clustering status quo (rgoulding, 16:06:26)
- https://jira.opendaylight.org/browse/MDSAL-362
(rgoulding,
16:07:05)
- vpicard saw another occurrence of this that was
just slightly different (rgoulding,
16:07:17)
- appears to be a deadlock, but slightly
different than the original one (rgoulding,
16:07:40)
- ACTION: rovarga
likely to look at this one by the end of the week (Friday
target) (rgoulding,
16:08:10)
- he has an idea why this is happening
(rgoulding,
16:08:14)
- https://jira.opendaylight.org/browse/CONTROLLER-1845
(rgoulding,
16:08:25)
- so far vpicard has not been able to reproduce
this after he added a patch to take a thread dump when the netstat condition
is met (rgoulding,
16:09:01)
- likely to deprioritize this (rgoulding,
16:09:13)
- since it is not reproducible (rgoulding,
16:09:19)
- Jamo working on two bugs in
genius/netvirt/openflowplugin (rgoulding,
16:09:54)
- 1) unhealthy RESTCONF 401 unauthorized 2)
cluster unhealthy (rgoulding,
16:10:17)
- likely that the first issue is solved in master
and stable/oxygen (rgoulding,
16:13:02)
- https://jira.opendaylight.org/browse/CONTROLLER-1768
(rgoulding,
16:13:03)
- “Shard leaders failed to settle in 90 seconds,
giving up” (rgoulding,
16:13:25)
- happens intermittently (rgoulding,
16:13:29)
- jamoluhrsen can reproduce pretty
reliably (rgoulding,
16:13:55)
- shague asks when do we actually think we will
end up getting to tell-based? ever? (rgoulding,
16:14:41)
- it appears we are focusing on ask-based right
now (rgoulding,
16:14:54)
- the long term intention is to go to that
because it promises more resiliency, but there aren’t enough cycles
to get there in the short term (rgoulding,
16:15:17)
- jamoluhrsen asks what happens when we have Java
transaction timeouts? it's up to the application to do the
retries (rgoulding,
16:16:58)
- so there are timing and race conditions that
will need to be fixed (rgoulding,
16:17:09)
- faseela asks whether you should retry a
transaction when DS is unavailable (some uncertainty whether it is
AskTimeoutException or DatabaseUnavailableException) (rgoulding,
16:17:45)
- tpantelis says we may want to push towards
enabling tell based over fixing applications (rgoulding,
16:18:19)
- rovarga brings up that no one knows what
happens during an AskTimeoutException (ATE) because it happens in 3PC and there is
inconsistency in what happens (rgoulding,
16:18:44)
- the state of the transaction is unknown (may be
committed, may not) (rgoulding,
16:18:54)
- the only way to figure this out on the app side is
to do a Read and then start resyncing (rgoulding,
16:19:09)
- that is quite a bit of work that probably
shouldn’t be done from the application side (rgoulding,
16:19:18)
- rovarga states we should put forth the effort
to switch to tell-based protocol where these problems aren’t an
issue (or as much of an issue) (rgoulding,
16:19:53)
- should we lower the timeout from 30s?
(rgoulding,
16:21:47)
- rovarga says this can easily happen during
GC (rgoulding,
16:21:54)
- so be careful around making assumptions about
this since a major collection with a huge heap can take
minutes (rgoulding,
16:22:07)
- during AskTimeoutException the comm between the
backend and frontend is broken (rgoulding,
16:22:58)
- depends on application (rgoulding,
16:27:26)
- rovarga brings up the fact that the data is
replicated in the peer and can be recovered from that third
party (rgoulding,
16:27:40)
- since it can converge in a couple of seconds,
then the recovery is in ~1 minute without a ton of retry
logic (rgoulding,
16:27:54)
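The "do a Read and then start resyncing" point above can be illustrated with a minimal sketch. The `Store` interface, the `Outcome` enum, and `flakyStore` are hypothetical stand-ins made up for this example; they are not the MD-SAL or controller APIs, and a plain `TimeoutException` stands in for AskTimeoutException. The sketch only shows why the application cannot know the commit state without reading back, which is the extra work the discussion suggests should not live in applications.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

public class UnknownOutcomeSketch {
    /** Hypothetical stand-in for a datastore; NOT the MD-SAL API. */
    interface Store {
        void write(String key, String value) throws TimeoutException;
        Optional<String> read(String key);
    }

    /** What the application can conclude about a write. */
    enum Outcome { COMMITTED, COMMITTED_AFTER_TIMEOUT, NOT_COMMITTED }

    /** Write, and on timeout read back to learn whether the commit went through. */
    static Outcome writeAndVerify(Store store, String key, String value) {
        try {
            store.write(key, value);
            return Outcome.COMMITTED;
        } catch (TimeoutException e) {
            // The state is unknown here: the write may or may not have been applied.
            boolean present = store.read(key).map(value::equals).orElse(false);
            return present ? Outcome.COMMITTED_AFTER_TIMEOUT : Outcome.NOT_COMMITTED;
        }
    }

    /** Fake store that times out on every write, optionally applying it first. */
    static Store flakyStore(boolean applyBeforeTimeout) {
        Map<String, String> data = new ConcurrentHashMap<>();
        return new Store() {
            @Override
            public void write(String key, String value) throws TimeoutException {
                if (applyBeforeTimeout) {
                    data.put(key, value);
                }
                throw new TimeoutException("simulated ask timeout");
            }

            @Override
            public Optional<String> read(String key) {
                return Optional.ofNullable(data.get(key));
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(writeAndVerify(flakyStore(true), "k", "v"));
        System.out.println(writeAndVerify(flakyStore(false), "k", "v"));
    }
}
```

In this toy setup both writes fail with a timeout, yet only one of them left data behind; a caller that blindly retried would not be able to tell the two cases apart.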
- mapping uint64 to BigInteger (rgoulding,
16:28:54)
- asked on mailing list (rgoulding,
16:29:03)
- anytime there is logging, counter, stats, the
conversion toString() is expensive (rgoulding,
16:29:18)
- is there a more fixed data type (yang) that
they can use for this? (rgoulding,
16:29:30)
- the recommendation is to minimize conversion
when possible (rgoulding,
16:30:15)
- and using a separate appender for logging
possibly (rgoulding,
16:31:02)
- there are a slew of types long term that will
come post-Neon that will require breaking binding-spec return
types (rgoulding,
16:31:29)
- either that or incur the cost in the binding
adapter (rgoulding,
16:31:48)
- but then everyone will pay the conversion
cost (rgoulding,
16:31:55)
- calls for a two-step approach (rgoulding,
16:32:01)
- BigInteger is hard to convert to/from
(rgoulding,
16:32:08)
- https://lists.opendaylight.org/pipermail/yangtools-dev/2018-July/002264.html
(rgoulding,
16:33:02)
- to adopt this and not pay performance price,
then we will have to break everyone (will require planning)
(rgoulding,
16:33:53)
- it is a hard trade-off (rgoulding,
16:33:59)
- this may be easier to do when md-sal is
MRI (rgoulding,
16:34:24)
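As a rough illustration of the conversion cost discussed above, the sketch below carries a uint64 as a primitive `long` bit pattern and uses the JDK's unsigned helpers only when a printable value is needed. The class and method names are made up for this example; this is not what yangtools or the binding spec does, just one way an application could limit BigInteger round trips under the assumption that it controls its own counters.

```java
import java.math.BigInteger;

public class Uint64Sketch {
    private static final BigInteger TWO_POW_64 = BigInteger.ONE.shiftLeft(64);

    /** Reinterpret a uint64 held in a BigInteger as a 64-bit pattern in a long. */
    static long toUnsignedBits(BigInteger value) {
        if (value.signum() < 0 || value.bitLength() > 64) {
            throw new IllegalArgumentException("not a uint64: " + value);
        }
        // longValue() keeps the low-order 64 bits, which is the full uint64 value.
        return value.longValue();
    }

    /** Convert a 64-bit pattern back to a non-negative BigInteger. */
    static BigInteger toBigInteger(long bits) {
        BigInteger signed = BigInteger.valueOf(bits);
        return bits >= 0 ? signed : signed.add(TWO_POW_64);
    }

    public static void main(String[] args) {
        BigInteger max = new BigInteger("18446744073709551615"); // 2^64 - 1
        long bits = toUnsignedBits(max);
        // Counters can be incremented and compared as longs; format only on demand.
        System.out.println(Long.toUnsignedString(bits));
        System.out.println(Long.compareUnsigned(bits, 1L) > 0);
        System.out.println(toBigInteger(bits).equals(max));
    }
}
```

The trade-off is the one raised in the meeting: someone still pays for the conversion at the boundary where BigInteger is required, so this only helps code that can stay in the primitive form most of the time.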
- modular models (rgoulding, 16:38:37)
- https://git.opendaylight.org/gerrit/#/q/topic:modular-models+(status:open+OR+status:merged)
(rovarga,
16:38:54)
- instead of only odl-mdsal-models (which includes
20-25 models) there are now more granular features so you can request more
specific models (rgoulding,
16:39:31)
- the idea is to kill the meta-feature afterwards
to help improve CSIT times (rgoulding,
16:39:51)
- odlparent 3.1.3 (rgoulding, 16:40:17)
- Oxygen is on 3.1.1 (rgoulding,
16:40:23)
- yangtools 2.0.5 (rgoulding,
16:40:31)
- need to roll out 2.0.7 or 2.0.8 in
oxygen (rgoulding,
16:41:15)
- some models in downstreams will need to be
fixed up using cherry-picks (rgoulding,
16:41:23)
- in order to do this we also need to adopt
odlparent 3.1.2 or 3.1.3 for upgraded guava dependencies
(rgoulding,
16:41:42)
- skitt points out also to utilize consistent
versions in our releases (rgoulding,
16:41:52)
- skitt says release notes for 3.1.3 are ready
and he is running a multi-patch build (rgoulding,
16:42:43)
- he started 4 hours ago and still hasn’t been
queued yet (rgoulding,
16:42:52)
- skitt: there will be a bunch of project-specific
patches to adapt to 3.1.3 (rgoulding,
16:44:09)
- odlparent 4.0.0 timeline (rgoulding, 16:45:22)
- mid-august (rgoulding,
16:45:28)
- if there is stuff you want, then get it
in! (rgoulding,
16:45:55)
- it is going to include a karaf upgrade, so we
will need all the runway we can get (rgoulding,
16:46:08)
- SyncStatus stays false for more than 5 minutes after bringing 2 of 3 nodes down and back up. (rgoulding, 16:46:25)
- https://jira.opendaylight.org/browse/CONTROLLER-1768
(rgoulding,
16:46:33)
- this is happening with just one node too,
according to jamoluhrsen (rgoulding,
16:46:45)
- 401 was happening before we were doing the
datastore read for AuthZ (rgoulding,
16:48:56)
- and the jolokia one is fixed now (rgoulding,
16:49:02)
- there should no longer be 401s (rgoulding,
16:49:11)
- luis is continuing to see this as of 29
minutes ago (rgoulding,
16:50:53)
- was this tried with the seed-node-timeout of
30s? (rgoulding,
16:51:08)
- tpantelis is saying that CONTROLLER-1849 401
exception may have been due to CONTROLLER-1768 and we may see more
now (rgoulding,
16:52:33)
- so let's forget 401 and focus on sync status
staying false (rgoulding,
16:53:01)
- actually, we still see 401 (rgoulding,
16:53:15)
- two questions 1) is the node rejoining, and has it
rejoined 2) did the CDTCL come alive (rgoulding,
16:57:08)
- https://github.com/opendaylight/aaa/blob/7e7cd43a637a5b01510b0af9cac770b06d380d82/aaa-shiro/impl/src/main/resources/initial/aaa-app-config.xml#L313
(rgoulding,
16:57:58)
- ACTION: tpantelis
push patch to get rid of dynamicAuthorization (rgoulding,
16:59:36)
- will unmask the issue (rgoulding,
17:00:23)
Meeting ended at 17:02:18 UTC
(full logs).
Action items
- rovarga likely to look at this one by the end of the week (Friday target)
- tpantelis push patch to get rid of dynamicAuthorization
Action items, by person
- rovarga
- rovarga likely to look at this one by the end of the week (Friday target)
People present (lines said)
- rgoulding (88)
- odl_meetbot (4)
- rovarga (1)
Generated by MeetBot 0.1.4.