15:59:59 #startmeeting kernel projects
15:59:59 Meeting started Tue Jul 17 15:59:59 2018 UTC. The chair is rgoulding. Information about MeetBot at http://ci.openstack.org/meetbot.html.
15:59:59 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:59:59 The meeting name has been set to 'kernel_projects'
16:00:03 #topic agenda bashing
16:06:26 #topic clustering status quo
16:07:05 #link https://jira.opendaylight.org/browse/MDSAL-362
16:07:17 #info vpicard saw another occurrence of this that was just slightly different
16:07:40 #info appears to be a deadlock, but slightly different than the original one
16:08:10 #action rovarga likely to look at this one by the end of the week (Friday target)
16:08:14 #info he has an idea why this is happening
16:08:25 #link https://jira.opendaylight.org/browse/CONTROLLER-1845
16:09:01 #info so far vpicard has not been able to reproduce this after he added a patch to do a thread dump when the netstat condition is met
16:09:13 #info likely to deprioritize this
16:09:19 #info since it is not reproducible
16:09:54 #info Jamo working on two bugs in genius/netvirt/openflowplugin
16:10:17 #info 1) unhealthy RESTCONF 401 unauthorized 2) cluster unhealthy
16:13:02 #info likely that the first issue is solved in master and stable/oxygen
16:13:03 #link https://jira.opendaylight.org/browse/CONTROLLER-1768
16:13:25 #info “Shard leaders failed to settle in 90 seconds, giving up”
16:13:29 #info happens intermittently
16:13:55 #info jamoluhrsen can reproduce pretty reliably
16:14:41 #info shague asks when do we actually think we will end up getting to tell-based? ever?
16:14:54 #info it appears we are focusing on ask-based right now
16:15:17 #info the long term intention is to go to tell-based because it promises more resiliency, but there aren’t enough cycles to get there in the short term
16:16:58 #info jamoluhrsen asks what happens when we have Java transaction timeouts? it's up to the application to do the retries
16:17:09 #info so there are timing and race conditions that will need to be fixed
16:17:45 #info faseela asks whether you should retry a transaction when DS is unavailable (some uncertainty whether it is AskTimeoutException or DatabaseUnavailableException)
16:18:19 #info tpantelis says we may want to push towards enabling tell-based over fixing applications
16:18:44 #info rovarga brings up that no one knows what happens during ATE due to the fact that it happens in 3PC and there is inconsistency in what happens
16:18:54 #info the state of the transaction is unknown (may be committed, may not)
16:19:09 #info the only way to figure this out on App side is to do a Read and then start resyncing
16:19:18 #info that is quite a bit of work that probably shouldn’t be done from the application side
16:19:53 #info rovarga states we should put forth the effort to switch to tell-based protocol where these problems aren’t an issue (or as much of an issue)
16:21:47 #info should we lower the timeout from 30s?
16:21:54 #info rovarga says this can easily happen during GC
16:22:07 #info so be careful around making assumptions about this since a major collection with a huge heap can take minutes
16:22:58 #info during AskTimeoutException the comm between the backend and frontend is broken
16:27:26 #info depends on application
16:27:40 #info rovarga brings up the fact that the data is replicated in the peer and can be recovered from that third party
16:27:54 #info since it can converge in a couple of seconds, then the recovery is in ~1 minute without a ton of retry logic
16:28:54 #info mapping uint64 to BigInteger
16:29:03 #info asked on mailing list
16:29:18 #info anytime there is logging, counter, stats, the conversion toString() is expensive
16:29:30 #info is there a more fixed data type (yang) that they can use for this?
16:30:15 #info the recommendation is to minimize conversion when possible
16:31:02 #info and using a separate appender for logging possibly
16:31:29 #info there are a slew of types long term that will come post-Neon that will require breaking binding-spec return types
16:31:48 #info either that or incur the cost in the binding adapter
16:31:55 #info but then everyone will pay the conversion cost
16:32:01 #info begs a two-step approach
16:32:08 #info BigInteger is hard to convert to/from
16:33:02 #link https://lists.opendaylight.org/pipermail/yangtools-dev/2018-July/002264.html
16:33:53 #info to adopt this and not pay performance price, then we will have to break everyone (will require planning)
16:33:59 #info it is a hard trade-off
16:34:24 #info this may be easier to do when md-sal is MRI
16:38:37 #topic modular models
16:38:54 #link https://git.opendaylight.org/gerrit/#/q/topic:modular-models+(status:open+OR+status:merged)
16:39:31 #info instead of a single odl-mdsal-models feature (which includes 20-25 models) there are now more granular features so you can request more specific models
16:39:51 #info the idea is to kill the meta-feature afterwards to help improve CSIT times
16:40:17 #topic odlparent 3.1.3
16:40:23 #info Oxygen is on 3.1.1
16:40:31 #info yangtools 2.0.5
16:41:15 #info need to roll out 2.0.7 or 2.0.8 in oxygen
16:41:23 #info some models in downstreams will need to be fixed up using cherry-picks
16:41:42 #info in order to do this we also need to adopt odlparent 3.1.2 or 3.1.3 for upgraded guava dependencies
16:41:52 #info skitt points out also to utilize consistent versions in our releases
16:42:43 #info skitt says release notes for 3.1.3 are ready and he is running a multi-patch build
16:42:52 #info he started 4 hours ago and still hasn’t been queued yet
16:43:00 #info includes munging of xtend plugin
16:43:06 #undo
16:43:06 Removing item from minutes:
16:44:09 #info skitt there will be a bunch of project specific patches to adapt to 3.1.3
16:45:22 #topic odlparent 4.0.0 timeline
16:45:28 #info mid-August
16:45:55 #info if there is stuff you want, then get it in!
16:46:08 #info it is going to include a karaf upgrade, so we will need all the runway we can get
16:46:25 #topic SyncStatus stays false for more than 5 minutes after bringing 2 of 3 nodes down and back up.
16:46:33 #link https://jira.opendaylight.org/browse/CONTROLLER-1768
16:46:45 #info this is happening with just one node too, according to jamoluhrsen
16:48:56 #info 401 was happening before we were doing the datastore read for AuthZ
16:49:02 #info and jolokia one fixed now
16:49:11 #info there should no longer be 401s
16:50:53 #info luis is continuing to see this as of 29 minutes ago
16:51:08 #info was this tried with the seed-node-timeout of 30s?
16:52:33 #info tpantelis is saying that CONTROLLER-1849 401 exception may have been due to CONTROLLER-1768 and we may see more now
16:53:01 #info so let's forget 401 and focus on sync status staying false
16:53:15 #info actually, we still see 401
16:57:08 #info two questions 1) is node rejoining and has rejoined 2) did the CDTCL come alive
16:57:58 #link https://github.com/opendaylight/aaa/blob/7e7cd43a637a5b01510b0af9cac770b06d380d82/aaa-shiro/impl/src/main/resources/initial/aaa-app-config.xml#L313
16:59:36 #action tpantelis push patch to get rid of dynamicAuthorization
17:00:23 #info will unmask the issue
17:02:18 #endmeeting
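
Note on the uint64/BigInteger item (16:28-16:34): the sketch below illustrates the "minimize conversion when possible" recommendation using only the JDK. It is not the yangtools or md-sal binding API; the class name Uint64Sketch and its helper methods are hypothetical names chosen for this example. It assumes the uint64 value can be carried as the raw 64 bits of a Java long, formatted for logs with Long.toUnsignedString, and widened to BigInteger only at a boundary that requires it.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of yangtools or md-sal: carries a uint64
// as the raw bits of a long and converts to BigInteger only on demand.
final class Uint64Sketch {
    private static final BigInteger TWO_POW_64 = BigInteger.ONE.shiftLeft(64);

    private Uint64Sketch() {
    }

    // Formats the bits as an unsigned decimal string (for logs, counters,
    // stats) without allocating a BigInteger.
    static String toLogString(long bits) {
        return Long.toUnsignedString(bits);
    }

    // Widens to BigInteger; negative longs represent values >= 2^63,
    // so 2^64 is added to recover the unsigned value.
    static BigInteger toBigInteger(long bits) {
        BigInteger signed = BigInteger.valueOf(bits);
        return bits >= 0 ? signed : signed.add(TWO_POW_64);
    }

    // Narrows a BigInteger in the range [0, 2^64) back to raw long bits.
    static long fromBigInteger(BigInteger value) {
        if (value.signum() < 0 || value.bitLength() > 64) {
            throw new IllegalArgumentException("not a uint64: " + value);
        }
        return value.longValue(); // keeps the low-order 64 bits
    }

    public static void main(String[] args) {
        long counter = -1L; // all bits set, i.e. the largest uint64
        System.out.println(toLogString(counter));
        System.out.println(toBigInteger(counter));
        // Unsigned comparison works directly on the raw bits.
        System.out.println(Long.compareUnsigned(counter, 1L) > 0);
    }
}
```

With this shape the per-call cost on hot paths such as logging is a string format of a primitive, and the BigInteger allocation is paid only by callers that need it. Whether that fits the binding-spec constraints discussed above is an open question for the mailing list thread.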