15:00:47 <yamahata> #startmeeting neutron_northbound
15:00:47 <odl_meetbot> Meeting started Mon Jul 24 15:00:47 2017 UTC.  The chair is yamahata. Information about MeetBot at http://ci.openstack.org/meetbot.html.
15:00:47 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:47 <odl_meetbot> The meeting name has been set to 'neutron_northbound'
15:01:00 <yamahata> #chair mkolesni rajivk_
15:01:00 <odl_meetbot> Current chairs: mkolesni rajivk_ yamahata
15:01:06 <yamahata> #topic agenda bashing and roll call
15:01:10 <mkolesni> #info mkolesni
15:01:15 <yamahata> #info yamahata
15:01:22 <yamahata> #link https://wiki.opendaylight.org/view/NeutronNorthbound:Meetings
15:01:24 <rajivk_> #info rajivk
15:01:45 <yamahata> any topics in addition to breakage and usual patches/bugs?
15:02:23 <mkolesni> id like to discuss the ci
15:02:36 <mkolesni> not the u/t, the tempest
15:02:45 <yamahata> yeah, now tempest ci is not in good shape.
15:03:07 <mkolesni> well, were making it better but its a slow process
15:03:31 <yamahata> anything else?
15:04:15 <yamahata> ok move on
15:04:15 <mkolesni> FF is thursday
15:04:23 <mkolesni> we need to merge all non-bugs by then
15:04:34 <mkolesni> yamahata, are you cutting the branch?
15:04:49 <mkolesni> or is it not automatic since we're not in the independent release model
15:04:56 <yamahata> the neutron team will do it with one patch.
15:05:07 <yamahata> Pike-2 was done that way.
15:05:17 <yamahata> So we'll review such patch
15:05:33 <mkolesni> ok
15:05:37 <yamahata> #topic Announcements
15:05:43 <yamahata> pike-3 is this week.
15:05:44 <mkolesni> afaik its 27th
15:05:52 <mkolesni> so that leaves ~3 days
15:05:53 <yamahata> #info Feature freeze is thursday
15:06:12 <yamahata> any other announcement?
15:06:31 <mkolesni> do you know if you're going to ptg yet?
15:06:34 <mkolesni> or the summit?
15:06:39 <yamahata> Unfortunately not yet.
15:06:44 <mkolesni> ok
15:06:52 <mkolesni> i will ask again next week :)
15:07:07 <yamahata> #topic action items from last meeting
15:07:17 <yamahata> I suppose we don't have any. (except patch review)
15:07:21 <yamahata> #topic Pike/Nitrogen planning
15:07:27 <mkolesni> rajivk_'s patch is good to go but blocked by the ci breakage :/
15:07:46 <yamahata> So for Pike-3, feature patches need to be merged
15:07:59 <yamahata> #action everyone address ci breakage
15:08:20 <mkolesni> rajivk_ mentioned it earlier
15:08:31 <mkolesni> rajivk_, do you know the necessary fix for the u/t ci?
15:08:45 <rajivk_> i will put a patch.
15:09:09 <rajivk_> But i dont know why ci was passing after the ceilometer patch got merged.
15:09:11 <mkolesni> ok great i havent had time to look at it today so if you have the fix we'll review it
15:09:18 <rajivk_> Maybe my findings are not correct.
15:09:38 <mkolesni> well, post the patch and we'll see :)
15:09:49 <rajivk_> ok
15:09:54 <yamahata> yeah, we'll see the result.
15:10:16 <yamahata> So what are the remaining patches?
15:10:26 <yamahata> https://review.openstack.org/#/c/474851/
15:10:46 <yamahata> https://review.openstack.org/#/q/topic:bug/1683797
15:11:04 <yamahata> Oh mkolesni you uploaded a patch to make it neutron worker.
15:11:07 <yamahata> great
15:11:17 <mkolesni> yamahata, yes i think its a more elegant approach
15:11:31 <mkolesni> also it will allow configuring multiple workers if we have a need for it
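A minimal sketch of the worker approach being discussed, assuming neutron_lib's BaseWorker interface; JournalWorker and _sync_pending_rows are illustrative names, not the actual patch (486606).

```python
# Minimal sketch only, not the patch under review (486606).  Assumes
# neutron_lib's BaseWorker interface; names here are illustrative.
import threading

from neutron_lib import worker


class JournalWorker(worker.BaseWorker):
    """Run the journal sync loop as a regular neutron worker."""

    def __init__(self, sync_timeout=10, worker_process_count=1):
        super(JournalWorker, self).__init__(
            worker_process_count=worker_process_count)
        self._sync_timeout = sync_timeout
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        super(JournalWorker, self).start()
        self._thread = threading.Thread(target=self._loop)
        self._thread.daemon = True
        self._thread.start()

    def _loop(self):
        while not self._stop.is_set():
            self._sync_pending_rows()          # illustrative placeholder
            self._stop.wait(self._sync_timeout)

    def _sync_pending_rows(self):
        """Drain pending journal rows to ODL (placeholder)."""

    def stop(self):
        self._stop.set()

    def wait(self):
        if self._thread is not None:
            self._thread.join()

    def reset(self):
        self.stop()
        self.wait()
        self._stop.clear()
        self.start()
```

If the driver hands neutron more than one such worker, several journal threads could run side by side, which is presumably the flexibility mkolesni refers to.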
15:11:36 <yamahata> and dhcp patch
15:11:46 <yamahata> https://review.openstack.org/#/c/465735/
15:12:06 <yamahata> For dhcp port patch, it would need review.
15:12:20 <mkolesni> i will review it again tomorrow
15:12:27 <mkolesni> yamahata, if you agree with https://review.openstack.org/486606
15:12:43 <mkolesni> perhaps we can abandon all the other ones on the same bug
15:12:55 <yamahata> I haven't reviewed the patch yet. But that's what I'd like to cook up.
15:13:10 <yamahata> I think thread pooling still makes sense.
15:13:19 <yamahata> It's orthogonal to 486606.
15:13:20 <mkolesni> sure just saying there's lot of patches there now
15:13:37 <mkolesni> no problem with that though nobody addressed my comment there from PS5
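For context, a sketch of the thread-pooling idea being discussed, not patch 452647 itself; the demo rows and the no-op sync callable are placeholders.

```python
# Sketch of the thread-pooling idea, not patch 452647 itself; the demo rows
# and the no-op sync callable below are placeholders.
from concurrent import futures


def sync_rows_in_parallel(rows, sync_one, max_workers=4):
    """Sync independent journal rows concurrently; return the rows that failed."""
    failed = []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        future_to_row = {pool.submit(sync_one, row): row for row in rows}
        for future in futures.as_completed(future_to_row):
            try:
                future.result()
            except Exception:
                # Leave the row pending so a later pass retries it.
                failed.append(future_to_row[future])
    return failed


if __name__ == '__main__':
    # Toy demonstration with fake rows and a no-op sync callable.
    pending = ['row-%d' % i for i in range(8)]
    print(sync_rows_in_parallel(pending, lambda row: None))
```

Rows that depend on one another would presumably still have to be ordered before being submitted, which is why this can coexist with the single-worker change rather than replace it.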
15:13:49 <yamahata> prepopulate agentdb patches are floating around.
15:14:03 <yamahata> It's bug fix patch, though.
15:14:21 <yamahata> https://review.openstack.org/#/c/465735/ and https://review.openstack.org/#/c/484446/
15:14:26 <mkolesni> bug fixes aren't first priority so lets focus on the features first
15:14:34 <yamahata> Yeah.
15:14:42 <mkolesni> and if we have time left on the bug fixes
15:14:46 <yamahata> we have plenty of patches for Pike-3...
15:15:03 <mkolesni> rajivk_, yamahata please see my comment here https://review.openstack.org/#/c/452647/5/networking_odl/journal/journal.py
15:15:06 <yamahata> After Pike-3, we can address bug fixes
15:15:35 <mkolesni> of course then neutron stable team has to approve the backports to stable/pike right?
15:15:49 <yamahata> right.
15:16:01 <manjeets> hello
15:16:08 <mkolesni> btw thread pooling is also a feature so if you want it in we can focus on it too
15:16:23 <mkolesni> though i dont think its critical for Pike and could slip to Queens
15:17:23 <yamahata> https://review.openstack.org/#/q/project:openstack/networking-odl+status:open
15:17:34 <yamahata> we have many bug fix patches which have been floating around.
15:17:47 <yamahata> After Pike-3, let's wipe them out.
15:17:58 <mkolesni> theres some cleaning required there, some of them are obviously obsolete
15:18:39 <mkolesni> so to sum it up, for this week need to focus on:
15:18:48 <mkolesni> 1. https://review.openstack.org/474851 - done, needs to be merged
15:19:28 <mkolesni> 2. https://review.openstack.org/#/c/465735/
15:19:38 <mkolesni> 3.  https://review.openstack.org/#/c/452647
15:19:43 <mkolesni> anything else?
15:20:07 <yamahata> that's the priority.
15:20:19 <yamahata> I think three is already many.
15:20:22 <mkolesni> yes thats the rfes
15:20:34 <mkolesni> well first one is +2 by both of us
15:20:44 <mkolesni> its a technicality to merge it after the gate is fixed
15:21:23 <yamahata> good summary. let's move on
15:21:25 <yamahata> #topic patches/bugs
15:21:31 <yamahata> we've already discussed patches.
15:21:43 <yamahata> and we'll look into ci breakage.
15:21:48 <yamahata> #topic tempest CI
15:21:52 <yamahata> mkolesni: you're on stage
15:22:29 <mkolesni> right
15:22:48 <mkolesni> so as you know ive been investigating tempest ci breakage
15:23:02 <mkolesni> found some bugs here and there all fixed now
15:23:12 <mkolesni> but the status is still dire
15:23:36 <mkolesni> so lets discuss per job basis..
15:23:52 <mkolesni> first, gate-tempest-dsvm-networking-odl-boron-snapshot-v2driver which is our only voting job (and also gating)
15:24:07 <mkolesni> this job is very unstable
15:24:39 <yamahata> really unstable! It's with legacy netvirt. So I don't see much value to fix it.
15:24:41 <mkolesni> i believe the cause is some mess up in the set up of the DHCP so that somehow traffic slips across subnets on the DHCP nodes
15:24:48 <mkolesni> indeed
15:24:53 <mkolesni> but just to understand the cause
15:25:01 <yamahata> I suppose once we have carbon with new netvirt voting, we can retire boron job or disable unstable tests of boron.
15:25:13 <yamahata> Oh, great! what's that?
15:25:16 <mkolesni> so basically what you'll see when it fails is that VMs dont get an IP
15:25:32 <mkolesni> and on the VM boot log you see it got DHCP NAK
15:26:07 <mkolesni> and also you see it in the dhcp log where you see each request gets answered by the dnsmasq on that subnet (DHCP ACK)
15:26:20 <mkolesni> and also 2 other dnsmasq on other subnets (DHCP NAK)
15:26:33 <mkolesni> so basically this sucks but i didnt investigate further
15:26:50 <yamahata> are those dhcp agent on same network?
15:26:52 <mkolesni> because, as you said, its old netvirt so i doubt if anyone's going to fix it
15:27:01 <mkolesni> no theyre on different subnets
15:27:11 <mkolesni> but somehow they get the dhcp request as well
15:27:12 <yamahata> I mean, network, not subnet
15:27:32 <rajivk_> mkolesni, i also noticed disk write failure
15:27:34 <mkolesni> no i think theyre even on different tenants but im not sure
15:27:50 <yamahata> I see.
15:27:56 <mkolesni> rajivk_, yes there might be other failures im just describing what i saw most of the time
15:28:07 <mkolesni> anyway old netvirt, not interesting
15:28:22 <mkolesni> ok can i move on to next job?
15:28:29 <rajivk_> Is it a failure to acquire the lease again or just to get an IP the first time?
15:28:48 <mkolesni> rajivk_, it fails to get ip on vm boot
15:28:49 <rajivk_> I mean, do they fail after the machine reboots or in all the test cases?
15:29:00 <mkolesni> about 10 times or something and then gives up
15:29:16 <mkolesni> from what i saw every time ip is requested
15:29:39 <mkolesni> its consistent, all the same test fail each time because of this issue
15:29:49 <rajivk_> I checked one of the patch logs, it was requesting specific IPs but the server responded with NAK.
15:30:01 <rajivk_> anyway, we can leave as you said.
15:30:08 <mkolesni> yes lets continue
15:30:12 <mkolesni> next is gate-tempest-dsvm-networking-odl-carbon-snapshot-vpnservice-v1driver-nv
15:30:39 <mkolesni> so this one had a problem that the port status updater wasnt loaded at all causing random failures
15:30:44 <mkolesni> that got fixed
15:31:01 <mkolesni> i didnt continue too much on it since it's v1 driver
15:31:19 <mkolesni> but its rather unstable, though it's non voting so meh
15:31:42 <mkolesni> the only problem is that it stalls results until it times out but i guess we can live with that for now
15:32:07 <mkolesni> not sure how much value it provides so we can decide to drop it entirely once P-3 is out
15:32:24 <mkolesni> yamahata, whats the plan about V1, is it cut from the tree on Queens?
15:32:50 <yamahata> Maybe, if we can have v2driver voting, it makes sense to retire v1driver.
15:32:58 <manjeets> ++
15:33:04 <mkolesni> we can throw it out when we cut it out of the tree
15:33:24 <yamahata> Yeah. So far v2driver job isn't stable enough.
15:33:27 <mkolesni> RH has no interest in V1 driver so as far as we're concerned the sooner the better
15:33:36 <mkolesni> problem is no job is stable enough :)
15:33:52 <yamahata> so we had v1driver job for comparison to understand where the issue exists.
15:34:03 <yamahata> but right now they are both too unstable.
15:34:31 <mkolesni> we can send an email and see if theres any objection to throwing out the job
15:34:37 <yamahata> anyway we should focus on the v2driver job.
15:34:38 <mkolesni> if not we can remove the job at lease
15:34:44 <mkolesni> *at least
15:35:19 <yamahata> Once v2driver job is stable, it's okay to remove v1job.
15:36:10 <mkolesni> ok as you wish
15:36:34 <mkolesni> ok now the big guy gate-tempest-dsvm-networking-odl-carbon-snapshot-vpnservice-v2driver-nv
15:36:51 <mkolesni> so this one also had a bug that the provisioning block wasnt created
15:37:18 <mkolesni> so port status update failed to actually do anything and then nova would randomly timeout VMs
15:37:48 <mkolesni> depending on a race there so sometimes a VM would boot normally because the provisioning by dhcp was fast enough
15:37:52 <mkolesni> anyway that got fixed
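For context, a hedged illustration of neutron's provisioning-blocks flow that this fix relies on; ODL_ENTITY and the function names are illustrative, not the exact networking-odl code.

```python
# Hedged illustration of neutron's provisioning-blocks flow, not the exact
# networking-odl code; ODL_ENTITY and the function names are illustrative.
from neutron.db import provisioning_blocks
from neutron_lib.callbacks import resources

ODL_ENTITY = 'ODL'  # illustrative entity name


def add_odl_provisioning_block(context, port_id):
    # If this call is skipped (the bug described above), the later
    # provisioning_complete() has nothing to release, the port never goes
    # ACTIVE, and nova eventually times out the VM boot.
    provisioning_blocks.add_provisioning_component(
        context, port_id, resources.PORT, ODL_ENTITY)


def on_port_up_in_odl(context, port_id):
    # Called once ODL reports the port as up; releases our block so the
    # port can transition to ACTIVE when every other entity (e.g. DHCP)
    # has completed as well.
    provisioning_blocks.provisioning_complete(
        context, port_id, resources.PORT, ODL_ENTITY)
```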
15:38:12 <mkolesni> now the major issue im noticing with it is something i believe is a problem in ODL
15:38:20 <yamahata> now we're seeing that sometimes the carbon v2 job passes
15:38:36 <mkolesni> i sent an email about it to netvirt-dev, let me find it
15:38:46 <yamahata> It's plausible that the issues are in ODL side.
15:39:23 <mkolesni> #info https://lists.opendaylight.org/pipermail/netvirt-dev/2017-July/005062.html
15:39:27 <yamahata> there are many ERROR logs in karaf log.
15:39:38 <mkolesni> so to sum it all up from the email, the FIP is sometimes broken
15:40:03 <mkolesni> now again we're seeing a situation where either the tests are all green, or all tests requiring FIP fail
15:40:16 <mkolesni> or at least the same tests fail every time
15:40:36 <mkolesni> so this leads me to believe the problem happens when the public network gets created on ODL
15:40:54 <yamahata> All pass or all fail is interesting observation.
15:41:02 <mkolesni> problem is im not that strong on odl side so thats why i asked for assistance
15:41:16 <mkolesni> but nobody stepped up yet and the mail saw little interest
15:41:45 <mkolesni> if you guys have better netvirt knowledge you can take a look
15:41:47 <yamahata> maybe we would like to replicate it with nitrogen.
15:42:02 <mkolesni> if not im trying to get some help from our ODL team at RH
15:42:18 <yamahata> Cool.
15:42:25 <mkolesni> hmm nitrogen jobs are all broken because its not built by the integrated job yet
15:42:42 <mkolesni> so yamahata rajivk_ or manjeets do you guys have knowledge to debug this?
15:43:07 <manjeets> mkolesni, I guess they need to have nitrogen-snapshot available
15:43:09 <yamahata> of course, we have. The issue is their bandwidth.
15:43:35 <manjeets> it fails at getting nitrogen-snapshot
15:43:42 <yamahata> Anyway after Pike-3, I'll also look into it.
15:43:51 <mkolesni> anyways i believe that the new-netvirt job should be the voting one and the old-netvirt should be non voting, the opposite of what happens today
15:44:00 <yamahata> With nitrogen, karaf-distribution is not created yet.
15:44:18 <mkolesni> yes we can perhaps use only the netvirt karaf
15:44:18 <yamahata> karaf or netvirt image needs to be used.
15:44:47 <mkolesni> basically netvirt karaf probably has everything we need
15:44:50 <mkolesni> so we can try it
15:44:55 <yamahata> So far ODL community doesn't have ETA to create karaf-distribution image.
15:45:00 <mkolesni> i had some experimental patch to use netvirt karaf
15:45:03 <mkolesni> this seems plausible
15:45:23 <mkolesni> https://review.openstack.org/#/c/482453/
15:45:34 <mkolesni> but i didnt dig into the test failures too much
15:46:05 <mkolesni> regarding sfc im not sure
15:47:01 <mkolesni> i think it does have sfc but dont take my word for it
15:47:19 <mkolesni> is sfc even tested by tempest though?
15:47:41 <yamahata> I guess no. I guess no one seriously has tested sfc.
15:48:04 <mkolesni> so for gate maybe its enough to use the netvirt karaf
15:48:20 <yamahata> Probably unit test for sfc will be kept. tempest tests for sfc won't be enabled.
15:48:25 <mkolesni> i can rebase that patch if you want to see whats up
15:48:42 <yamahata> I'd love to see the result.
15:48:50 <mkolesni> luckily unit test doesnt care about what distribution we use :)
15:49:47 <yamahata> ODL nitrogen cycle is short. so we should know issue early.
15:50:36 <mkolesni> nitrogen would be targeted by queens though right?
15:50:57 <mkolesni> obviously we need to know asap but im asking regarding the "optimal versions"
15:51:22 <yamahata> In that sense, yeah queens + nitrogen, pike + carbon.
15:51:39 <mkolesni> btw with that experimental patch obviously old netvirt job fails cause it's not in the distribution even in boron :)
15:52:01 <yamahata> Also netvirt folks have started similar discussion.
15:52:24 <mkolesni> again its hard to debug cause if the gate times out then no logs are collected
15:52:47 <yamahata> Probably we'd like to disable some tests with floating ip so that we can have logs.
15:52:49 <mkolesni> also something thats been bothering me but i dont know how to solve is that these damn logs are in html
15:53:14 <yamahata> https://review.openstack.org/#/c/486177/
15:53:19 <mkolesni> and thats doubling their size making reading them tougher
15:53:22 <yamahata> there is something wrong with the patch.
15:53:54 <mkolesni> with what patch?
15:54:13 <yamahata> to disable some fip tests.
15:54:48 <mkolesni> ok i didnt see that, i have some other ones to reduce the load so that logs do get collected and thats what ive been using to debug the gate
15:54:48 <yamahata> we have 6mins left.
15:55:10 <manjeets> yamahata, mkolesni I switched grenade job to new netvirt, v2 driver, I still see it fails on floating IP access tests
15:55:18 <yamahata> do we have anything else?
15:55:42 <mkolesni> so just to make sure, if FIP fails you should see some errors about GARP in karaf.log
15:55:50 <mkolesni> manjeets, please check if thats the case ^
15:55:56 <mkolesni> if so its the same as in tempest
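A quick, hedged helper for that check; the matched text is an assumption since the exact karaf.log error message wasn't quoted here.

```python
# Quick helper to spot GARP-related errors in karaf.log; the matched text is
# an assumption since the exact error message wasn't quoted in the meeting.
import re
import sys


def find_garp_errors(path):
    pattern = re.compile(r'ERROR.*(garp|gratuitous arp)', re.IGNORECASE)
    hits = []
    with open(path) as log:
        for lineno, line in enumerate(log, 1):
            if pattern.search(line):
                hits.append((lineno, line.rstrip()))
    return hits


if __name__ == '__main__':
    log_path = sys.argv[1] if len(sys.argv) > 1 else 'karaf.log'
    for lineno, line in find_garp_errors(log_path):
        print('%6d: %s' % (lineno, line))
```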
15:56:28 <mkolesni> i got nothing else basically we should switch to new netvirt job asap and keep old job as non voting for reference
15:56:48 <mkolesni> problem right now they both seem to be failing about half of the time
15:56:54 <mkolesni> so its hard to say whats worse
15:56:59 <mkolesni> but its a true nightmare
15:57:17 <mkolesni> also on the gate queue it takes 3 hours till it fails :/
15:57:40 <manjeets> timeouts are very common these days
15:57:43 <mkolesni> maybe we should consider slimmer tempest on the gate itself
15:57:53 <mkolesni> and keep the heavy tests only on the check queue
15:58:20 <yamahata> parallel execution is one way, and neutron did that. but it's too early for us....
15:58:33 <yamahata> anyway anything else to discuss/complain about?
15:58:41 <mkolesni> hmm yeah thats a whole other discussion :)
15:58:50 <yamahata> #topic open mike
15:58:52 <mkolesni> no im done, stick a fork in me :)
15:59:19 <mkolesni> ok thanks guys
15:59:24 <yamahata> thank you everyone
15:59:25 <mkolesni> have a good day/night
15:59:28 <manjeets> thank you
15:59:31 <yamahata> #topic cookies
15:59:31 <mkolesni> bye :)
15:59:37 <yamahata> #endmeeting