#opendaylight-netvirt log

16:01:56 <shague> #startmeeting NetVirt Weekly 02/13/18
16:01:56 <odl_meetbot> Meeting started Tue Feb 13 16:01:56 2018 UTC.  The chair is shague. Information about MeetBot at http://ci.openstack.org/meetbot.html.
16:01:56 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:56 <odl_meetbot> The meeting name has been set to 'netvirt_weekly_02_13_18'
16:02:08 <shague> #topic Roll call and agenda bashing (please #info <your-nick>)
16:02:10 <vpickard> #info vpickard
16:02:25 <shague> #topic Review existing action items
16:02:39 <shague> #link https://meetings.opendaylight.org/opendaylight-netvirt/2018/netvirt_weekly_02_06_18/opendaylight-netvirt-netvirt_weekly_02_06_18.2018-02-06-16.00.html
16:02:49 <jhershbe> #info jhershbe
16:03:27 <shague> #chair vpickard shague jhershbe
16:03:27 <odl_meetbot> Current chairs: jhershbe shague vpickard
16:03:54 <shague> #topic [DONE] vpickard to send jira for compute node reboot missing host config
16:03:57 <aswin__> #info Aswin S
16:04:10 <shague> #info vivekanandan looking into networking-l2gw plugin.sh issue
16:04:17 <shague> #topic vivekanandan looking into networking-l2gw plugin.sh issue
16:04:31 <shague> #info vpickard pushed a patch to fix
16:04:49 <vpickard> #link https://review.openstack.org/#/c/542205/
16:05:01 <Hanamantagoud> #info Hanamantagoud
16:05:22 <vorburger> #info vorburger
16:05:26 <shague> #info this is a fix for queens
16:06:09 <shague> #info upstream csit still has issues for queens
16:06:18 <shague> #topic daya to follow up on next round of patches. action item pending from last week router chaining specs: [spec](https://git.opendaylight.org/gerrit/#/c/65948/)
16:07:10 <shague> #info aswin__ has comments on spec
16:07:21 <shague> #info daya still has concerns
16:07:44 <shague> #topic https://trello.com/c/SCmPOAY6/18-carbon-release-planning - looking good for netvirt - sr3 blocked on an ofp bug (which was caused by fixing a netvirt bug) [tracking sheet](https://docs.google.com/spreadsheets/d/1VcB12FBiFV4GAEHZSspHBNxKI_9XugJp-6Qbbw20Omk/edit#gid=40307633)
16:08:24 <shague> #topic https://trello.com/c/iD2fOfF1/16-nitrogen-release-planning - branch is locked for sr2 build - looks good
16:09:30 <shague> #topic https://trello.com/c/BTurOwXh/42-oxygen-release-planning - 2/14/18 RC0 - Tomorrow. Looking tight.
16:10:10 <shague> #info dualstack patches - Valentina has last two of the internet series ready, with bugs filed. How do we want to proceed?
16:12:21 <shague> #info two internet patches are about ready to merge
16:12:44 <shague> #info 5 or so dualstack patches are left - we can wait on those patches
16:16:12 <shague> #info next patches are the upstream fixes, hanamanant's  then acthuth's
16:18:50 <shague> #info smashekar's patches are also ready. aswin has reviewed. just need gates
16:27:23 <shague> #info l2gw patches next
16:27:37 <vpickard> #action vpickard to get l2gw csit gates running on outstanding l2gw patches
16:29:17 <shague> #topic genius auto-tz
16:29:25 <shague> #info downstream looks good
16:29:38 <shague> #info can merge the default to genius auto-tz
16:30:24 <shague> #topic upgradability
16:30:57 <shague> #info jhershbe asks why some vpn objects are not recreated in mdsal
16:40:28 <shague> #topic router chaining spec https://git.opendaylight.org/gerrit/#/c/65948/
16:41:43 <sridharg> #info Sridhar Gaddam
16:42:02 <shague> #info concern about installing higher priority prefix routes
16:42:45 <shague> #info how can those flows coexist with the other existing flows
16:43:34 <shague> #info are there any implications if policies are applied to routers, like firewall
19:24:45 <jamoluhrsen> vpickard: you want this guy in? https://git.opendaylight.org/gerrit/c/68258/1/csit/suites/l2gw/01_Configure_verify_l2gateway.robot
19:25:25 <vpickard> jamoluhrsen: running a job now, lets see how that goes first, will -1 for now and +1 when job passes. Thanks
19:25:34 <jamoluhrsen> vpickard: 10-4
19:25:54 <jamoluhrsen> vpickard: I'm reviewing your other one too. I'll let you +1/-1 that one the same ok?
19:26:00 <jamoluhrsen> https://git.opendaylight.org/gerrit/c/67173
19:26:03 <vpickard> jamoluhrsen: yes, thanks
19:47:34 <vpickard> jamoluhrsen: https://git.opendaylight.org/gerrit/#/c/68258/ is good to go
19:50:21 <jamoluhrsen> vpickard: merged
20:13:51 <vpickard> jamoluhrsen: thanks
17:51:02 <vpickard> jamoluhrsen: https://git.opendaylight.org/gerrit/#/c/67173/ is ready to merge when you get a chance
17:51:36 <vpickard> jamoluhrsen: also, had to make one little tweak to get the openstack branch check right, in the last patch if you stashed that in some wiki/notes
18:01:57 <jamoluhrsen> vpickard: will look shortly. tsc mtg now
18:02:16 <vpickard> jamoluhrsen: 10-4
18:59:58 <jamoluhrsen> vpickard: seen this before?  https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-pike-upstream-stateful-carbon/68/compute_1/stack.log.gz
19:00:21 <vpickard> jamoluhrsen: looking
19:01:52 <vpickard> jamoluhrsen: no, this looks new....
19:02:06 <vpickard> jamoluhrsen: oh wait...
19:02:18 <vpickard> 2018-02-16 00:56:53.693 | Failed to discover available identity versions when contacting http://10.30.170.113/identity. Attempting to parse version from URL.
19:03:16 <vpickard> yes, seems i did recently, reran the job again and didnt see the issue
19:03:28 <jamoluhrsen> vpickard: the carbon SR3 candidate failed to stack on 4 hwvtep jobs.
19:03:58 <vpickard> let me look, it may be the networking-l2gw plugin stuff. hang on
19:04:04 <vpickard> jamoluhrsen: ^^
19:04:07 <jamoluhrsen> vpickard: thanks man
19:08:43 <vpickard> jamoluhrsen: it is not the networking-l2gw plugin issue that I thought might be an issue, was something i fixed on queens, but the control node stacked fine.
19:10:08 <jamoluhrsen> vpickard: hmmm....
19:10:21 <jamoluhrsen> vpickard: we expect carbon to be fine right?
19:10:27 <vpickard> jamoluhrsen: yeah, for sure
19:10:36 <vpickard> jamoluhrsen: you ran 4 jobs, on sandbox?
19:11:06 <vpickard> jamoluhrsen: job 67 is blue, ran yesterday
19:11:36 <jamoluhrsen> vpickard: no this is releng, and this is how we are vetting carbon SR3 is ready to go. so we have to 'splain the failures
19:11:54 <jamoluhrsen> vpickard: I am rerunning one job now. if it stacks and runs robot, I'll re-run the others.
19:12:04 <vpickard> jamoluhrsen: ok
19:12:14 <jamoluhrsen> vpickard: but if it also fails to stack we'll have to figure out WTH is going on
19:13:10 <vpickard> jamoluhrsen: did netvirt stack ok with SR3 candidate?
19:13:26 <jamoluhrsen> vpickard: yeah.
19:13:39 <vpickard> jamoluhrsen: is hwvtep job the only job that failed to stack like this?
19:14:01 <jamoluhrsen> vpickard: yeah. so far as I can tell
19:14:28 <vpickard> jamoluhrsen: ok, the other thing different in the jobs is that hwvtep does not have the performance vms like netvirt
19:14:38 <vpickard> jamoluhrsen: i have an open patch to switch over to those
19:14:46 <vpickard> jamoluhrsen: that might be part of it
19:15:10 <jamoluhrsen> vpickard: link? what do you mean "switch over to those"?
19:15:19 <vpickard> jamoluhrsen: or, at least, that is a difference between the job configurations
19:16:38 <vpickard> jamoluhrsen: ok, that patch was merged that I was referring to about the vm types for the job... https://git.opendaylight.org/gerrit/#/c/68310/
19:17:13 <jamoluhrsen> vpickard: ah. I remember that patch.
19:17:15 <vpickard> jamoluhrsen: which went in yesterday, looks like
19:17:21 <jamoluhrsen> vpickard: that affected carbon maybe?
19:17:50 <vpickard> jamoluhrsen: I dont think so, the only real change was to switch the vm type. Rest was cosmetic cleanup
19:18:20 <vpickard> jamoluhrsen: netvirt has these same vms in carbon, right? Thats where I got the changes from
19:18:28 <vpickard> netvirt yaml job
19:18:37 <jamoluhrsen> vpickard: double checking.
19:19:17 <shague> jamoluhrsen: vpickard: that job failed to stack because of rabbitmq
19:19:26 <shague> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-pike-upstream-stateful-carbon/68/compute_1/n-cpu.log.2018-02-16-005045.gz
19:19:47 <shague> notice the exception in the beginning, once that happens nova-compute is dead
19:20:06 <shague> then in the stack.sh you see it is trying to find the nova-compute - but it is dead so it never finds it
19:20:07 <vpickard> shague: thanks shague
19:21:04 <vpickard> shague: so, what if anything to do about this?
19:21:07 <jamoluhrsen> vpickard: shague: maybe specifying the vm flavor is the culprit?
19:21:25 <jamoluhrsen> vpickard: that's the only real change right?
19:21:57 <vpickard> jamoluhrsen: shague: yeah, but I thought I had seen this before in one of my recent jobs, just reran the job, let me see if I can find that in sandbox, if it was this week
19:22:00 <shague> what patch is this in or what other changes to the job?
19:22:15 <vpickard> jamoluhrsen: shague: https://git.opendaylight.org/gerrit/#/c/68310/
19:24:37 <vpickard> jamoluhrsen: shague: nope, all my jobs in sandbox from this week are oxygen
19:25:28 <vpickard> shague: do you think changing the type of vm in the job would cause this issue? These are the same as netvirt vms
19:25:34 <vpickard> I dont see how that could be it
19:26:42 <shague> yeah, that shouldn't matter. the vms have started fine
19:28:15 <shague> do you only have compute running on the compute_1 - or is the control node also supposed to have compute?
19:29:30 <vpickard> shague: should only have compute running on compute_1, if i recall correctly. I havent touched any of that
19:30:00 <vpickard> im pushing a pike/carbon job now to start while we look
19:30:07 <vpickard> jamoluhrsen: did you start another job?
19:31:36 <jamoluhrsen> vpickard: yeah.
19:31:50 <jamoluhrsen> vpickard: https://jenkins.opendaylight.org/releng/job/netvirt-csit-hwvtep-1node-openstack-ocata-upstream-stateful-carbon/69/
19:31:59 <vpickard> jamoluhrsen: ok, I started this one
19:32:01 <shague> I see the problem: 2018-02-16 00:57:06.671 | + lib/rpc_backend:rpc_backend_add_vhost:109 :   sudo rabbitmqctl set_permissions -p nova_cell1 stackrabbit '.*' '.*' '.*'
19:32:04 <jamoluhrsen> vpickard: oh. it stacked and is running robot already.
19:32:14 <vpickard> jamoluhrsen: ok, that is good
19:32:24 <jamoluhrsen> vpickard: I'll rerun the other 3 now too
19:32:26 <shague> 00:57:06 is too late
19:32:27 <shague> 2018-02-16 00:56:55.373 27507 CRITICAL nova [req-cdcfe6e3-a463-421b-ab25-44d9ddb787ac - -] Unhandled error: NotAllowed: Connection.open: (530) NOT_ALLOWED - access to vhost 'nova_cell1' refused for user 'stackrabbit'
19:33:08 <shague> notice the compute tried to connect to rabbit at 00:56:55 - but the control node didn't have it configured until 00:57:06
19:33:32 <shague> the nova-compute throws an exception in this case and never restarts
19:33:55 <vpickard> shague: good debug!
19:34:09 <shague> but back at :2018-02-16 00:48:53.538 | + lib/rpc_backend:restart_rpc_backend:92   :   sudo rabbitmqctl change_password stackrabbit admin
19:35:58 <shague> that is when rabbit is checked by the run.sh to see if rabbitmq is up, so at that point it lets the compute start stacking
19:36:03 <shague> 00:49:28  rabbitmq is ready, starting 1 compute(s
19:36:26 <shague> it thinks rabbitmq started in its fifth iteration - I don't think I have ever seen it start that fast
19:39:39 <shague> guess we could add more to the is_rabbitmq_ready to actually check if that nova_cell1 is there
19:40:06 <shague> currently the ready function just checks if there is a pid for the rabbitmq on the control node, so it knows rabbitmq is running
19:40:43 <shague> but in your test, rabbit started but it took another 6 minutes before the nova_cell1 was configured
19:41:21 <shague> but the compute was now stacking during this time and 5 minutes later it tried to connect, the nova_cell1 wasn't there and blew up
19:42:20 <vpickard> is there a way to check if nova_cell1 is there? sounds like thats what is needed
19:43:37 <shague> sure there is... look for that nova_cell1 create in the stack.sh for what api devstack is using. then use something similar.
19:44:40 <vpickard> jenkins is going to shut down again
19:44:44 <shague> one other option may be to just use a placement-client on the control node also which might make the cell1 create earlier
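[editor's note] The stricter readiness check discussed above (confirm that the nova_cell1 vhost exists and that stackrabbit has permissions on it, rather than only that a rabbitmq pid exists) could look roughly like the sketch below. It is an illustration only, not the actual is_rabbitmq_ready from the CSIT scripts. It assumes the check runs on the control node and that the `rabbitmqctl list_vhosts` and `rabbitmqctl list_permissions -p <vhost>` output is tab-separated with optional "Listing ..." / "...done." lines. The names is_vhost_ready, _rows and _rabbitmqctl are hypothetical.

```python
import subprocess


def _rabbitmqctl(args):
    """Run rabbitmqctl with sudo on the local (control) node; return stdout."""
    result = subprocess.run(
        ["sudo", "rabbitmqctl", *args],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def _rows(output):
    """Yield tab-separated rows, skipping blank and informational lines."""
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("Listing ") or line == "...done.":
            continue
        yield line.split("\t")


def is_vhost_ready(vhost="nova_cell1", user="stackrabbit", run=_rabbitmqctl):
    """Return True only if the vhost exists and the user has permissions on it.

    `run` is injectable so the parsing can be exercised without a broker.
    """
    # Step 1: the vhost must be present in the broker's vhost list.
    if not any(row[0] == vhost for row in _rows(run(["list_vhosts"]))):
        return False
    # Step 2: the user must appear in that vhost's permission list.
    perms = run(["list_permissions", "-p", vhost])
    return any(row[0] == user for row in _rows(perms))
```

A caller would poll is_vhost_ready() until it returns True (with a timeout) before letting the compute node start stacking, which would close the gap between 00:56:55 and 00:57:06 seen in the logs above.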
19:45:08 <jamoluhrsen> vpickard: do we have a bug or patch to address this: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-ocata-upstream-stateful-carbon/69/robot-plugin/log_full.html.gz
19:45:55 <vpickard> the one failure at the end? Yes, I have a patch in progress. This 1 failure was caused by my patch where I added some new test cases
19:46:08 <vpickard> https://git.opendaylight.org/gerrit/#/c/68369/
19:46:29 <jamoluhrsen> vpickard: cool thanks. I just want to note that we know what's going on with the failure and we are working on it.
19:47:32 <vpickard> Its weird, my patch should have fixed it, but ${OPENSTACK_BRANCH} is empty when that patch runs in the new function
19:47:52 <vpickard> so, a little more debug on that one
19:50:32 <vpickard> jamoluhrsen: the cleanup code is attempting to delete a port that was not created, in the conditional branch stuff. So, the latest patch does conditional branch check and only attempts to delete port if it was allocated....
19:50:54 <jamoluhrsen> vpickard: ack. let me know when the patch is ready.
19:51:01 <jamoluhrsen> vpickard: speaking of ready patches, is this ready: https://git.opendaylight.org/gerrit/c/68330/
19:53:00 <vpickard> jamoluhrsen: not quite yet. the pike job ran, but the queens job bombed, I dont think it is my patch, pretty sure, but I need to figure out why that queens run bombed, I started another queens job, but been too busy bouncing between tasks today
19:53:17 <jamoluhrsen> vpickard: I looked. ODL didn't boot up
19:54:32 <vpickard> jamoluhrsen: hm. i dont think tinyrpc version would cause that
19:54:50 <jamoluhrsen> vpickard: interesting. haven't seen this in a long time:  22:55:01 looking for "BindException: Address already in use" in log file
19:54:50 <jamoluhrsen> 22:55:01 ABORTING: found BindException: Address already in use
19:55:10 <jamoluhrsen> vpickard: no. it's oxygen and something is broken on the ODL side.
19:55:37 <jamoluhrsen> vpickard:
19:55:38 <jamoluhrsen> 22:55:02 2018-02-15T22:54:37,229 | WARN  | pool-22-thread-2 | Activator                        | 125 - org.apache.karaf.management.server - 4.1.3 | Error starting activator
19:55:38 <jamoluhrsen> 22:55:02 java.rmi.server.ExportException: Port already in use: 1099; nested exception is:
19:55:47 <jamoluhrsen> vpickard: not your problem btw.
19:55:58 <vpickard> jamoluhrsen: ok, thanks for the quick debug
19:56:36 <vpickard> shague: so, sam, what do you think about the rabbitmq issue? You seem to have a good handle on it, you gonna take a crack at a patch?
19:56:52 <jamoluhrsen> vpickard: problem is, if it's a new bug that's crept in, it will abort all netvirt csit going forwards
20:02:48 <vpickard> https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-vic-upstream-stateful-carbon/1/console
20:02:58 <vpickard> jamoluhrsen: this job stacked, and is running
20:03:55 <jamoluhrsen> vpickard: yeah, that bindexception is not coming every time. I pulled the exact same distro locally and tried. no problem.
20:04:45 <vpickard> jamoluhrsen: but, my job with that issue was queens/oxygen, is that what you ran?
20:05:02 <vpickard> or, guess it would just need to be oxygen
20:05:22 <vpickard> not carbon
20:07:14 <jamoluhrsen> vpickard: yeah I pulled the oxy distro down and just started it to see if that bindexception came.
02:39:21 <shague> #endmeeting