16:01:56 #startmeeting NetVirt Weekly 02/13/18
16:01:56 Meeting started Tue Feb 13 16:01:56 2018 UTC. The chair is shague. Information about MeetBot at http://ci.openstack.org/meetbot.html.
16:01:56 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:56 The meeting name has been set to 'netvirt_weekly_02_13_18'
16:02:08 #topic Roll call and agenda bashing (please #info )
16:02:10 #info vpickard
16:02:25 #topic Review existing action items
16:02:39 #link https://meetings.opendaylight.org/opendaylight-netvirt/2018/netvirt_weekly_02_06_18/opendaylight-netvirt-netvirt_weekly_02_06_18.2018-02-06-16.00.html
16:02:49 #info jhershbe
16:03:27 #chair vpickard shague jhershbe
16:03:27 Current chairs: jhershbe shague vpickard
16:03:54 #topic [DONE] vpickard to send jira for compute node reboot missing host config
16:03:57 #info Aswin S
16:04:10 #info vivekanandan looking into networking-l2gw plugin.sh issue
16:04:17 #topic vivekanandan looking into networking-l2gw plugin.sh issue
16:04:31 #info vpickard pushed a patch to fix
16:04:49 #link https://review.openstack.org/#/c/542205/
16:05:01 #info Hanamantagoud
16:05:22 #info vorburger
16:05:26 #info this is a fix for queens
16:06:09 #info upstream csit still has issues for queens
16:06:18 #topic daya to follow up on next round of patches. action item pending from last week; router chaining spec: [spec](https://git.opendaylight.org/gerrit/#/c/65948/)
16:07:10 #info aswin__ has comments on the spec
16:07:21 #info daya still has concerns
16:07:44 #topic https://trello.com/c/SCmPOAY6/18-carbon-release-planning - looking good for netvirt - sr3 blocked on an ofp bug (which was caused by fixing a netvirt bug) [tracking sheet](https://docs.google.com/spreadsheets/d/1VcB12FBiFV4GAEHZSspHBNxKI_9XugJp-6Qbbw20Omk/edit#gid=40307633)
16:08:24 #topic https://trello.com/c/iD2fOfF1/16-nitrogen-release-planning - branch is locked for the sr2 build - looks good
16:09:30 #topic https://trello.com/c/BTurOwXh/42-oxygen-release-planning - 2/14/18 RC0 - tomorrow. Looking tight.
16:10:10 #info dualstack patches - Valentina has the last two of the internet series ready, with bugs filed. How do we want to proceed?
16:12:21 #info two internet patches are about ready to merge
16:12:44 #info 5 or so dualstack patches are left - we can wait on those patches
16:16:12 #info next patches are the upstream fixes, hanamanant's then acthuth's
16:18:50 #info smashekar's patches are also ready. aswin has reviewed. just need gates
16:27:23 #info l2gw patches next
16:27:37 #action vpickard to get l2gw csit gates running on outstanding l2gw patches
16:29:17 #topic genius auto-tz
16:29:25 #info downstream looks good
16:29:38 #info can merge the default to genius auto-tz
16:30:24 #topic upgradability
16:30:57 #info jhershbe asks why some vpn objects are not recreated in mdsal
16:40:28 #topic router chaining spec https://git.opendaylight.org/gerrit/#/c/65948/
16:41:43 #info Sridhar Gaddam
16:42:02 #info concern about installing higher-priority prefix routes
16:42:45 #info how can those flows coexist with the other existing flows
16:43:34 #info are there any implications if policies are applied to routers, like firewall
19:24:45 vpickard: you want this guy in? https://git.opendaylight.org/gerrit/c/68258/1/csit/suites/l2gw/01_Configure_verify_l2gateway.robot
19:25:25 jamoluhrsen: running a job now, let's see how that goes first; will -1 for now and +1 when the job passes. Thanks
19:25:34 vpickard: 10-4
19:25:54 vpickard: I'm reviewing your other one too. I'll let you +1/-1 that one the same way, ok?
19:26:00 https://git.opendaylight.org/gerrit/c/67173
19:26:03 jamoluhrsen: yes, thanks
19:47:34 jamoluhrsen: https://git.opendaylight.org/gerrit/#/c/68258/ is good to go
19:50:21 vpickard: merged
20:13:51 jamoluhrsen: thanks
17:51:02 jamoluhrsen: https://git.opendaylight.org/gerrit/#/c/67173/ is ready to merge when you get a chance
17:51:36 jamoluhrsen: also, had to make one little tweak to get the openstack branch check right, in the last patch, if you stashed that in some wiki/notes
18:01:57 vpickard: will look shortly. tsc mtg now
18:02:16 jamoluhrsen: 10-4
18:59:58 vpickard: seen this before? https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-pike-upstream-stateful-carbon/68/compute_1/stack.log.gz
19:00:21 jamoluhrsen: looking
19:01:52 jamoluhrsen: no, this looks new....
19:02:06 jamoluhrsen: oh wait...
19:02:18 2018-02-16 00:56:53.693 | Failed to discover available identity versions when contacting http://10.30.170.113/identity. Attempting to parse version from URL.
19:03:16 yes, seems I did see it recently; reran the job and didn't see the issue
19:03:28 vpickard: the carbon SR3 candidate failed to stack on 4 hwvtep jobs.
19:03:58 let me look, it may be the networking-l2gw plugin stuff. hang on
19:04:04 jamoluhrsen: ^^
19:04:07 vpickard: thanks man
19:08:43 jamoluhrsen: it is not the networking-l2gw plugin issue I thought it might be - that was something I fixed on queens - and the control node stacked fine.
19:10:08 vpickard: hmmm....
19:10:21 vpickard: we expect carbon to be fine, right?
19:10:27 jamoluhrsen: yeah, for sure
19:10:36 jamoluhrsen: you ran 4 jobs, on sandbox?
19:11:06 jamoluhrsen: job 67 is blue, ran yesterday
19:11:36 vpickard: no, this is releng, and this is how we are vetting that carbon SR3 is ready to go. so we have to 'splain the failures
19:11:54 vpickard: I am rerunning one job now. if it stacks and runs robot, I'll re-run the others.
19:12:04 jamoluhrsen: ok
19:12:14 vpickard: but if it also fails to stack we'll have to figure out WTH is going on
19:13:10 jamoluhrsen: did netvirt stack ok with the SR3 candidate?
19:13:26 vpickard: yeah.
19:13:39 jamoluhrsen: is the hwvtep job the only one that failed to stack like this?
19:14:01 vpickard: yeah, so far as I can tell
19:14:28 jamoluhrsen: ok, the other thing different in the jobs is that hwvtep does not have the performance vms like netvirt
19:14:38 jamoluhrsen: I have an open patch to switch over to those
19:14:46 jamoluhrsen: that might be part of it
19:15:10 vpickard: link? what do you mean "switch over to those"?
19:15:19 jamoluhrsen: or, at least, that is a difference between the job configurations
19:16:38 jamoluhrsen: ok, the patch I was referring to about the vm types for the job was merged... https://git.opendaylight.org/gerrit/#/c/68310/
19:17:13 vpickard: ah. I remember that patch.
19:17:15 jamoluhrsen: which went in yesterday, looks like
19:17:21 vpickard: that affected carbon maybe?
19:17:50 jamoluhrsen: I don't think so; the only real change was to switch the vm type. The rest was cosmetic cleanup
19:18:20 jamoluhrsen: netvirt has these same vms in carbon, right? That's where I got the changes from
19:18:28 netvirt yaml job
19:18:37 vpickard: double checking.
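(A quick triage sketch for the "Failed to discover available identity versions" error quoted above - the IP is the one from that job log, and this assumes shell access to the failing node. As the next exchange shows, the real root cause turned out to be elsewhere.)

    # Probe keystone directly from the node that logged the error.
    # A JSON version document back means the identity service is up and
    # the discovery failure was a timing/ordering symptom; a refused or
    # timed-out connection means keystone (or its web front end) never
    # came up on the control node.
    curl -sS http://10.30.170.113/identity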
19:19:17 jamoluhrsen: vpickard: that job failed to stack because of rabbitmq
19:19:26 https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-pike-upstream-stateful-carbon/68/compute_1/n-cpu.log.2018-02-16-005045.gz
19:19:47 notice the exception at the beginning; once that happens, nova-compute is dead
19:20:06 then in the stack.sh you see it is trying to find the nova-compute - but it is dead, so it never finds it
19:20:07 shague: thanks shague
19:21:04 shague: so, what, if anything, to do about this?
19:21:07 vpickard: shague: maybe specifying the vm flavor is the culprit?
19:21:25 vpickard: that's the only real change, right?
19:21:57 jamoluhrsen: shague: yeah, but I thought I had seen this before in one of my recent jobs; just reran the job, let me see if I can find that in sandbox, if it was this week
19:22:00 what patch is this in, or what other changes were made to the job?
19:22:15 jamoluhrsen: shague: https://git.opendaylight.org/gerrit/#/c/68310/
19:24:37 jamoluhrsen: shague: nope, all my jobs in sandbox from this week are oxygen
19:25:28 shague: do you think changing the type of vm in the job would cause this issue? These are the same as the netvirt vms
19:25:34 I don't see how that could be it
19:26:42 yeah, that shouldn't matter. the vms have started fine
19:28:15 do you only have compute running on compute_1 - or is the control node also supposed to have compute?
19:29:30 shague: should only have compute running on compute_1, if I recall correctly. I haven't touched any of that
19:30:00 I'm pushing a pike/carbon job now to start while we look
19:30:07 jamoluhrsen: did you start another job?
19:31:36 vpickard: yeah.
19:31:50 vpickard: https://jenkins.opendaylight.org/releng/job/netvirt-csit-hwvtep-1node-openstack-ocata-upstream-stateful-carbon/69/
19:31:59 jamoluhrsen: ok, I started this one
19:32:01 I see the problem: 2018-02-16 00:57:06.671 | + lib/rpc_backend:rpc_backend_add_vhost:109 : sudo rabbitmqctl set_permissions -p nova_cell1 stackrabbit '.*' '.*' '.*'
19:32:04 vpickard: oh. it stacked and is running robot already.
19:32:14 jamoluhrsen: ok, that is good
19:32:24 vpickard: I'll rerun the other 3 now too
19:32:26 00:57:06 is too late
19:32:27 2018-02-16 00:56:55.373 27507 CRITICAL nova [req-cdcfe6e3-a463-421b-ab25-44d9ddb787ac - -] Unhandled error: NotAllowed: Connection.open: (530) NOT_ALLOWED - access to vhost 'nova_cell1' refused for user 'stackrabbit'
19:33:08 notice the compute tried to connect to rabbit at 00:56:55 - but the control node didn't have it configured until 00:57:06
19:33:32 nova-compute throws an exception in this case and never restarts
19:33:55 shague: good debug!
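(A minimal sketch of how to confirm the race shague just walked through, assuming the two archived logs linked above have been downloaded locally; the grep patterns are illustrative.)

    # When did nova-compute first hit the missing vhost? (compute log)
    zgrep "NOT_ALLOWED - access to vhost 'nova_cell1'" n-cpu.log.2018-02-16-005045.gz

    # When did devstack on the control node actually create/permission it?
    zgrep "rpc_backend_add_vhost" stack.log.gz

    # Here the compute failed at 00:56:55 but the vhost permissions only
    # landed at 00:57:06 - nova-compute dies on that exception and never
    # retries, so stack.sh later fails waiting for it.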
19:34:09 but back at 2018-02-16 00:48:53.538 | + lib/rpc_backend:restart_rpc_backend:92 : sudo rabbitmqctl change_password stackrabbit admin
19:35:58 that is when rabbit is checked by the run.sh to see if rabbitmq is up, so at that point it lets the compute start stacking
19:36:03 00:49:28 rabbitmq is ready, starting 1 compute(s)
19:36:26 it thinks rabbitmq started in its fifth iteration - I don't think I have ever seen it start that fast
19:39:39 guess we could add more to is_rabbitmq_ready to actually check if that nova_cell1 is there
19:40:06 currently the ready function just checks if there is a pid for rabbitmq on the control node, so it knows rabbitmq is running
19:40:43 but in your test, rabbit started but it took another 6 minutes before nova_cell1 was configured
19:41:21 but the compute was now stacking during this time, and 5 minutes later it tried to connect; nova_cell1 wasn't there and it blew up
19:42:20 is there a way to check if nova_cell1 is there? sounds like that's what is needed
19:43:37 sure there is... look for that nova_cell1 create in the stack.sh to see what api devstack is using, then use something similar.
19:44:40 jenkins is going to shut down again
19:44:44 one other option may be to just use a placement-client on the control node also, which might make the cell1 create happen earlier
19:45:08 vpickard: do we have a bug or patch to address this: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-ocata-upstream-stateful-carbon/69/robot-plugin/log_full.html.gz
19:45:55 the one failure at the end? Yes, I have a patch in progress. This 1 failure was caused by my patch where I added some new test cases
19:46:08 https://git.opendaylight.org/gerrit/#/c/68369/
19:46:29 vpickard: cool, thanks. I just want to note that we know what's going on with the failure and we are working on it.
19:47:32 It's weird, my patch should have fixed it, but ${OPENSTACK_BRANCH} is empty when that patch runs in the new function
19:47:52 so, a little more debug on that one
19:50:32 jamoluhrsen: the cleanup code is attempting to delete a port that was not created, in the conditional branch stuff. So, the latest patch does a conditional branch check and only attempts to delete the port if it was allocated....
19:50:54 vpickard: ack. let me know when the patch is ready.
19:51:01 vpickard: speaking of ready patches, is this one ready: https://git.opendaylight.org/gerrit/c/68330/
19:53:00 jamoluhrsen: not quite yet. the pike job ran, but the queens job bombed. I don't think it is my patch - pretty sure - but I need to figure out why that queens run bombed. I started another queens job, but I've been too busy bouncing between tasks today
19:53:17 vpickard: I looked. ODL didn't boot up
19:54:32 jamoluhrsen: hm. I don't think the tinyrpc version would cause that
19:54:50 vpickard: interesting. haven't seen this in a long time: 22:55:01 looking for "BindException: Address already in use" in log file
19:54:50 22:55:01 ABORTING: found BindException: Address already in use
19:55:10 vpickard: no. it's oxygen and something is broken on the ODL side.
19:55:37 vpickard:
19:55:38 22:55:02 2018-02-15T22:54:37,229 | WARN | pool-22-thread-2 | Activator | 125 - org.apache.karaf.management.server - 4.1.3 | Error starting activator
19:55:38 22:55:02 java.rmi.server.ExportException: Port already in use: 1099; nested exception is:
19:55:47 vpickard: not your problem btw.
19:55:58 jamoluhrsen: ok, thanks for the quick debug
19:56:36 shague: so, sam, what do you think about the rabbitmq issue? You seem to have a good handle on it - you gonna take a crack at a patch?
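(A minimal sketch of the extra readiness gate shague suggests above. is_rabbitmq_ready is the function named in the transcript; the body here is illustrative, not the actual CSIT/devstack code.)

    is_rabbitmq_ready() {
        # existing behavior per the transcript: only check that the
        # broker process is up on the control node
        pgrep -f rabbitmq-server >/dev/null || return 1
        # proposed addition: also require that devstack has created the
        # nova_cell1 vhost, so computes can't race ahead of it
        sudo rabbitmqctl list_vhosts | grep -qx 'nova_cell1' || return 1
        return 0
    }

    # the caller would keep polling until both conditions hold, e.g.:
    # until is_rabbitmq_ready; do sleep 10; done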
19:56:52 vpickard: problem is, if it's a new bug that's crept in, it will abort all netvirt csit going forward
20:02:48 https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-vic-upstream-stateful-carbon/1/console
20:02:58 jamoluhrsen: this job stacked, and is running
20:03:55 vpickard: yeah, that bindexception is not coming every time. I pulled the exact same distro locally and tried it. no problem.
20:04:45 jamoluhrsen: but my job with that issue was queens/oxygen, is that what you ran?
20:05:02 or, I guess it would just need to be oxygen
20:05:22 not carbon
20:07:14 vpickard: yeah, I pulled the oxy distro down and just started it to see if that bindexception came.
02:39:21 #endmeeting
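(For the record, a quick way to see what already holds karaf's RMI registry port when that BindException shows up - 1099 comes from the ExportException quoted above; assumes a Linux node with iproute2.)

    # Identify the process already bound to port 1099 before restarting ODL.
    sudo ss -tlnp '( sport = :1099 )'
    # on older images without ss:
    # sudo netstat -tlnp | grep ':1099 '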