16:01:56 <shague> #startmeeting NetVirt Weekly 02/13/18
16:01:56 <odl_meetbot> Meeting started Tue Feb 13 16:01:56 2018 UTC. The chair is shague. Information about MeetBot at http://ci.openstack.org/meetbot.html.
16:01:56 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:56 <odl_meetbot> The meeting name has been set to 'netvirt_weekly_02_13_18'
16:02:08 <shague> #topic Roll call and agenda bashing (please #info <your-nick>)
16:02:10 <vpickard> #info vpickard
16:02:25 <shague> #topic Review existing action items
16:02:39 <shague> #link https://meetings.opendaylight.org/opendaylight-netvirt/2018/netvirt_weekly_02_06_18/opendaylight-netvirt-netvirt_weekly_02_06_18.2018-02-06-16.00.html
16:02:49 <jhershbe> #info jhershbe
16:03:27 <shague> #chair vpickard shague jhershbe
16:03:27 <odl_meetbot> Current chairs: jhershbe shague vpickard
16:03:54 <shague> #topic [DONE] vpickard to send jira for compute node reboot missing host config
16:03:57 <aswin__> #info Aswin S
16:04:10 <shague> #info vivekanandan looking into networking-l2gw plugin.sh issue
16:04:17 <shague> #topic vivekanandan looking into networking-l2gw plugin.sh issue
16:04:31 <shague> #info vpickard pushed a patch to fix
16:04:49 <vpickard> #link https://review.openstack.org/#/c/542205/
16:05:01 <Hanamantagoud> #info Hanamantagoud
16:05:22 <vorburger> #info vorburger
16:05:26 <shague> #info this is a fix for queens
16:06:09 <shague> #info upstream csit still has issues for queens
16:06:18 <shague> #topic daya to follow up on next round of patches.
action item pending from last week router chaining specs: [spec](https://git.opendaylight.org/gerrit/#/c/65948/)
16:07:10 <shague> #info aswin__ has comments on spec
16:07:21 <shague> #info daya still has concerns
16:07:44 <shague> #topic https://trello.com/c/SCmPOAY6/18-carbon-release-planning - looking good for netvirt - sr3 blocked on an ofp bug (which was caused by fixing a netvirt bug) [tracking sheet](https://docs.google.com/spreadsheets/d/1VcB12FBiFV4GAEHZSspHBNxKI_9XugJp-6Qbbw20Omk/edit#gid=40307633)
16:08:24 <shague> #topic https://trello.com/c/iD2fOfF1/16-nitrogen-release-planning - branch is locked for sr2 build - looks good
16:09:30 <shague> #topic https://trello.com/c/BTurOwXh/42-oxygen-release-planning - 2/14/18 RC0 - tomorrow. Looking tight.
16:10:10 <shague> #info dualstack patches - Valentina has the last two of the internet series ready, with bugs filed. How do we want to proceed?
16:12:21 <shague> #info two internet patches are about ready to merge
16:12:44 <shague> #info 5 or so dualstack patches are left - we can wait on those patches
16:16:12 <shague> #info next patches are the upstream fixes, hanamanant's then acthuth's
16:18:50 <shague> #info smashekar's patches are also ready. aswin has reviewed.
just need gates
16:27:23 <shague> #info l2gw patches next
16:27:37 <vpickard> #action vpickard to get l2gw csit gates running on outstanding l2gw patches
16:29:17 <shague> #topic genius auto-tz
16:29:25 <shague> #info downstream looks good
16:29:38 <shague> #info can merge the default to genius auto-tz
16:30:24 <shague> #topic upgradability
16:30:57 <shague> #info jhershbe asks why some vpn objects are not recreated in mdsal
16:40:28 <shague> #topic router chaining spec https://git.opendaylight.org/gerrit/#/c/65948/
16:41:43 <sridharg> #info Sridhar Gaddam
16:42:02 <shague> #info concern about installing higher-priority prefix routes
16:42:45 <shague> #info how can those flows coexist with the other existing flows
16:43:34 <shague> #info are there any implications if policies are applied to routers, like firewall
19:24:45 <jamoluhrsen> vpickard: you want this guy in? https://git.opendaylight.org/gerrit/c/68258/1/csit/suites/l2gw/01_Configure_verify_l2gateway.robot
19:25:25 <vpickard> jamoluhrsen: running a job now, let's see how that goes first; will -1 for now and +1 when the job passes. Thanks
19:25:34 <jamoluhrsen> vpickard: 10-4
19:25:54 <jamoluhrsen> vpickard: I'm reviewing your other one too. I'll let you +1/-1 that one the same, ok?
19:26:00 <jamoluhrsen> https://git.opendaylight.org/gerrit/c/67173
19:26:03 <vpickard> jamoluhrsen: yes, thanks
19:47:34 <vpickard> jamoluhrsen: https://git.opendaylight.org/gerrit/#/c/68258/ is good to go
19:50:21 <jamoluhrsen> vpickard: merged
20:13:51 <vpickard> jamoluhrsen: thanks
17:51:02 <vpickard> jamoluhrsen: https://git.opendaylight.org/gerrit/#/c/67173/ is ready to merge when you get a chance
17:51:36 <vpickard> jamoluhrsen: also, had to make one little tweak to get the openstack branch check right in the last patch, if you stashed that in some wiki/notes
18:01:57 <jamoluhrsen> vpickard: will look shortly. tsc mtg now
18:02:16 <vpickard> jamoluhrsen: 10-4
18:59:58 <jamoluhrsen> vpickard: seen this before?
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-pike-upstream-stateful-carbon/68/compute_1/stack.log.gz
19:00:21 <vpickard> jamoluhrsen: looking
19:01:52 <vpickard> jamoluhrsen: no, this looks new...
19:02:06 <vpickard> jamoluhrsen: oh wait...
19:02:18 <vpickard> 2018-02-16 00:56:53.693 | Failed to discover available identity versions when contacting http://10.30.170.113/identity. Attempting to parse version from URL.
19:03:16 <vpickard> yes, seems I did recently; reran the job again and didn't see the issue
19:03:28 <jamoluhrsen> vpickard: the carbon SR3 candidate failed to stack on 4 hwvtep jobs.
19:03:58 <vpickard> let me look, it may be the networking-l2gw plugin stuff. hang on
19:04:04 <vpickard> jamoluhrsen: ^^
19:04:07 <jamoluhrsen> vpickard: thanks man
19:08:43 <vpickard> jamoluhrsen: it is not the networking-l2gw plugin issue that I thought it might be; that was something I fixed on queens, and the control node stacked fine.
19:10:08 <jamoluhrsen> vpickard: hmmm...
19:10:21 <jamoluhrsen> vpickard: we expect carbon to be fine, right?
19:10:27 <vpickard> jamoluhrsen: yeah, for sure
19:10:36 <vpickard> jamoluhrsen: you ran 4 jobs, on sandbox?
19:11:06 <vpickard> jamoluhrsen: job 67 is blue, ran yesterday
19:11:36 <jamoluhrsen> vpickard: no, this is releng, and this is how we are vetting that carbon SR3 is ready to go. so we have to 'splain the failures
19:11:54 <jamoluhrsen> vpickard: I am rerunning one job now. if it stacks and runs robot, I'll re-run the others.
19:12:04 <vpickard> jamoluhrsen: ok
19:12:14 <jamoluhrsen> vpickard: but if it also fails to stack we'll have to figure out WTH is going on
19:13:10 <vpickard> jamoluhrsen: did netvirt stack ok with the SR3 candidate?
19:13:26 <jamoluhrsen> vpickard: yeah.
19:13:39 <vpickard> jamoluhrsen: is the hwvtep job the only job that failed to stack like this?
19:14:01 <jamoluhrsen> vpickard: yeah.
so far as I can tell
19:14:28 <vpickard> jamoluhrsen: ok, the other thing different in the jobs is that hwvtep does not have the performance vms like netvirt
19:14:38 <vpickard> jamoluhrsen: i have an open patch to switch over to those
19:14:46 <vpickard> jamoluhrsen: that might be part of it
19:15:10 <jamoluhrsen> vpickard: link? what do you mean "switch over to those"?
19:15:19 <vpickard> jamoluhrsen: or, at least, that is a difference between the job configurations
19:16:38 <vpickard> jamoluhrsen: ok, that patch I was referring to about the vm types for the job was merged... https://git.opendaylight.org/gerrit/#/c/68310/
19:17:13 <jamoluhrsen> vpickard: ah. I remember that patch.
19:17:15 <vpickard> jamoluhrsen: which went in yesterday, looks like
19:17:21 <jamoluhrsen> vpickard: that affected carbon maybe?
19:17:50 <vpickard> jamoluhrsen: I don't think so; the only real change was to switch the vm type. The rest was cosmetic cleanup
19:18:20 <vpickard> jamoluhrsen: netvirt has these same vms in carbon, right? That's where I got the changes from
19:18:28 <vpickard> netvirt yaml job
19:18:37 <jamoluhrsen> vpickard: double checking.
19:19:17 <shague> jamoluhrsen: vpickard: that job failed to stack because of rabbitmq
19:19:26 <shague> https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-pike-upstream-stateful-carbon/68/compute_1/n-cpu.log.2018-02-16-005045.gz
19:19:47 <shague> notice the exception at the beginning; once that happens, nova-compute is dead
19:20:06 <shague> then in the stack.sh you see it is trying to find the nova-compute - but it is dead, so it never finds it
19:20:07 <vpickard> shague: thanks shague
19:21:04 <vpickard> shague: so, what if anything to do about this?
19:21:07 <jamoluhrsen> vpickard: shague: maybe specifying the vm flavor is the culprit?
19:21:25 <jamoluhrsen> vpickard: that's the only real change right?
19:21:57 <vpickard> jamoluhrsen: shague: yeah, but I thought I had seen this before in one of my recent jobs; just reran the job. let me see if I can find that in sandbox, if it was this week
19:22:00 <shague> what patch is this in, or what other changes to the job?
19:22:15 <vpickard> jamoluhrsen: shague: https://git.opendaylight.org/gerrit/#/c/68310/
19:24:37 <vpickard> jamoluhrsen: shague: nope, all my jobs in sandbox from this week are oxygen
19:25:28 <vpickard> shague: do you think changing the type of vm in the job would cause this issue? These are the same as the netvirt vms
19:25:34 <vpickard> I don't see how that could be it
19:26:42 <shague> yeah, that shouldn't matter. the vms have started fine
19:28:15 <shague> do you only have compute running on compute_1 - or is the control node also supposed to have compute?
19:29:30 <vpickard> shague: should only have compute running on compute_1, if I recall correctly. I haven't touched any of that
19:30:00 <vpickard> I'm pushing a pike/carbon job now to start while we look
19:30:07 <vpickard> jamoluhrsen: did you start another job?
19:31:36 <jamoluhrsen> vpickard: yeah.
19:31:50 <jamoluhrsen> vpickard: https://jenkins.opendaylight.org/releng/job/netvirt-csit-hwvtep-1node-openstack-ocata-upstream-stateful-carbon/69/
19:31:59 <vpickard> jamoluhrsen: ok, I started this one
19:32:01 <shague> I see the problem: 2018-02-16 00:57:06.671 | + lib/rpc_backend:rpc_backend_add_vhost:109 : sudo rabbitmqctl set_permissions -p nova_cell1 stackrabbit '.*' '.*' '.*'
19:32:04 <jamoluhrsen> vpickard: oh. it stacked and is running robot already.
19:32:14 <vpickard> jamoluhrsen: ok, that is good
19:32:24 <jamoluhrsen> vpickard: I'll rerun the other 3 now too
19:32:26 <shague> 00:57:06 is too late
19:32:27 <shague> 2018-02-16 00:56:55.373 27507 CRITICAL nova [req-cdcfe6e3-a463-421b-ab25-44d9ddb787ac - -] Unhandled error: NotAllowed: Connection.open: (530) NOT_ALLOWED - access to vhost 'nova_cell1' refused for user 'stackrabbit'
19:33:08 <shague> notice the compute tried to connect to rabbit at 00:56:55 - but the control node didn't have it configured until 00:57:06
19:33:32 <shague> nova-compute throws an exception in this case and never restarts
19:33:55 <vpickard> shague: good debug!
19:34:09 <shague> but back at: 2018-02-16 00:48:53.538 | + lib/rpc_backend:restart_rpc_backend:92 : sudo rabbitmqctl change_password stackrabbit admin
19:35:58 <shague> that is when rabbitmq is checked by the run.sh to see if it is up, so at that point it lets the compute start stacking
19:36:03 <shague> 00:49:28 rabbitmq is ready, starting 1 compute(s)
19:36:26 <shague> it thinks rabbitmq started in its fifth iteration - I don't think I have ever seen it start that fast
19:39:39 <shague> guess we could add more to the is_rabbitmq_ready to actually check that nova_cell1 is there
19:40:06 <shague> currently the ready function just checks if there is a pid for rabbitmq on the control node, so it knows rabbitmq is running
19:40:43 <shague> but in your test, rabbit started but it took another 6 minutes before nova_cell1 was configured
19:41:21 <shague> but the compute was now stacking during this time, and 5 minutes later it tried to connect; nova_cell1 wasn't there and it blew up
19:42:20 <vpickard> is there a way to check if nova_cell1 is there? sounds like that's what is needed
19:43:37 <shague> sure there is... look for that nova_cell1 create in the stack.sh for what api devstack is using. then use a similar.
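[Editor's note: the stronger readiness check shague sketches above (check for the nova_cell1 vhost, not just a rabbitmq pid) could look roughly like the following. This is a hedged sketch only; the function name `is_rabbitmq_ready` comes from the discussion, but the ssh plumbing, polling interval, and helper names are illustrative assumptions, not the actual CSIT job code.]

```shell
# has_nova_cell1: given the output of `rabbitmqctl list_vhosts` (one vhost
# name per line), succeed only if the nova_cell1 vhost is present.
has_nova_cell1() {
    printf '%s\n' "$1" | grep -q '^nova_cell1$'
}

# is_rabbitmq_ready: the pid-style check the logs describe today, plus the
# vhost check, so computes are not released before devstack creates the cell.
# control_ip and the ssh commands are illustrative.
is_rabbitmq_ready() {
    local control_ip=$1
    ssh "${control_ip}" 'pgrep -f rabbitmq-server' >/dev/null || return 1
    has_nova_cell1 "$(ssh "${control_ip}" 'sudo rabbitmqctl list_vhosts')"
}

# Poll for up to ~10 minutes before letting the compute nodes start stacking,
# covering the ~6-minute gap between rabbitmq starting and nova_cell1 existing.
wait_for_rabbitmq() {
    local control_ip=$1 i
    for i in $(seq 1 60); do
        is_rabbitmq_ready "${control_ip}" && return 0
        sleep 10
    done
    echo "rabbitmq/nova_cell1 never became ready" >&2
    return 1
}
```

This would have held the compute back until 00:57:06 in the failing run, avoiding the NOT_ALLOWED exception at 00:56:55.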
19:44:40 <vpickard> jenkins is going to shut down again
19:44:44 <shague> one other option may be to just use a placement-client on the control node also, which might make the cell1 create earlier
19:45:08 <jamoluhrsen> vpickard: do we have a bug or patch to address this: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-hwvtep-1node-openstack-ocata-upstream-stateful-carbon/69/robot-plugin/log_full.html.gz
19:45:55 <vpickard> the one failure at the end? Yes, I have a patch in progress. This 1 failure was caused by my patch where I added some new test cases
19:46:08 <vpickard> https://git.opendaylight.org/gerrit/#/c/68369/
19:46:29 <jamoluhrsen> vpickard: cool, thanks. I just want to note that we know what's going on with the failure and we are working on it.
19:47:32 <vpickard> It's weird, my patch should have fixed it, but ${OPENSTACK_BRANCH} is empty when that patch runs in the new function
19:47:52 <vpickard> so, a little more debug on that one
19:50:32 <vpickard> jamoluhrsen: the cleanup code is attempting to delete a port that was not created, in the conditional branch stuff. So, the latest patch does a conditional branch check and only attempts to delete the port if it was allocated...
19:50:54 <jamoluhrsen> vpickard: ack. let me know when the patch is ready.
19:51:01 <jamoluhrsen> vpickard: speaking of ready patches, is this ready: https://git.opendaylight.org/gerrit/c/68330/
19:53:00 <vpickard> jamoluhrsen: not quite yet. the pike job ran, but the queens job bombed. I don't think it is my patch, pretty sure, but I need to figure out why that queens run bombed. I started another queens job, but been too busy bouncing between tasks today
19:53:17 <jamoluhrsen> vpickard: I looked. ODL didn't boot up
19:54:32 <vpickard> jamoluhrsen: hm. I don't think the tinyrpc version would cause that
19:54:50 <jamoluhrsen> vpickard: interesting.
haven't seen this in a long time: 22:55:01 looking for "BindException: Address already in use" in log file
19:54:50 <jamoluhrsen> 22:55:01 ABORTING: found BindException: Address already in use
19:55:10 <jamoluhrsen> vpickard: no. it's oxygen and something is broken on the ODL side.
19:55:37 <jamoluhrsen> vpickard:
19:55:38 <jamoluhrsen> 22:55:02 2018-02-15T22:54:37,229 | WARN | pool-22-thread-2 | Activator | 125 - org.apache.karaf.management.server - 4.1.3 | Error starting activator
19:55:38 <jamoluhrsen> 22:55:02 java.rmi.server.ExportException: Port already in use: 1099; nested exception is:
19:55:47 <jamoluhrsen> vpickard: not your problem btw.
19:55:58 <vpickard> jamoluhrsen: ok, thanks for the quick debug
19:56:36 <vpickard> shague: so, sam, what do you think about the rabbitmq issue? You seem to have a good handle on it, you gonna take a crack at a patch?
19:56:52 <jamoluhrsen> vpickard: problem is, if it's a new bug that's crept in, it will abort all netvirt csit going forward
20:02:48 <vpickard> https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-vic-upstream-stateful-carbon/1/console
20:02:58 <vpickard> jamoluhrsen: this job stacked, and is running
20:03:55 <jamoluhrsen> vpickard: yeah, that bindexception is not coming every time. I pulled the exact same distro locally and tried. no problem.
20:04:45 <vpickard> jamoluhrsen: but my job with that issue was queens/oxygen; is that what you ran?
20:05:02 <vpickard> or, guess it would just need to be oxygen
20:05:22 <vpickard> not carbon
20:07:14 <jamoluhrsen> vpickard: yeah, I pulled the oxy distro down and just started it to see if that bindexception came.
02:39:21 <shague> #endmeeting
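[Editor's note: the abort jamoluhrsen quotes ("looking for 'BindException: Address already in use' in log file" followed by "ABORTING") amounts to a grep over the karaf log. A minimal sketch of that kind of check follows; the function name, messages, and return codes are illustrative assumptions, not the actual releng script.]

```shell
# check_for_bind_exception: scan an ODL karaf log for the telltale line
# that the JMX registry port (1099) was already bound, and signal an abort
# if found. Grep's -q flag sets the exit status without printing matches.
check_for_bind_exception() {
    local log_file=$1
    if grep -q 'BindException: Address already in use' "${log_file}"; then
        echo "ABORTING: found BindException: Address already in use"
        return 1
    fi
    return 0
}
```

In CSIT terms, a nonzero return here is what would abort the run before robot starts, as seen in the 22:55:01 log excerpt above.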