15:00:22 <yamahata_> #startmeeting neutron_northbound
15:00:22 <odl_meetbot> Meeting started Mon Apr 24 15:00:22 2017 UTC. The chair is yamahata_. Information about MeetBot at http://ci.openstack.org/meetbot.html.
15:00:22 <odl_meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:22 <odl_meetbot> The meeting name has been set to 'neutron_northbound'
15:00:23 <mkolesni> hi
15:00:32 <yamahata_> mkolesni: hi
15:00:33 <mkolesni> sorry wasn't on last week, we had holidays
15:00:41 <yamahata_> #topic agenda bashing and roll call
15:00:48 <yamahata_> #info yamahata
15:00:49 <mkolesni> #info mkolesni
15:00:58 <rajivk> #info rajivk
15:01:07 <yamahata_> #link https://wiki.opendaylight.org/view/NeutronNorthbound:Meetings meeting agenda
15:01:26 <yamahata_> do we have any other topic to discuss this week?
15:01:37 <yamahata_> I think today we have several patches to discuss.
15:01:46 <yamahata_> anything else?
15:02:01 <mkolesni> nothing special from me
15:02:11 <yamahata_> btw last week I was pulled out for other issue. This week I'm going to catch up.
15:02:25 <mkolesni> oh i can mention a bit about scale testing
15:02:42 <mkolesni> we had some scale testing done with newton code
15:02:51 <yamahata_> very cool.
15:02:59 <mkolesni> indeed
15:03:09 <mkolesni> so theres a few interesting findings there
15:03:24 <mkolesni> let me know shen i can talk about it
15:03:26 <mkolesni> when*
15:03:45 <yamahata_> Then let's discuss before patch/bug discussion.
15:04:00 <yamahata_> with pike/nitrogen planning
15:04:04 <mkolesni> ok
15:04:19 <yamahata_> anything else?
15:05:06 <vthapar> I'm also catching up after long break
15:05:28 <yamahata_> okay move on
15:05:30 <yamahata_> #topic Announcements
15:05:42 <yamahata_> Carbon release is delayed due to the karaf4 issue.
15:05:53 <yamahata_> Anyway this week release review will take place
15:06:20 <yamahata_> Then we should have a release plan for Nitrogen
15:06:42 <yamahata_> For Pike I have no update. just openstack summit is approaching.
15:06:55 <yamahata_> any other announcement?
15:07:22 <mkolesni> unfortunately im unable to come to the upcoming summit
15:07:42 <yamahata_> #topic action items from last meeting
15:07:58 <yamahata_> there were several patch reviews. but I'm going to catch them up this week
15:08:05 <yamahata_> #topic Pike/Nitrogen planning
15:08:20 <yamahata_> Okay, So far I'm planning for Pike/Nitrogen
15:08:26 <yamahata_> in my mind
15:08:34 <yamahata_> dhcp port creation issue.
15:08:50 <yamahata_> port status update. I think Josh and Ritu would drive it.
15:08:57 <yamahata_> openstack rpc from odl.
15:09:10 <yamahata_> those would be major issues.
15:09:21 <yamahata_> mkolesni: you are on the stage.
15:09:29 <mkolesni> thanks
15:09:55 <mkolesni> so we had some scale testing done at red hat with newton code base
15:10:16 <mkolesni> im not sure if the results will be published or not, if they will i'll share the link
15:10:20 <yamahata_> I'm very curious about it.
15:10:35 <vthapar> v2 driver I presume?
15:10:41 <mkolesni> so first there was a problem where the entire cloud grinds to a halt after a while
15:10:50 <mkolesni> yes v2 driver
15:11:15 <mkolesni> so it was identified that there were too many journal threads
15:11:37 <mkolesni> the test machine had 56 cores available
15:11:48 <yamahata_> For example, do Sam or Andre know the scale test? If yes, we can discuss at ODL DDF privately. so that you don't have to open the result publicly.
15:11:59 <mkolesni> by default, in newton, neutron will allocate all cores to api & rpc workers
15:12:08 <yamahata_> Only with single neutron server?
15:12:23 <mkolesni> no there were 3 controllers
15:12:29 <mkolesni> but each had 56 cores
15:12:46 <yamahata_> Cool.
15:12:47 <mkolesni> so there were 112 processes of neutron on controller node
15:12:59 <mkolesni> each had at least 2 journal threads
15:13:40 <mkolesni> so in total seems that the abundance of threads exhausts all db server side connections
15:13:49 <rajivk> mkolesni, was it due to the limit on connections to the database?
15:14:01 <mkolesni> yes but on the db side
15:14:03 <rajivk> how many connections did you set for database?
15:14:22 <mkolesni> im not sure if it was changed or not
15:14:30 <rajivk> yeah, i worked on openstack HA a year ago, we faced this issue.
15:14:35 <mkolesni> but the same env was run with ml2+ovs combo and had no problems
15:14:38 <yamahata_> # of allowed connections in mysql needs to be adjusted with a huge deployment.
15:15:02 <rajivk> yeah, we adjusted it, we faced the same issue with ml2+ovs
15:15:13 <mkolesni> so api & rpc workers were set to 8 which seemed to work
15:15:37 <mkolesni> but the default max for ocata+ is 12
15:15:49 <mkolesni> and that didnt work well (again too many connections)
15:16:01 <mkolesni> so i applied journal singleton patch
15:16:16 <mkolesni> that made things a lot better, for 12 threads the tests passed
15:16:31 <yamahata_> Now I'm seeing your motivation of the patch.
15:16:37 <mkolesni> however seemed that journal was backing up with requests
15:16:57 <mkolesni> so i tried to analyze the logs but theyre quite lacking in detail
15:17:23 <mkolesni> seems that to fetch a journal row takes increasingly more and more time as the table fills with thousands of rows
15:17:38 <mkolesni> so maybe an index can help
15:18:07 <yamahata_> I've been thinking that the current logic is trying to pick up row based on updated time. Not the oldest one.
15:18:12 <mkolesni> anyway i wanted to also apply the dependency validation mode patch, but didnt have enough time to finish it before the testing window closed
15:18:26 <yamahata_> So far you've been sticking to such picking order.
15:18:35 <mkolesni> yes i saw your bug report
15:18:47 <mkolesni> well the journal is built to work around that
15:18:57 <mkolesni> error handling and all
15:18:58 <yamahata_> Later I'd like to explain about your dependency validation issue.
15:19:14 <rajivk> Do you think adding a layer of key-value store will reduce the problem to some extent, where the read queries can be directed to it?
15:19:21 <mkolesni> anyway i didnt have time to see if the dependency validation improves the situation or not
15:19:55 <rajivk> or maybe we can keep some data structure to store dependencies etc. to reduce load on the database.
15:20:08 <mkolesni> im not sure, we need to further analyze the logs and perhaps run a profiler to identify the hot spots
15:20:48 <mkolesni> just adding cache is like shooting in the dark, not sure it will help but can be very painful :)
15:21:08 <mkolesni> we need to identify the hot spots via profiling or some log analysis
15:21:11 <mkolesni> anyway
15:21:19 <rajivk> I worked on dragonflow for some time.
15:21:30 <mkolesni> another issue was identified with odl itself where it leaked memory
15:21:37 <mkolesni> so we had to set heap size to 22GB
15:21:51 <yamahata_> do you mean ODL jvm process?
15:21:55 <yamahata_> Hmm that's bad.
15:21:58 <mkolesni> yes
15:22:30 <vthapar> any analysis done on where the leak was?
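A back-of-the-envelope sketch of the worker/connection arithmetic mkolesni describes above. The per-thread connection count and the MySQL max_connections default are assumptions for illustration only, not figures from the Red Hat test:

```python
# Illustrative arithmetic for the scale problem described above.
controllers = 3
cores_per_controller = 56
# Newton defaults api_workers and rpc_workers to the core count,
# giving roughly 2 * 56 = 112 neutron processes per controller.
neutron_processes = 2 * cores_per_controller
journal_threads_per_process = 2     # "at least 2 journal threads" per process
connections_per_thread = 1          # assumption: each polling thread holds one DB connection

total_threads = controllers * neutron_processes * journal_threads_per_process
total_connections = total_threads * connections_per_thread
mysql_max_connections = 151         # MySQL's shipped default, if left untuned

print(total_threads)                               # 672 journal threads polling the DB
print(total_connections > mysql_max_connections)   # True -> server-side connections exhausted
```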
15:22:50 <mkolesni> not that i know of, i focused on the driver part
15:23:22 <vthapar> ok
15:23:40 <mkolesni> so in conclusion the tests revealed that too many threads arent better, as i suspected
15:23:50 <mkolesni> so we can discuss that further if youd like
15:24:35 <yamahata_> regarding networking-odl, there are at least two scalability issues.
15:24:42 <yamahata_> the number of journal threads.
15:25:25 <yamahata_> Maybe one thread per neutron process? or some other way to tune it, e.g. only several threads per parent server, no thread in worker process.
15:25:43 <mkolesni> i think maybe in rpc worker we dont need threads?
15:25:59 <mkolesni> if we dont run with dhcp agent at least
15:26:04 <yamahata_> another issue is that the journaling thread would lag.
15:26:12 <mkolesni> since those threads will only be triggered by timer
15:26:43 <yamahata_> We can have api/rpc worker process. But we don't have to run journal threads in worker process.
15:27:11 <mkolesni> yes thats what i meant only have journal thread per api worker
15:27:14 <yamahata_> From your description, there are too many journal threads, right?
15:27:21 <mkolesni> yes
15:27:32 <mkolesni> so singleton patch actually mitigates it
15:27:40 <mkolesni> since it forces 1 thread per process
15:27:53 <mkolesni> but it doesn't discern if its api or rpc worker process
15:27:57 <yamahata_> We can have more flexible way other than 1 thread per process.
15:28:18 <mkolesni> actually im not sure the number of threads is really the problem
15:28:41 <yamahata_> 1 thread per process is one of the possibilities.
15:28:48 <mkolesni> i think the main problem is that if you have any X amount of threads, they will wake up every 5 seconds to check the db
15:29:12 <yamahata_> I agree that that's the bad situation.
15:29:22 <mkolesni> now this check is only a safety net for journal entries in case of network connectivity
15:29:34 <mkolesni> network connectivity issue
15:29:49 <mkolesni> so maybe we can limit the amount of threads running on this timer
15:30:07 <mkolesni> maybe thats the golden ticket
15:30:25 <mkolesni> since generally thread can be either triggered by an api request processing
15:30:28 <mkolesni> or by timers
15:30:41 <mkolesni> so by api request is fine, we need to process them obviously
15:30:59 <mkolesni> but by timer may be too much hammering for the poor db
15:31:23 <mkolesni> anyway i think the singleton patch is a good start for now and we can incorporate any other idea with it
15:31:41 <yamahata_> waking up threads would cause a thundering herd. Probably it can be easily improved somehow.
15:31:42 <mkolesni> now regarding journal backlog increasing i wasnt able to identify why it happened
15:32:04 <yamahata_> By introducing the threadpool patch, it can easily be tuned to 1 thread per process. It's upward compatible.
15:32:12 <mkolesni> i noticed it doesnt seem to be ODL itself
15:32:36 <yamahata_> Regarding the journal backlog, one possible way is to throttle API processing.
15:32:39 <mkolesni> i.e. sending to odl didnt cause it to get stuck
15:33:03 <mkolesni> i think the dependency move should improve that
15:33:20 <mkolesni> since it introduces a dependency calculation during journal entry creation
15:33:31 <mkolesni> so it actually would throttle the api
15:33:48 <mkolesni> and release the journal thread processing from calculating each time
15:34:03 <mkolesni> so if you think of it as a scale it should tip the scale a bit
15:34:06 <yamahata_> I mean to throttle api processing with postcommit.
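A minimal sketch of the "one journal thread per process, triggered by API events plus a safety-net timer" idea discussed above. This is not the actual networking-odl code or the singleton patch; class and method names are hypothetical:

```python
import threading

class JournalWorker:
    _instance = None
    _lock = threading.Lock()

    def __new__(cls, *args, **kwargs):
        # One journal worker object per process (the "singleton" idea).
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
            return cls._instance

    def __init__(self, sync_interval=5):
        if getattr(self, '_started', False):
            return
        self._started = True
        self._wake = threading.Event()
        self._interval = sync_interval
        threading.Thread(target=self._run, daemon=True).start()

    def trigger(self):
        # Called from API request processing once a journal row is committed.
        self._wake.set()

    def _run(self):
        while True:
            # Wake on demand, or after the safety-net interval if no API
            # request triggered us (e.g. rows left over after a network
            # connectivity issue), instead of every worker polling the DB.
            self._wake.wait(timeout=self._interval)
            self._wake.clear()
            self._process_pending_rows()

    def _process_pending_rows(self):
        # Placeholder: pick up pending journal rows and send them to ODL.
        pass
```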
15:34:27 <mkolesni> can you elaborate on that?
15:34:57 <yamahata_> we can try to send request to ODL with postcommit.
15:34:58 <mkolesni> 414989
15:35:21 <mkolesni> well that would defeat the journaling idea imhpo
15:35:23 <mkolesni> imho
15:36:23 <mkolesni> if we do that were going back to v1 driver which did it in post commit
15:36:27 <yamahata_> I'm not sure why it defeats it.
15:36:35 <yamahata_> No. I would say.
15:36:44 <yamahata_> And also we can change the picking order.
15:37:12 <yamahata_> anyway right now those are ideas.
15:37:23 <yamahata_> Do you have any other ideas?
15:37:30 <mkolesni> yes i think more analysis is needed to identify the actual problem
15:37:40 <mkolesni> then we can suggest ideas
15:38:14 <yamahata_> mkolesni: do you have anything further to discuss? or move on to other patches?
15:38:46 <mkolesni> nothing further regarding this
15:38:58 <yamahata_> let's move on to patch reviews
15:39:03 <yamahata_> #topic patches/bugs
15:39:16 <yamahata_> At first, let me explain about https://review.openstack.org/#/c/453581/
15:39:27 <yamahata_> dependency calculation patch
15:39:30 <mkolesni> sure
15:39:34 <yamahata_> It introduces a race condition.
15:39:51 <yamahata_> suppose there are two threads to update on same neutron resource.
15:40:00 <yamahata_> thread 1: start db transaction
15:40:05 <yamahata_> thread 2: start db transaction
15:40:10 <yamahata_> thread 1: calculate dependency
15:40:11 <mkolesni> oh btw did u see my reply?
15:40:17 <yamahata_> thread 2: calculate dependency
15:40:23 <yamahata_> thread 1: insert row and commit
15:40:30 <yamahata_> thread 2: insert row and commit
15:40:49 <mkolesni> https://review.openstack.org/#/c/453581/9/networking_odl/journal/journal.py
15:40:58 <yamahata_> In this case, one of thread 1 and thread 2 calculates the dependency wrongly.
15:41:07 <yamahata_> It misses the other thread.
15:41:19 <yamahata_> If the dependency is calculated with the current code, such a thing doesn't happen.
15:41:32 <mkolesni> did you read my reply there?
15:42:01 <yamahata_> Yes, and I think your reply isn't right.
15:42:08 <mkolesni> ok please explain
15:42:37 <yamahata_> Now I showed the scenario where your patch doesn't work correctly.
15:42:46 <mkolesni> it doesnt matter
15:42:55 <mkolesni> the race exists regardless of the patch
15:43:00 <mkolesni> thats what i explained
15:43:16 <mkolesni> regarding todays code:
15:43:19 <mkolesni> Let's assume 2 different updates to the same resource such as port from 2 different API calls:
15:43:19 <mkolesni> 1st update gets handled by neutron DB
15:43:19 <mkolesni> 2nd update gets handled by neutron DB <- this update won
15:43:19 <mkolesni> 2nd update gets a journal entry X created
15:43:20 <mkolesni> 1st update gets a journal entry Y created
15:43:29 <mkolesni> Now when the journal thread will process the entries, the Y entry will get sent last as it was created later in the chain of events so it got a bigger seqnum.
15:43:41 <mkolesni> so this race can happen even now
15:43:53 <mkolesni> the mechanics of how it happens dont matter
15:44:06 <mkolesni> what matters is if it can happen or not
15:44:13 <yamahata_> It doesn't matter whether X or Y wins.
15:44:38 <mkolesni> of course it matters, the later update should win
15:44:46 <yamahata_> If the dependency is calculated during creating row X or Y, the dependency calculation doesn't see the other row.
15:45:02 <yamahata_> So the dependency is wrong.
15:45:06 <mkolesni> the example i gave is with existing code base
15:45:12 <mkolesni> without the change i introduce
15:45:52 <mkolesni> in the existing code base this race can happen as well
15:46:07 <yamahata_> If we calculate dependency after committing X and Y with the existing code, dependency calculation sees the other row.
15:46:10 <mkolesni> the older update could be what gets sent to odl
15:46:45 <mkolesni> yes but in existing code they still race for the order of inserting the journal entry which will affect the dependency calculation
15:46:55 <mkolesni> so the race is still there
15:47:10 <mkolesni> regardless of where or when you do the dependency calculation
15:48:09 <mkolesni> even existing code has this race
15:48:10 <yamahata_> After committing the journal entry, the dependency calculation sees all the entries before it.
15:48:30 <mkolesni> youre talking about the existing code?
15:48:39 <yamahata_> But during the db transaction committing the journal entry, it doesn't see the other competing transaction.
15:48:41 <yamahata_> That's the issue.
15:49:19 <yamahata_> I'm talking about the difference between the existing code and your patch.
15:49:31 <yamahata_> and trying to show what's wrong.
15:49:40 <mkolesni> do you agree the existing code is raceful?
15:49:46 <yamahata_> No.
15:49:55 <mkolesni> so you say my example cant happen?
15:50:00 <mkolesni> clearly it can
15:50:11 <mkolesni> think about it a bit
15:50:40 <mkolesni> again, regarding the existing code base without any change:
15:50:41 <mkolesni> Let's assume 2 different updates to the same resource such as port from 2 different API calls:
15:50:42 <mkolesni> 1st update gets handled by neutron DB
15:50:42 <mkolesni> 2nd update gets handled by neutron DB <- this update won
15:50:42 <mkolesni> 2nd update gets a journal entry X created
15:50:42 <yamahata_> I agree that either of the competing db transactions can win.
15:50:42 <mkolesni> 1st update gets a journal entry Y created
15:50:43 <mkolesni> Now when the journal thread will process the entries, the Y entry will get sent last as it was created later in the chain of events so it got a bigger seqnum.
15:50:44 <yamahata_> That's okay.
15:50:46 <mkolesni> So, we have a discrepancy between Neutron DB and ODL.
15:50:57 <mkolesni> you say this example is wrong?
15:51:21 <yamahata_> I agree that either X or Y can win.
15:51:25 <yamahata_> That's okay.
15:51:34 <mkolesni> ok so its raceful then or not?
15:51:48 <yamahata_> During those db transactions to create X or Y, the transaction doesn't see the other transaction.
15:52:01 <yamahata_> The thread to create entry X doesn't see Y.
15:52:05 <mkolesni> please focus on the existing situation
15:52:10 <yamahata_> the thread to create entry Y doesn't see X.
15:52:11 <mkolesni> not having any change
15:52:19 <mkolesni> us it raceful or not?
15:52:42 <mkolesni> is*
15:52:44 <yamahata_> So the dependency calculated during the db transaction doesn't include X or Y.
15:53:02 <mkolesni> isaku please focus on existing situation
15:53:09 <mkolesni> not any change that could happen
15:53:18 <mkolesni> is existing situation raceful?
15:53:23 <yamahata_> No.
15:53:35 <mkolesni> how no? youre contradicting yourself
15:53:52 <mkolesni> <yamahata_> I agree that either X or Y can win.
15:53:52 <mkolesni> <yamahata_> That's okay.
15:53:54 <yamahata_> If you mean either X or Y can win, I agree.
15:54:02 <mkolesni> so its raceful
15:54:07 <yamahata_> If you call it raceful, yes it's raceful.
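A small, self-contained simulation (plain Python, not driver code) of the interleaving yamahata_ describes: each "transaction" snapshots the committed rows, calculates dependencies from that snapshot, and only then commits, so neither entry sees the other:

```python
import threading

committed_rows = []               # (seqnum, depends_on) pairs visible to new transactions
commit_lock = threading.Lock()
in_flight = threading.Barrier(2)  # force the two "transactions" to overlap

def create_journal_entry(seqnum):
    snapshot = list(committed_rows)    # start of transaction: only committed rows are visible
    in_flight.wait()                   # both transactions are now in progress concurrently
    depends_on = [row[0] for row in snapshot]   # dependency calculation inside the transaction
    with commit_lock:
        committed_rows.append((seqnum, depends_on))   # insert row and commit

threads = [threading.Thread(target=create_journal_entry, args=(n,)) for n in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both entries end up with empty dependency lists even though one should
# depend on the other; with post-commit calculation (the current code) the
# later entry would see every row committed before it.
print(committed_rows)   # e.g. [(1, []), (2, [])]
```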
15:54:08 <mkolesni> thats the definition of raceful
15:54:21 <mkolesni> if either one of the updates can win its raceful
15:54:28 <mkolesni> correct?
15:54:42 <yamahata_> Okay, it's correct.
15:54:53 <mkolesni> the outcome can be either correct or not, depends on who wins the race
15:55:01 <mkolesni> ok and thats the existing code
15:55:04 <mkolesni> now my change
15:55:13 <mkolesni> it doesnt make this race better
15:55:23 <mkolesni> but it doesnt make it worse
15:55:26 <mkolesni> its still there
15:55:32 <mkolesni> thats all im saying
15:55:37 <yamahata_> Your patch makes the dependency calculation wrong.
15:55:40 <mkolesni> the race is not related to the change
15:56:07 <mkolesni> how can it be wrong if its according to the spec we agreed upon?
15:56:18 <mkolesni> im sorry youre not making much sense
15:56:19 <yamahata_> Ah in that sense, the race itself isn't the cause.
15:56:38 <yamahata_> Let's assume X wins over Y.
15:56:40 <mkolesni> if anything, the patch is excellent because it brought this race to our attention
15:56:45 <yamahata_> X has a smaller seqnum than Y.
15:57:01 <yamahata_> With your patch, the dependency calculation of Y doesn't include X.
15:57:14 <yamahata_> With the existing code, the dependency calculation of Y includes X.
15:57:22 <mkolesni> look the mechanism that causes the race to happen doesnt matter
15:57:44 <mkolesni> in the end if you look at it as a black box the race is there before the change and it's there after
15:58:11 <mkolesni> now we can debate about whether this race should be fixed or not, but you can't claim this is incorrect
15:58:37 <mkolesni> so given that this race exists even now this change is not about fixing it
15:59:15 <yamahata_> Ah, I see. I'm not claiming to remove the race.
15:59:26 <mkolesni> i agree theoretically this is a bit more dangerous, but the only real scenario that you & i can think of that would actually have a race is updating the same resource
15:59:35 <yamahata_> I'm claiming to make the dependency calculation correct with your patch.
15:59:39 <mkolesni> and this race already exists
15:59:51 <mkolesni> im not sure then what you expect
16:00:04 <mkolesni> if you have an idea please propose
16:00:05 <yamahata_> Again with your patch, dependency calculation can be wrong.
16:00:16 <yamahata_> One possible way is to use a table lock.
16:00:36 <mkolesni> to solve a single race that already exists today?
16:00:43 <yamahata_> or somehow the dependency calculation and row insert need to be done atomically.
16:00:50 <mkolesni> why not do this fix separately?
16:01:12 <yamahata_> No. Because your patch newly introduces the issue.
16:01:36 <mkolesni> no we already agreed this race is there without my patch
16:01:51 <mkolesni> ok sorry gtg but we can talk about this another time
16:02:16 <yamahata_> Based on your definition, I agree that the race exists. I'm claiming your patch includes the issue with dependency calculation.
16:02:28 <yamahata_> let's continue the discussion.
16:02:34 <yamahata_> any other patches to discuss?
16:02:39 <mkolesni> bye guys
16:03:32 <rajivk> yamahata, what is the issue with the CI?
16:03:50 <yamahata_> rajivk: sorry, I haven't had time to look into your patch.
16:03:53 <rajivk> I checked, odl could not start
16:03:58 <yamahata_> I'm going to do it this week.
16:04:04 <rajivk> thanks,
16:04:12 <yamahata_> I'm aware that the CI is broken, again.
16:04:23 <rajivk> Did you check the cause?
16:04:29 <yamahata_> not yet.
16:04:52 <rajivk> okay, just in case you don't get much time.
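A companion sketch for the table-lock / atomic-insert direction yamahata_ raises in the patch discussion above. Here a process-local lock only illustrates the idea of doing the calculation and the insert as one unit; in a real multi-process neutron deployment the serialization would have to live in the database itself (a table lock or an equivalent), which is the open question in that discussion:

```python
import threading

committed_rows = []
atomic_lock = threading.Lock()   # stand-in for a DB-level lock in a real deployment

def create_journal_entry(seqnum):
    with atomic_lock:
        # Dependency calculation and insert happen as one unit, so the later
        # entry always sees the earlier one and records it as a dependency.
        depends_on = [row[0] for row in committed_rows]
        committed_rows.append((seqnum, depends_on))

threads = [threading.Thread(target=create_journal_entry, args=(n,)) for n in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(committed_rows)   # e.g. [(1, []), (2, [1])]
```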
16:05:13 <rajivk> Ping me the result of your investigation or a direction to investigate
16:05:18 <yamahata_> sure.
16:05:25 <yamahata_> anything else?
16:05:26 <rajivk> I will start looking into it in the morning.
16:05:34 <rajivk> no
16:05:58 <yamahata_> #topic open mike
16:06:07 <yamahata_> anything else? or can we close the meeting?
16:06:18 <rajivk> close the meeting.
16:06:30 <yamahata_> thank you everyone
16:06:36 <yamahata_> #topic cookies
16:06:41 <yamahata_> #endmeeting