12:59:37 #startmeeting doctor
12:59:37 Meeting started Tue May 30 12:59:37 2017 UTC. The chair is r-mibu_. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:59:37 Useful Commands: #action #agreed #help #info #idea #link #topic.
12:59:37 The meeting name has been set to 'doctor'
12:59:45 #topic roll call
13:00:03 I'm stuck in a meeting. should be free in 30 minutes and connect to GTM
13:01:24 #info dwj
13:01:24 #info Ryota Mibu
13:01:44 #info Bertrand Souville
13:02:56 #link https://etherpad.opnfv.org/p/doctor_meetings
13:04:34 #info OPNFV Summit
13:05:20 #info Gerald Kunzmann
13:11:15 #info https://github.com/ansible/ansible/tree/devel/lib/ansible/modules/cloud/openstack
13:14:12 dwj, can you propose a good roast duck restaurant?
13:16:09 sure, wait a moment
13:16:21 #topic E release status
13:17:10 #link https://wiki.opnfv.org/display/doctor/Euphrates+Planning
13:17:37 #link sorry, it's in Chinese. http://www.dianping.com/shop/508128
13:25:01 #info ZTE team is building additional doctor pods in ZTE with apex and fuel
13:26:32 #info Ryota proposed to keep the current *running* doctor pods in Okinawa until we get new pods in ZTE, so that we can avoid a situation where we don't have a doctor pod for verification of new patch sets
13:26:59 +1
13:27:06 +1
13:27:14 +1
13:32:51 #link https://jira.opnfv.org/browse/DOCTOR-108
13:33:44 #info A few bugs have been registered by Umar (DOCTOR-108 to DOCTOR-111).
13:34:17 DOCTOR-108 bug has been resolved
13:37:54 Not sure if Umar is on the call (am not connected to GTM)
13:38:08 Umar is an intern at NEC working with me on Doctor
13:39:11 cgoncalves: Umar is explaining his findings now. Thanks
13:39:32 bertys: perfect. thanks
13:40:43 #info https://jira.opnfv.org/browse/DOCTOR-111 : Umar will take
13:45:39 Umar Farooq proposed doctor: Fix session error with INSPECTOR_TYPE=congress https://gerrit.opnfv.org/gerrit/35529
13:45:54 #topic AoB
14:05:44 r-mibu_: #endmeeting plz :-)
14:17:24 Merged doctor: Fix session error with INSPECTOR_TYPE=congress https://gerrit.opnfv.org/gerrit/35529
04:01:07 Yujun Zhang proposed doctor: Create devstack plugin for osprofiler configuration https://gerrit.opnfv.org/gerrit/35263
04:37:48 Yujun Zhang proposed doctor: Create devstack plugin for osprofiler configuration https://gerrit.opnfv.org/gerrit/35263
04:44:27 Yujun Zhang proposed doctor: Create devstack plugin for osprofiler configuration https://gerrit.opnfv.org/gerrit/35263
06:23:26 wenjuan dong proposed doctor: refactor the monitor https://gerrit.opnfv.org/gerrit/34463
06:46:34 wenjuan dong proposed doctor: refactor the monitor https://gerrit.opnfv.org/gerrit/34463
06:58:00 Merged doctor: Create devstack plugin for osprofiler configuration https://gerrit.opnfv.org/gerrit/35263
07:50:34 Merged doctor: Adding PYTHON_ENABLE option https://gerrit.opnfv.org/gerrit/33841
08:53:05 Yujun Zhang proposed doctor: Remove obsolete packages https://gerrit.opnfv.org/gerrit/35565
09:24:38 Merged doctor: Remove obsolete packages https://gerrit.opnfv.org/gerrit/35565
12:17:51 yujunz: hi
12:17:56 hi
12:18:16 yujunz: apologies I haven't had much time yet to contribute to our slide deck
12:18:46 yujunz: apart from setting up the nodes and openstack, I've also been doing a couple of benchmarking runs
12:19:15 OK, benchmarking between which conditions?
12:19:18 different combinations such as: VM_COUNT, INSPECTOR_TYPE (sample/congress) and cherry-picking congress with multi-threading
12:19:47 Got it. What do you mean by cherry-picking?
12:19:51 total notification time and trying to break down to requests that take more time
12:20:13 for instance, nova reset state takes A LOT of time even though it's an HTTP 202 request
12:20:45 OK, which api exactly?
12:20:51 Could you provide the command line?
12:20:58 yujunz: https://github.com/openstack/congress/commit/02ff94adb9bc433549f5b3483f36b2ede19b3614
12:21:20 yujunz: https://git.opnfv.org/doctor/tree/tests/inspector.py#n36
12:21:35 ^ this one
12:22:17 If we can identify it as a bottleneck manually, then we can run osprofiler on this api for manual analysis
12:22:47 I've collected ~2800 results already :)
12:23:07 yujunz: exactly. that would be something really good to profile and present
12:23:14 "POST /v2.1/servers/93c3c344-016c-4bcc-94d5-f677d4e0e151/action HTTP/1.1" status: 202 len: 339 time: 0.3593650
12:23:45 OK, I can take this as high priority and do a manual analysis on devstack first
12:23:51 Currently, because the sample inspector/monitor/consumer does not have an osprofiler interface, there is no way to generate a whole breakdown automatically
12:24:02 But for nova, keystone, it won't be too difficult
12:24:20 yujunz: would it be possible to consolidate our tests in one platform?
12:24:36 e.g. the baremetal NEC POD you have access to
12:24:57 deployed with devstack stable/ocata
12:25:02 You mean doing tests on this platform?
12:25:19 yes
12:25:35 so that we could start collecting data and analyze
12:26:17 tojuvone: ^
12:27:15 OK. I may debug in local devstack first. When it is clear how to do it, I will draft a guide and collect data on NEC pod for presentation
12:27:37 nova host mark down is quite fast compared to server reset state
12:27:42 e.g. "PUT /v2.1/os-services/force-down HTTP/1.1" status: 200 len: 412 time: 0.0636401
12:27:49 and it's HTTP 200!!
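Editor's sketch: the two request log lines quoted above (server reset state and host force-down) can be compared mechanically. The snippet below is a minimal sketch, not part of the Doctor test code; the regular expression and the helper name `parse_request_line` are assumptions built only from the two lines pasted in this log, and other log formats may differ.

```python
import re

# Pattern inferred from the two log lines quoted in this chat (assumption:
# other request log lines follow the same layout).
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'status: (?P<status>\d+) len: (?P<length>\d+) time: (?P<seconds>[\d.]+)'
)

# The two lines exactly as pasted in the conversation.
SAMPLES = [
    '"POST /v2.1/servers/93c3c344-016c-4bcc-94d5-f677d4e0e151/action HTTP/1.1" '
    'status: 202 len: 339 time: 0.3593650',
    '"PUT /v2.1/os-services/force-down HTTP/1.1" '
    'status: 200 len: 412 time: 0.0636401',
]


def parse_request_line(line):
    """Return method, path, status, length and seconds, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        "length": int(match.group("length")),
        "seconds": float(match.group("seconds")),
    }


if __name__ == "__main__":
    reset_state, force_down = (parse_request_line(s) for s in SAMPLES)
    ratio = reset_state["seconds"] / force_down["seconds"]
    print("reset state: HTTP %d in %.4fs" % (reset_state["status"], reset_state["seconds"]))
    print("force down:  HTTP %d in %.4fs" % (force_down["status"], force_down["seconds"]))
    print("reset state is %.1fx slower" % ratio)
```

Parsing the timing field this way makes it easy to rank requests by duration once more log lines are collected.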
12:28:02 yujunz: cool, thanks
12:28:13 And I noticed that markdown is actually not affecting the whole notification time
12:28:26 It sometimes happens **after** the consumer gets notified
12:28:32 yujunz: it's so that we don't have data from two different platforms. that might confuse us and the audience
12:28:57 yujunz: that's quite expected
12:29:28 we get the notification because of changing the state of the VM to 'error', not marking the host down
12:29:46 So we can pay less attention to mark down and focus on reset state
12:29:54 correct
12:30:08 OK
12:30:20 By the way, where do you collect the test data currently?
12:30:31 May I have a peek?
12:31:12 sure! node13:/home/opnfv/doctor/tests/notification_time.txt
12:31:51 format is: $DATETIME $INSPECTOR $VM_COUNT $NOTIFICATION_TIME
12:32:32 last results are from congress with the parallel API calls commit I cherry-picked from congress master branch
12:34:22 yujunz: note that I fixed a couple of bugs (keystone auth, etc. that Umar reported to Jira) and slightly modified run.sh
12:34:45 OK. I'll pull the latest version
12:43:15 I saw the data now. Need to take some notes in case I forget the format...
12:44:26 yujunz: if you forget, you can check run.sh#L335
12:44:44 sorry, L342
12:44:45 Where can I find the openrc for the devstack under test?
12:44:56 ~/devstack/openrc
12:45:15 also on node13?
12:45:21 yup
12:45:32 node13 is the controller; node9 is compute
12:46:37 I'm very interested in profiling the servers.reset_state API, especially because it returns an HTTP 202 (accepted) rather than 200 (ok/processed)
12:49:15 OK. I'll check it as soon as possible.
12:50:06 much appreciated!
13:13:28 Got something cgoncalves
13:13:52 Check https://openzero-team.github.io/doctor-perf/ for a command line result
13:16:05 yujunz: thanks! \o/
13:16:14 this is from multiple server reset states?
13:16:20 It is one
13:16:29 Please check from the last but one wsgi call
13:16:40 it took 1521ms??? :O
13:16:42 Which is the reset state action
13:17:02 Yes, as I am running from the command line, there seem to be many rounds of auth
13:17:15 For the api cost
13:17:25 Check the last level 1 wsgi nova
13:17:36 It is 311ms
13:17:59 yup, 311ms
13:18:19 we cannot break it further from there?
13:18:35 nevermind
13:20:03 There is still some gap not covered by profiler
13:20:18 Currently, only API, RPC and database access are covered
13:20:20 exactly! I was just about to say that
13:20:44 osprofiler covers ~150ms but not the remaining time
13:21:15 The rest could be done by patching nova and keystone classes
13:21:23 I'll check tomorrow.
13:21:59 I was so surprised with this that I even tried on newton
13:22:09 yujunz: thanks a lot!
13:22:25 with newton, nova took the same time
13:23:26 OK, bye. Going home now
13:24:07 see you
03:48:44 cgoncalves, yujunz Hi, I should be around today. I have been a bit out of office the past week.
05:59:08 cgoncalves we need to restack devstack@node13 to enable profiler. Could you try it to see if the documentation in doctor/devstack is enough for a user to follow?
05:59:44 ICYMI profiler for reset state on http://doctor.surge.sh/
05:59:58 ping tojuvone
06:00:20 yujunz, hi
06:00:50 Just to repost the link of the profiler run we did manually yesterday http://doctor.surge.sh/
06:01:18 Note the last 311ms from reset state
06:05:25 so this is all from "reset server state"
06:06:09 from cli command
06:06:28 When I added prints to the API..
06:06:56 it was under 30ms for what happened before doing notification related stuff
06:07:12 Yes, I believe that is the right way to go
06:07:13 and 100ms for doing that
06:07:25 Why do we need to set the state and let the controller send the alarm?
06:08:02 we do not
06:08:06 but we do
06:08:34 that is exactly why I made the sample code for sending notification from inspector
06:10:19 calling reset server state should in general not be needed
06:10:54 morning guys. I just woke up. give me 1 hour to get to the office and I will get back to this
06:11:03 force down will be seen in servers API and we can send notif from inspector to have alarm
06:11:26 yujunz: feel free to restart any service
06:11:38 but currently the Doctor way is to have reset server state -> notif -> alarm
06:12:49 btw, currently working with Horizon for our demo. Got the tags showing there :)
06:13:03 tojuvone: correct. it has its reasons and logically it is what makes more sense doing it that way
06:13:24 OK. Let's discuss when cgoncalves gets to the office
06:13:39 Did you get woken up by an alert from IRC, cgoncalves?
06:13:44 we can have a quick call if you're available
06:14:31 yujunz: haha no. weather is too hot these days
06:14:34 inspector notif just has the downside that the alarm might reach the user before nova is actually forced down.
06:14:50 tojuvone: exactly
06:15:19 tojuvone: thing is this only happens to server reset state
06:15:32 cgoncalves, yujunz Yes, let's continue later when cgoncalves is at the office
06:15:41 OK, I will restack node13 to enable profiler. Hope everything goes well.
06:15:55 and we cannot speed up notif because there is no instanceupdate.start event
06:16:13 whereas other api calls have
06:16:26 Then I shall decide whether to go deep into reset state or continue on doctor profiler
06:16:59 * cgoncalves disconnects from mobile ssh+screen+irssi now :)
07:08:56 yujunz, tojuvone: I'm back
07:10:02 cgoncalves, welcome :) I need to go in 3mins or so. will be back in 1½h
07:10:58 tojuvone: no problem
07:11:52 yujunz: have you restacked? otherwise I can do it now
07:11:59 Done
07:12:26 Writing some instructions to run osprofiler
07:12:30 The default hint from osprofiler-cli is somewhat misleading
07:12:51 have you created the nova cell and discovered hosts?
07:13:05 No
07:13:41 I just ran devstack. Not quite clear about the post setup steps...
07:13:59 ok, it is needed in ocata+devstack
07:14:21 I'll restack node9 and run post-stack
07:14:26 Any instructions?
07:14:56 $ nova-manage cell_v2 create_cell --verbose --name cell1
07:15:02 and after stacking all computes:
07:15:10 $ nova-manage cell_v2 discover_hosts
07:18:55 Could you take it over? I'm not quite familiar with it, to be honest...
07:19:45 sure, doing it now
07:20:28 yujunz: done
07:20:37 you can run osprofiler now
07:21:30 Thanks
07:21:53 Shall we have a discussion on the priority? cgoncalves tojuvone
07:22:16 +1
07:22:58 Currently, I have two major tasks. First is to add profiler support for the doctor sample inspector/monitor/consumer so we can have a whole view of the fault management
07:23:22 Second is to go deep into nova reset state, which seems to be the current bottleneck
07:24:26 It seems whether reset state is required or not is still under discussion. So maybe I can work on doctor profiler first?
07:25:18 yujunz: for me it is measuring total notification time with different combinations (# vms, inspectors, source for triggering alarm)
07:25:22 yujunz: yes
07:25:55 I also still need to extend congress to send event notification to the bus
07:26:26 I've already tested the congress parallel actions: working
07:26:32 OK. By "source for triggering alarm" you mean set state vs send alarm from inspector?
07:27:10 and I'd like to try using the neutron port data plane status API and compare
07:27:35 yujunz: yes
07:27:55 conceptually speaking, triggering the alarm from the inspector instead of from the controller (nova/neutron/cinder/...)
07:28:08 OK. Sounds good that we don't have much overlap :-)
07:28:11 so I also still have lots to cover
07:28:20 right :)
07:28:36 problem is if/how we share the same platform
07:29:09 I'd perhaps recommend each of us work on their tasks in their own PODs and then integrate in one POD
07:29:32 Works for me
07:30:12 great. you can use NEC POD for the next couple of hours if you want. I need to take care of other things this morning
07:31:03 OK. I added local.conf under version control on NEC pod
07:31:16 So that we can track the changes that have been made
07:36:30 great, thanks
07:37:11 yujunz: why did you enable panko?
07:37:33 It is the default backend for storing profiler data
07:37:51 Although I switched it to redis, it is still required in installation
07:38:01 switched backend to redis
07:38:05 ah ok
07:38:25 you mentioned that in your email but I had forgotten
07:38:41 I'll put some comments in local.conf :-)
08:41:44 cgoncalves, yujunz back. So I have just worked on Horizon for the demo, but got the tags support now. Then there is the piece of code for having notif from inspector and alarm from that.
08:43:01 OK. I think I saw the code piece in email. Maybe you can push it to a personal git repo so we can fetch and use it if needed?
08:43:46 yujunz, yes, I can look into that
08:45:46 just the notif from inspector still has hard coding and a manual step to change one yaml file.
08:47:53 ok
08:48:59 yujunz: I missed the doctor meeting this week. I read the minutes and it seems that wenjuan is now proposing refactoring the tests in ansible?
08:49:23 Using the python modules in ansible, exactly
08:50:08 so what happened to the python effort?
08:50:26 Most operations are well supported in ansible
08:50:28 It just lacks an alarm module
08:51:13 hmm. was there any decision made?
08:51:23 otherwise I'd propose discussing that in Beijing
08:51:56 No, we are still investigating the pros and cons and will bring it to the summit
08:51:59 and re-titling the design session 'overview of python refactor' slightly differently
08:52:04 ok, great
08:52:29 Yes, that's the idea. Instead of pointing out the solution, let's have a look at what problem we are trying to solve
08:52:36 And what is the best solution
08:52:48 couldn't agree more with you!
08:53:43 leveraging tempest could also be a good idea
08:58:46 I haven't looked into tempest too much yet. Will you give some insights about tempest in the summit session?
09:06:21 I'm also not an expert but I can say a few words, yes
09:06:36 Ryota should be more experienced in that regard, I guess
09:07:05 it could also be a mix of ansible+tempest. let's discuss that in 2 weeks :)
09:09:01 Sounds good. Looking forward to the discussion
03:44:44 put the "inspector notification" stuff here: https://github.com/tojuvone/doctor-notif
13:02:22 r-mibu: Error: Can't start another meeting, one is in progress. Use #endmeeting first.
13:02:30 #endmeeting
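Editor's sketch: the log describes the results file format as `$DATETIME $INSPECTOR $VM_COUNT $NOTIFICATION_TIME`. The snippet below is a rough sketch of how such lines could be grouped per inspector and VM count. It assumes whitespace-separated fields, an integer VM count and a numeric notification time; the function names and the sample lines are hypothetical and are not taken from the real notification_time.txt.

```python
from collections import defaultdict
from statistics import mean


def parse_result(line):
    """Split one result line into (datetime, inspector, vm_count, seconds).

    Fields are taken from the right so that a datetime containing spaces
    would still parse (assumption: the last three fields never contain spaces).
    """
    parts = line.rsplit(None, 3)
    if len(parts) != 4:
        raise ValueError("unexpected result line: %r" % line)
    when, inspector, vm_count, seconds = parts
    return when, inspector, int(vm_count), float(seconds)


def mean_by_condition(lines):
    """Average notification time for each (inspector, vm_count) pair."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, inspector, vm_count, seconds = parse_result(line)
        groups[(inspector, vm_count)].append(seconds)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical lines for illustration only; not real measurements.
HYPOTHETICAL_LINES = [
    "2017-06-01T12:00:00 sample 1 0.5",
    "2017-06-01T12:05:00 sample 1 1.5",
    "2017-06-01T12:10:00 congress 10 2.0",
]

if __name__ == "__main__":
    for (inspector, vm_count), avg in sorted(mean_by_condition(HYPOTHETICAL_LINES).items()):
        print("%s vms=%d mean=%.3f" % (inspector, vm_count, avg))
```

Grouping by inspector and VM count mirrors the benchmark combinations discussed in the log (VM_COUNT and INSPECTOR_TYPE).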