#opnfv-doctor log

12:59:37 <r-mibu_> #startmeeting doctor
12:59:37 <collabot`> Meeting started Tue May 30 12:59:37 2017 UTC.  The chair is r-mibu_. Information about MeetBot at http://wiki.debian.org/MeetBot.
12:59:37 <collabot`> Useful Commands: #action #agreed #help #info #idea #link #topic.
12:59:37 <collabot`> The meeting name has been set to 'doctor'
12:59:45 <r-mibu_> #topic roll call
13:00:03 <cgoncalves> I'm stuck in a meeting. should be free in 30 minutes and connect to GTM
13:01:24 <dwj> #info dwj
13:01:24 <r-mibu_> #info Ryota Mibu
13:01:44 <bertys> #info Bertrand Souville
13:02:56 <r-mibu_> #link https://etherpad.opnfv.org/p/doctor_meetings
13:04:34 <r-mibu_> #info OPNFV Summit
13:05:20 <GeraldK> #info Gerald Kunzmann
13:11:15 <dwj> #info https://github.com/ansible/ansible/tree/devel/lib/ansible/modules/cloud/openstack
13:14:12 <GeraldK> dwj, can you propose a good roast duck restaurant?
13:16:09 <dwj> sure, wait a moment
13:16:21 <r-mibu_> #topic E release status
13:17:10 <r-mibu_> #link https://wiki.opnfv.org/display/doctor/Euphrates+Planning
13:17:37 <dwj> #link sorry, it's in Chinese. http://www.dianping.com/shop/508128
13:25:01 <r-mibu_> #info ZTE team is building additional doctor pods in ZTE with apex and fuel
13:26:32 <r-mibu_> #info Ryota proposed to keep the current *running* doctor pods in okinawa until we get new pods in ZTE, so that we avoid a situation where we have no doctor pod for verifying new patch sets
13:26:59 <bertys> +1
13:27:06 <GeraldK> +1
13:27:14 <dwj> +1
13:32:51 <r-mibu_> #link https://jira.opnfv.org/browse/DOCTOR-108
13:33:44 <bertys> #info Few bugs have been registered by Umar (DOCTOR-108 to DOCTOR-111).
13:34:17 <bertys> DOCTOR-108 bug has been resolved
13:37:54 <cgoncalves> Not sure if Umar is on the call (am not connected to GTM)
13:38:08 <cgoncalves> Umar is an intern at NEC working with me on Doctor
13:39:11 <bertys> cgoncalves: Umar is explaining now his findings. Thanks
13:39:32 <cgoncalves> bertys: perfect. thanks
13:40:43 <r-mibu_> #info https://jira.opnfv.org/browse/DOCTOR-111 : Umar will take
13:45:39 <OPNFV-Gerrit-Bot> Umar Farooq proposed doctor: Fix session error with INSPECTOR_TYPE=congress  https://gerrit.opnfv.org/gerrit/35529
13:45:54 <r-mibu_> #topic AoB
14:05:44 <cgoncalves> r-mibu_: #endmeeting plz :-)
14:17:24 <OPNFV-Gerrit-Bot> Merged doctor: Fix session error with INSPECTOR_TYPE=congress  https://gerrit.opnfv.org/gerrit/35529
04:01:07 <OPNFV-Gerrit-Bot> Yujun Zhang proposed doctor: Create devstack plugin for osprofiler configuration  https://gerrit.opnfv.org/gerrit/35263
04:37:48 <OPNFV-Gerrit-Bot> Yujun Zhang proposed doctor: Create devstack plugin for osprofiler configuration  https://gerrit.opnfv.org/gerrit/35263
04:44:27 <OPNFV-Gerrit-Bot> Yujun Zhang proposed doctor: Create devstack plugin for osprofiler configuration  https://gerrit.opnfv.org/gerrit/35263
06:23:26 <OPNFV-Gerrit-Bot> wenjuan dong proposed doctor: refactor the monitor  https://gerrit.opnfv.org/gerrit/34463
06:46:34 <OPNFV-Gerrit-Bot> wenjuan dong proposed doctor: refactor the monitor  https://gerrit.opnfv.org/gerrit/34463
06:58:00 <OPNFV-Gerrit-Bot> Merged doctor: Create devstack plugin for osprofiler configuration  https://gerrit.opnfv.org/gerrit/35263
07:50:34 <OPNFV-Gerrit-Bot> Merged doctor: Adding PYTHON_ENABLE option  https://gerrit.opnfv.org/gerrit/33841
08:53:05 <OPNFV-Gerrit-Bot> Yujun Zhang proposed doctor: Remove obsolete packages  https://gerrit.opnfv.org/gerrit/35565
09:24:38 <OPNFV-Gerrit-Bot> Merged doctor: Remove obsolete packages  https://gerrit.opnfv.org/gerrit/35565
12:17:51 <cgoncalves> yujunz: hi
12:17:56 <yujunz> hi
12:18:16 <cgoncalves> yujunz: apologies I haven't had much time yet to contribute to our slide deck
12:18:46 <cgoncalves> yujunz: apart from setting up the nodes and openstack, I've been also running a couple of benchmarking runs
12:19:15 <yujunz> OK, benchmarking between which conditions?
12:19:18 <cgoncalves> different combinations such as: VM_COUNT, INSPECTOR_TYPE (sample/congress) and cherry-picking congress with multi-threading
12:19:47 <yujunz> Got it. What do you mean by cherry-picking?
12:19:51 <cgoncalves> total notification time and trying to break down to requests that take more time
12:20:13 <cgoncalves> for instance, nova reset state takes A LOT of time even though it's a HTTP 202 request
12:20:45 <yujunz> OK, which api exactly?
12:20:51 <yujunz> Could you provide the command line?
12:20:58 <cgoncalves> yujunz: https://github.com/openstack/congress/commit/02ff94adb9bc433549f5b3483f36b2ede19b3614
12:21:20 <cgoncalves> yujunz: https://git.opnfv.org/doctor/tree/tests/inspector.py#n36
12:21:35 <cgoncalves> ^ this one
12:22:17 <yujunz> If we can identify it as a bottleneck manually, then we can run osprofiler on this api for manual analysis
12:22:47 <cgoncalves> I've collected ~2800 results already :)
12:23:07 <cgoncalves> yujunz: exactly. that would be something really good to profile and present
12:23:14 <cgoncalves> "POST /v2.1/servers/93c3c344-016c-4bcc-94d5-f677d4e0e151/action HTTP/1.1" status: 202 len: 339 time: 0.3593650
12:23:45 <yujunz> OK, I can take this as high priority and do a manual analysis on devstack first
12:23:51 <yujunz> Currently, the sample inspector/monitor/consumer does not have an osprofiler interface, so there is no way to generate a whole breakdown automatically
12:24:02 <yujunz> But for nova, keystone, it won't be too difficult
12:24:20 <cgoncalves> yujunz: would it be possible to consolidate our tests in one platform?
12:24:36 <cgoncalves> e.g. the baremetal NEC POD you have access to
12:24:57 <cgoncalves> deployed with devstack stable/ocata
12:25:02 <yujunz> You mean doing test on this platform?
12:25:19 <cgoncalves> yes
12:25:35 <cgoncalves> so that we could start collecting data and analyze
12:26:17 <cgoncalves> tojuvone: ^
12:27:15 <yujunz> OK. I may debug in local devstack first. When it is clear how to do it, I will draft a guide and collect data on NEC pod for presentation
12:27:37 <cgoncalves> nova host mark down is quite fast compared to server reset state
12:27:42 <cgoncalves> e.g. "PUT /v2.1/os-services/force-down HTTP/1.1" status: 200 len: 412 time: 0.0636401
12:27:49 <cgoncalves> and it's HTTP 200!!
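
To compare the two requests quoted above, the log fragments can be parsed directly. This is a minimal sketch: the regular expression is inferred only from the two lines pasted in this log, not from nova's logging documentation, and `parse_request` is a hypothetical helper name.

```python
import re

# Field layout inferred from the two log fragments quoted in the chat;
# this is an assumption, not a documented nova log format.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'status: (?P<status>\d{3}) len: (?P<length>\d+) time: (?P<seconds>[\d.]+)'
)


def parse_request(line):
    """Return method, path, status and duration for one log fragment, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        "seconds": float(match.group("seconds")),
    }


# The two lines pasted by cgoncalves.
reset_state = parse_request(
    '"POST /v2.1/servers/93c3c344-016c-4bcc-94d5-f677d4e0e151/action HTTP/1.1" '
    'status: 202 len: 339 time: 0.3593650'
)
force_down = parse_request(
    '"PUT /v2.1/os-services/force-down HTTP/1.1" status: 200 len: 412 time: 0.0636401'
)

ratio = reset_state["seconds"] / force_down["seconds"]
print(f"reset state took {ratio:.1f}x as long as force-down")
```

On these two samples the reset state request takes several times longer than force-down, which matches the observation in the chat. Two data points are not a benchmark, so the ratio is only indicative.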
12:28:02 <cgoncalves> yujunz: cool, thanks
12:28:13 <yujunz> And I noticed that markdown is actually not affecting the whole notification time
12:28:26 <yujunz> It sometimes happens **after** the consumer gets notified
12:28:32 <cgoncalves> yujunz: it's so that we don't have data from two different platforms. that might confuse us and the audience
12:28:57 <cgoncalves> yujunz: that's quite expected
12:29:28 <cgoncalves> we get the notification because of changing the state of the VM to 'error', not marking the host down
12:29:46 <yujunz> So we can pay less attention to mark down and focus on reset state
12:29:54 <cgoncalves> correct
12:30:08 <yujunz> OK
12:30:20 <yujunz> By the way, where did you collect the test data currently?
12:30:31 <yujunz> May I have a peek?
12:31:12 <cgoncalves> sure! node13:/home/opnfv/doctor/tests/notification_time.txt
12:31:51 <cgoncalves> format is: $DATETIME $INSPECTOR $VM_COUNT $NOTIFICATION_TIME
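
A small reader for the result format described above could look like the sketch below. It assumes the four fields are whitespace-separated and that `$NOTIFICATION_TIME` is a plain number; neither is confirmed in the chat. The sample rows are invented for illustration and are not taken from `notification_time.txt`, and `parse_result` and `mean_by_combination` are hypothetical names.

```python
from collections import defaultdict
from statistics import mean


def parse_result(line):
    """Split one '$DATETIME $INSPECTOR $VM_COUNT $NOTIFICATION_TIME' row."""
    # Split from the right so a DATETIME that contains spaces stays in one piece.
    when, inspector, vm_count, notif_time = line.strip().rsplit(None, 3)
    return when, inspector, int(vm_count), float(notif_time)


def mean_by_combination(lines):
    """Average notification time per (inspector, vm_count) pair."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, inspector, vm_count, notif_time = parse_result(line)
        groups[(inspector, vm_count)].append(notif_time)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical rows, made up for this example only.
SAMPLE = [
    "2017-06-01T12:00:00 sample 1 0.25",
    "2017-06-01T12:05:00 sample 1 0.75",
    "2017-06-01T12:10:00 congress 10 1.5",
]

for (inspector, vm_count), avg in sorted(mean_by_combination(SAMPLE).items()):
    print(inspector, vm_count, avg)
```

Grouping by inspector and VM count keeps results from different configurations apart, which is the comparison the benchmarking runs are after.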
12:32:32 <cgoncalves> last results are from congress with the parallel API calls commit I cherry-picked from congress master branch
12:34:22 <cgoncalves> yujunz: note that I fixed a couple of bugs (keystone auth, etc that Umar reported to Jira) and slightly modified run.sh
12:34:45 <yujunz> OK. I'll pull the latest version
12:43:15 <yujunz> I saw the data now. Need to take some notes in case I forget the format...
12:44:26 <cgoncalves> yujunz: if you forget, you can check run.sh#L335
12:44:44 <cgoncalves> sorry, L342
12:44:45 <yujunz> Where can I find the openrc for the devstack under test?
12:44:56 <cgoncalves> ~/devstack/openrc
12:45:15 <yujunz> also on node13?
12:45:21 <cgoncalves> yup
12:45:32 <cgoncalves> node13 is the controller; node9 is compute
12:46:37 <cgoncalves> I'm very interested in profiling the servers.reset_state API, especially because it returns an HTTP 202 (accepted) rather than 200 (ok/processed)
12:49:15 <yujunz> OK. I'll check it as soon as possible.
12:50:06 <cgoncalves> much appreciated!
13:13:28 <yujunz> Got something cgoncalves
13:13:52 <yujunz> Check https://openzero-team.github.io/doctor-perf/ for a command line result
13:16:05 <cgoncalves> yujunz: thanks! \o/
13:16:14 <cgoncalves> this is from multiple server reset states?
13:16:20 <yujunz> It is one
13:16:29 <yujunz> Please check from the last but one wsgi call
13:16:40 <cgoncalves> it took 1521ms??? :O
13:16:42 <yujunz> Which is the reset state action
13:17:02 <yujunz> Yes, as I am running from the command line, there seem to be many rounds of auth
13:17:15 <yujunz> For the api cost
13:17:25 <yujunz> Check the last level 1 wsgi nova
13:17:36 <yujunz> It is 311ms
13:17:59 <cgoncalves> yup, 311ms
13:18:19 <cgoncalves> we cannot break it further from there?
13:18:35 <cgoncalves> nevermind
13:20:03 <yujunz> There is still some gap not covered by profiler
13:20:18 <yujunz> Currently, only API, RPC and database access is covered
13:20:20 <cgoncalves> exactly! I was just about to say that
13:20:44 <cgoncalves> osprofiler covers ~150ms but not the remaining time
13:21:15 <yujunz> The rest could be done by patching nova and keystone classes
13:21:23 <yujunz> I'll check tomorrow.
13:21:59 <cgoncalves> I was so surprised with this that I even tried on newton
13:22:09 <cgoncalves> yujunz: thanks a lot!
13:22:25 <cgoncalves> with newton, nova took the same time
13:23:26 <yujunz> OK, bye. Going home now
13:24:07 <cgoncalves> see you
03:48:44 <tojuvone> cgoncalves, yujunz Hi, I should be around today. I have been a bit out of office the past week.
05:59:08 <yujunz> cgoncalves we need to restack devstack@node13 to enable profiler. Could you try it to see if the documentation in doctor/devstack is enough for user to follow?
05:59:44 <yujunz> ICYMI profiler for reset state on http://doctor.surge.sh/
05:59:58 <yujunz> ping tojuvone
06:00:20 <tojuvone> yujunz, hi
06:00:50 <yujunz> Just to repost the link of profiler we did manually yesterday http://doctor.surge.sh/
06:01:18 <yujunz> Note the last 311ms from reset state
06:05:25 <tojuvone> so this is all from "reset server state"
06:06:09 <yujunz> from cli command
06:06:28 <tojuvone> When I added prints to the API..
06:06:56 <tojuvone> it was under 30ms for what happened before doing notification related stuff
06:07:12 <yujunz> Yes, I believe that is the right way to go
06:07:13 <tojuvone> and 100ms for doing that
06:07:25 <yujunz> Why do we need to set the state and let the controller send the alarm?
06:08:02 <tojuvone> we do not
06:08:06 <tojuvone> but we do
06:08:34 <tojuvone> that is exactly why I made the sample code for sending notification from inspector
06:10:19 <tojuvone> calling reset server state should in general not be needed
06:10:54 <cgoncalves> morning guys. I have just woken up. give me 1 hour to get to the office and I will get back to this
06:11:03 <tojuvone> force down will be seen in servers API and we can send notif from inspector to have alarm
06:11:26 <cgoncalves> yujunz: feel free to restart any service
06:11:38 <tojuvone> but currently the Doctor way is to have reset server state -> notif -> alarm
06:12:49 <tojuvone> btw, currently working with Horizon for our demo. Got the tags showing there :)
06:13:03 <cgoncalves> tojuvone: correct. it has its reasons and logically it is what makes more sense doing it that way
06:13:24 <yujunz> OK. Let's discuss when cgoncalves gets to the office
06:13:39 <yujunz> Did you get woken up by an alert from IRC, cgoncalves?
06:13:44 <cgoncalves> we can have a quick call if you're available
06:14:31 <cgoncalves> yujunz: haha no. weather is too hot these days
06:14:34 <tojuvone> inspector notif just has the downside that alarm might reach user before nova is actually forced down.
06:14:50 <cgoncalves> tojuvone: exactly
06:15:19 <cgoncalves> tojuvone: thing is this only happens to server reset state
06:15:32 <tojuvone> cgoncalves, yujunz Yes, let's continue later when cgoncalves is at the office
06:15:41 <yujunz> OK, I will restack node13 to enable profiler. Hope everything goes well.
06:15:55 <cgoncalves> and we cannot speed up notif because there is no instanceupdate.start event
06:16:13 <cgoncalves> whereas other api calls have
06:16:26 <yujunz> Then I shall decide whether to go deep into reset state or continue on doctor profiler
06:16:59 * cgoncalves disconnects from mobile ssh+screen+irssi now :)
07:08:56 <cgoncalves> yujunz, tojuvone: I'm back
07:10:02 <tojuvone> cgoncalves, welcome :) I need to go in 3mins or so. will be back in 1½h
07:10:58 <cgoncalves> tojuvone: no problem
07:11:52 <cgoncalves> yujunz: have you restacked? otherwise I can do it now
07:11:59 <yujunz> Done
07:12:26 <yujunz> Writing some instruction to run osprofiler
07:12:30 <yujunz> The default hint from osprofiler-cli is somehow misleading
07:12:51 <cgoncalves> have you created the nova cell and discovered hosts?
07:13:05 <yujunz> No
07:13:41 <yujunz> I just ran devstack. Not quite clear about the post setup steps...
07:13:59 <cgoncalves> ok, it is needed in ocata+devstack
07:14:21 <cgoncalves> I'll restack node9 and run post-stack
07:14:26 <yujunz> Any instructions ?
07:14:56 <cgoncalves> <controller> $ nova-manage cell_v2 create_cell --verbose --name cell1
07:15:02 <cgoncalves> and after stacking all computes:
07:15:10 <cgoncalves> <controller> $ nova-manage cell_v2 discover_hosts
07:18:55 <yujunz> Could you take it over? I'm not quite familiar with it, to be honest...
07:19:45 <cgoncalves> sure, doing it now
07:20:28 <cgoncalves> yujunz: done
07:20:37 <cgoncalves> you can run osprofiler now
07:21:30 <yujunz> Thanks
07:21:53 <yujunz> Shall we have a discussion on the priority? cgoncalves tojuvone
07:22:16 <cgoncalves> +1
07:22:58 <yujunz> Currently, I have two major tasks. First is to add profiler support for the doctor sample inspector/monitor/consumer so we can have a whole view of the fault management
07:23:22 <yujunz> Second is to go deep into nova reset state, which seems to be the current bottleneck
07:24:26 <yujunz> It seems whether reset state is required or not is still under discussion. So maybe I can work on doctor profiler first?
07:25:18 <cgoncalves> yujunz: for me it is measuring total notification time with different combinations (# vms, inspectors, source for triggering alarm)
07:25:22 <cgoncalves> yujunz: yes
07:25:55 <cgoncalves> I also still need to extend congress to send event notification to the bus
07:26:26 <cgoncalves> I've already tested the congress parallel actions: working
07:26:32 <yujunz> OK. "source for triggering alarm" you mean set state vs send alarm from inspector ?
07:27:10 <cgoncalves> and I'd like to try using the neutron port data plane status API and compare
07:27:35 <cgoncalves> yujunz: yes
07:27:55 <cgoncalves> conceptually speaking, triggering the alarm from the inspector instead of from the controller (nova/neutron/cinder/...)
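
The combinations cgoncalves lists (number of VMs, inspector, source for triggering the alarm) form a simple test matrix. A minimal sketch follows. `VM_COUNT` and `INSPECTOR_TYPE` with the values `sample` and `congress` come from the chat. The VM counts, the `ALARM_SOURCE` key and its two labels are hypothetical and do not correspond to known `run.sh` variables.

```python
from itertools import product

VM_COUNTS = [1, 10, 100]                     # hypothetical values, not from the chat
INSPECTOR_TYPES = ["sample", "congress"]     # the two inspectors named in the chat
ALARM_SOURCES = ["controller", "inspector"]  # hypothetical labels for the two paths


def benchmark_matrix():
    """Return every (VM count, inspector, alarm source) combination to measure."""
    return [
        {"VM_COUNT": count, "INSPECTOR_TYPE": inspector, "ALARM_SOURCE": source}
        for count, inspector, source in product(
            VM_COUNTS, INSPECTOR_TYPES, ALARM_SOURCES
        )
    ]


matrix = benchmark_matrix()
for combo in matrix:
    print(combo)
```

Here `controller` stands for the current path (reset server state -> notif -> alarm) and `inspector` for sending the notification from the inspector directly.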
07:28:08 <yujunz> OK. Sounds good that we don't have much overlap :-)
07:28:11 <cgoncalves> so I also still have lots to cover
07:28:20 <cgoncalves> right :)
07:28:36 <cgoncalves> problem is if/how we share the same platform
07:29:09 <cgoncalves> I'd perhaps recommend each of us work on their tasks in their own PODs and then integrate in one POD
07:29:32 <yujunz> Works for me
07:30:12 <cgoncalves> great. you can use NEC POD for the next couple of hours if you want. I need to take care of other things this morning
07:31:03 <yujunz> OK. I added local.conf under version control on NEC pod
07:31:16 <yujunz> So that we can track the changes that have been made
07:36:30 <cgoncalves> great, thanks
07:37:11 <cgoncalves> yujunz: why did you enable panko?
07:37:33 <yujunz> It is the default backend for storing profiler data
07:37:51 <yujunz> Although I switched it to redis, it is still required in installation
07:38:01 <yujunz> switched backend to redis
07:38:05 <cgoncalves> ah ok
07:38:25 <cgoncalves> you mentioned that in your email but I had forgotten
07:38:41 <yujunz> I'll put some comments in local.conf :-)
08:41:44 <tojuvone> cgoncalves, yujunz back. So I have just worked on Horizon for the demo, but got the tags support now. Then there is the piece of code for having notif from inspector and alarm from that.
08:43:01 <yujunz> OK. I think I saw the code piece in email. Maybe you can push it to a personal git repo so we can fetch and use if needed ?
08:43:46 <tojuvone> yujunz, yes, I can look into that
08:45:46 <tojuvone> just the notif from inspector still has hard coding and a manual step to change one yaml file.
08:47:53 <cgoncalves> ok
08:48:59 <cgoncalves> yujunz: I missed the doctor meeting this week. I read the minutes and it seems that wenjuan is now proposing refactoring the tests in ansible?
08:49:23 <yujunz> Using the python modules in ansible, exactly
08:50:08 <cgoncalves> so what happened to the python effort?
08:50:26 <yujunz> Most operations are well supported in ansible
08:50:28 <yujunz> And it just lacks an alarm module
08:51:13 <cgoncalves> hmm. was there any decision made?
08:51:23 <cgoncalves> otherwise I'd propose discussing that in Beijing
08:51:56 <yujunz> No, we are still investigating the pros and cons and will bring it to the summit
08:51:59 <cgoncalves> and re-titling the design session 'overview of python refactor' slightly differently
08:52:04 <cgoncalves> ok, great
08:52:29 <yujunz> Yes, that's the idea. Instead of pointing out the solution. Let's have a look at what problem we are trying to solve
08:52:36 <yujunz> And what is the best solution
08:52:48 <cgoncalves> couldn't agree more with you!
08:53:43 <cgoncalves> leveraging tempest could also be a good idea
08:58:46 <yujunz> I haven't looked into tempest much yet. Will you give some insights about tempest in the summit session?
09:06:21 <cgoncalves> I'm also not an expert but I can say a few words yes
09:06:36 <cgoncalves> Ryota should be more experienced in that regard, I guess
09:07:05 <cgoncalves> it could also be a mix of ansible+tempest. let's discuss that in 2 weeks :)
09:09:01 <yujunz> Sounds good. Looking forward to the discussion
03:44:44 <tojuvone> put the "inspector notification" stuff here: https://github.com/tojuvone/doctor-notif
13:02:22 <collabot> r-mibu: Error: Can't start another meeting, one is in progress.  Use #endmeeting first.
13:02:30 <r-mibu> #endmeeting