15:58:57 <trevor_intel1> #startmeeting OPNFV Pharos
15:58:57 <collabot> Meeting started Wed Feb 17 15:58:57 2016 UTC.  The chair is trevor_intel1. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:57 <collabot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:58:57 <collabot> The meeting name has been set to 'opnfv_pharos'
15:59:07 <trevor_intel1> #info Trevor Cooper
15:59:54 <trevor_intel1> #topic roll call
16:01:33 <trevor_intel1> jose_lausuch: ping
16:01:42 <trevor_intel1> narindergupta: ping
16:01:49 <narindergupta> #info Narinder Gupta
16:01:51 <jose_lausuch> trevor_intel1: pong
16:02:18 <trevor_intel1> Hey guys ... can we get going?
16:02:36 <trevor_intel1> #topic JOID testing
16:02:57 <jmorgan1> #info Jack Morgan
16:03:00 <jose_lausuch> is there a gotomeeting?
16:03:04 <narindergupta> trevor_intel we are seeing some performance issues in pod5 and pod6
16:03:04 <jose_lausuch> or irc only
16:03:16 <trevor_intel1> jose_lausuch: IRC only
16:03:18 <jose_lausuch> ok
16:03:23 <jose_lausuch> ya, as narindergupta said
16:03:28 <narindergupta> so there was a proposal to make orange pod2 the CI pod
16:03:32 <jose_lausuch> something is wrong there
16:03:41 <trevor_intel1> jose_lausuch: are the test failures consistent between POD 5 and 6?
16:03:43 <narindergupta> until we figure out issue in lab
16:03:43 <jose_lausuch> I need help to detect what causes the problems
16:04:08 <jose_lausuch> trevor_intel1 the problem we detected is that the jjob was timing out
16:04:09 <narindergupta> a bit different, but they take almost the same time
16:04:17 <jose_lausuch> then we increased the functest jjob timeout to 400min
16:04:25 <jose_lausuch> and the job finished, but it took 5 hrs
16:04:38 <jose_lausuch> I checked and rally was taking too much time to execute cinder tests
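(A minimal sketch of how the slow Cinder scenarios could be timed directly to confirm where the time goes, assuming Rally is reachable from the Functest environment; the task UUID is a placeholder and exact flags depend on the Rally version:)
    rally task list                          # find the UUID of the slow run
    rally task detailed --uuid <task-uuid>   # per-scenario durations, e.g. the CinderVolumes scenarios
    rally task report --out rally-report.html   # HTML report of the last run, useful for comparing pods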
16:04:53 <trevor_intel1> narindergupta: I saw this comment ... Narinder suggested that it could be because on intel pod5 and pod6 we do not have an extra hard disk, so we are using the OS disk
16:04:57 <jose_lausuch> narindergupta thinks it could be due to lack of disks in those servers
16:05:01 <jose_lausuch> yes
16:05:05 <jose_lausuch> that could be a reason
16:05:21 <jose_lausuch> but also, the authentication tests (keystone) took much longer than expected
16:05:23 <trevor_intel1> jose_lausuch: But this is a standard POD
16:05:31 <narindergupta> trevor_intel i am concerned about the tempest failures as well
16:05:48 <jose_lausuch> we got many more tempest failures than for orange-pod2 (almost nothing)
16:05:49 <jose_lausuch> so
16:06:02 <narindergupta> as we are seeing only a 20% test case pass rate, but for the orange pod 98% pass
16:06:05 <trevor_intel1> jmorgan1: how much storage on POD 5 and 6?
16:06:07 <jose_lausuch> we could see again if we have some network issues or something
16:06:09 <narindergupta> with the same joid code
16:06:27 <jose_lausuch> that is not morgan orange :)
16:06:36 <jose_lausuch> morgan is on vacation this week
16:06:40 <trevor_intel1> jose_lausuch: oops :(
16:06:49 <jmorgan1> trevor_intel1: they are pharos compliant minus the ssd
16:07:03 <trevor_intel1> jmorgan1: how much storage on POD 5 and 6?
16:07:22 <jmorgan1> trevor_intel1: 2TB
16:07:53 <trevor_intel1> jmorgan1: ok
16:08:06 <trevor_intel1> jose_lausuch: how much storage on the Orange POD?
16:08:11 <jose_lausuch> jmorgan1:are the NICs 1 or 10Gb?
16:08:19 <trevor_intel1> jose_lausuch: Does Orange POD use SSD?
16:08:19 <jmorgan1> jose_lausuch: both
16:08:22 <jose_lausuch> trevor_intel1:  dont know, as I dont have access to that pod
16:08:41 <narindergupta> trevor_intel yes they have an additional ssd per machine
16:08:54 <narindergupta> and we are using that for ceph block storage
16:09:21 <jose_lausuch> could that also cause the tempest problems? there are some related to cinder as well
16:09:22 <narindergupta> jose_lausuch: trevor_intel see my comment
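(To check narindergupta's theory that the OSDs on pod5/pod6 sit on the OS disk rather than a dedicated device, something along these lines could be run on the deployed cluster — a sketch, assuming admin access to the ceph nodes and a filestore-style layout:)
    ceph osd tree                  # which host and OSD id back each OSD
    df -h /var/lib/ceph/osd/*      # which block device each OSD data dir is mounted from
    lsblk -d -o NAME,ROTA,SIZE     # ROTA=0 means SSD, ROTA=1 means spinning disk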
16:09:42 <trevor_intel1> jmorgan1: Is it easy to add an SSD i.e. readily available and easy to configure?
16:10:02 <jmorgan1> trevor_intel1: i don't know, would have to look
16:10:08 <jose_lausuch> at least on pod5
16:10:25 <jose_lausuch> I don't know 100% if that would solve the problem, but worth a try...
16:10:28 <jmorgan1> but i don't understand, if a pod doesn't have an ssd your project fails?
16:10:46 <jmorgan1> jose_lausuch: its a lot of money to just guess
16:10:52 <jose_lausuch> yap
16:11:00 <jose_lausuch> storage issues
16:11:08 <trevor_intel1> jose_lausuch: I am reluctant to just jump to a different POD without any attempt to debug; IMO that sets a bad precedent
16:11:11 <jmorgan1> jose_lausuch: what about the performance issue? how is it related to networking?
16:11:42 <jose_lausuch> trevor_intel1: +1 if we weren't 1 week before releasing :)
16:11:44 <jmorgan1> jose_lausuch: for example, between nodes or between nodes and external system on internet?
16:11:45 <narindergupta> trevor_intel we will continue debugging, as the release is next week
16:12:08 <jose_lausuch> jmorgan1: I dont know yet
16:12:09 <narindergupta> so that we can consider another option until the B release.
16:12:37 <trevor_intel1> narindergupta: not sure I understand
16:12:40 <jose_lausuch> trevor_intel1: please look at the last 2 tables https://wiki.opnfv.org/functextnexttaks#brahmaputra_summary
16:12:52 <jose_lausuch> big differences between joid on different pods
16:12:54 <jose_lausuch> my question is
16:12:57 <jose_lausuch> what do we do here?
16:13:07 <jose_lausuch> troubleshoot, of course
16:13:18 <jose_lausuch> but...
16:13:53 <narindergupta> trevor_intel what i meant is we have the release next week, and with the intel pods we are not even close to 90%. we can keep debugging the issue, but if we could consider the orange pod just for the release it would be fantastic.
16:14:17 <jose_lausuch> +1
16:14:18 <narindergupta> and we will keep troubleshooting the intel pods for performance issues
16:14:35 <trevor_intel1> narindergupta: but there has been no debugging, or did I misunderstand?
16:14:47 <jose_lausuch> trevor_intel1: I started yesterday
16:14:53 <narindergupta> i was working with jose_lausuch on debugging it
16:15:26 <jose_lausuch> debugging it
16:15:28 <narindergupta> we found that increasing the timeout passes more tests, but tempest tests are still failing for some reason and we saw strange behavior
16:17:21 <trevor_intel1> jose_lausuch: If we were able to add an SSD today, how easy/quick would it be to tell if that fixes it?
16:17:24 <narindergupta> trevor_intel and we are testing on a daily basis, and a few issues were resolved during the process as well, so debugging has continued
16:17:51 <jose_lausuch> trevor_intel: narindergupta would have to configure MAAS to use them somehow, so a deployment+functest
16:18:12 <narindergupta> trevor_intel correct, we would know within a day
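(Once the SSDs are physically installed, a quick sanity check on each compute node before re-running the JOID deploy and Functest could look like this — illustrative only, device names will vary:)
    lsblk -d -o NAME,ROTA,SIZE,MODEL   # the new drives should appear with ROTA=0
    sudo smartctl -i /dev/sdb          # optional: confirm model/capacity (assumes smartmontools is installed; /dev/sdb is a placeholder)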
16:18:45 <trevor_intel1> narindergupta: do you have a day?
16:18:45 <trevor_intel1> narindergupta: given the circumstances
16:18:45 <jose_lausuch> jmorgan1: can you check the jumphost on intel pod 5?
16:18:48 <jose_lausuch> we have iptables rules rejecting icmp...
16:19:17 <narindergupta> we are running against a very tight timeline, but one day is ok
16:19:57 <narindergupta> jose_lausuch: but we need to decide fast as next week i will be out and i want to finalize the decision asap.
16:20:12 <narindergupta> jose_lausuch: is it something new you found?
16:20:36 <trevor_intel1> jmorgan1: by when can you find out re. SSD?
16:20:41 <jose_lausuch> narindergupta: not sure if its relevant, but we had issues with wrong iptables rules in lf-pod1 for apex..
16:21:16 <jmorgan1> jose_lausuch: i'm not sure what you expect me to check? we install the base OS and the pod users take over from there
16:21:27 <jmorgan1> trevor_intel1: I'm not sure what you are talking about
16:21:41 <trevor_intel1> jmorgan1: adding an SSD
16:21:59 <jmorgan1> trevor_intel1: so what do you want me to find out?
16:22:47 <trevor_intel1> narindergupta: is SSD needed on all nodes?
16:23:21 <narindergupta> trevor_intel no only on compute nodes
16:23:32 <trevor_intel1> narindergupta: so 2 of the nodes?
16:24:01 <narindergupta> and we use at least two replicas, so 2 per compute node and 4 total across the two nodes
16:24:12 <narindergupta> correct
16:24:44 <trevor_intel1> narindergupta: is that what is in the Orange POD?
16:25:01 <narindergupta> in the orange pod all nodes have an ssd
16:25:10 <trevor_intel1> narindergupta: in the spec we say 1x 100GB SSD
16:25:14 <narindergupta> attached, including the controller nodes
16:25:22 <trevor_intel1> narindergupta: but x2 in each node?
16:26:20 <narindergupta> replicas are two in ceph, and the orange pod has two ssds in the computes and one in the control node.
16:26:40 <narindergupta> but the requirement is only on the computes, as per my bundle
16:27:00 <trevor_intel1> narindergupta: ok so it seems we need 2 SSDs in each compute node for a total of 4 (as you said earlier)
16:27:21 <narindergupta> correct
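(The replica count and pool layout narindergupta describes could be confirmed on a running pod with something like the following — a sketch, assuming admin access to the ceph cluster:)
    ceph osd dump | grep "replicated size"   # replica count per pool (expected: 2)
    ceph df                                  # capacity and usage per pool backed by the SSDs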
16:27:53 <trevor_intel1> jmorgan1: the question is how quickly can you find and attach 4 SSDs to POD 5?
16:28:22 <trevor_intel1> jmorgan1: This is a debugging exercise so could be borrowed
16:28:37 <narindergupta> jmorgan1: correct, if not today then we do not have time to look into it further, as i am traveling to Barcelona for MWC 2016 on Sat.
16:28:59 <fdegir> narindergupta: how many scenarios do you plan to release next week
16:29:05 <fdegir> 3?
16:30:08 <fdegir> it seems it takes up to 4 hours per run on orange pod
16:30:22 <jmorgan1> narindergupta: so if we add ssd today, you wont be able to properly troubleshoot this then?
16:30:26 <narindergupta> fdegir: 3 yes
16:30:28 <fdegir> 4 runs x 3 scenarios x 4 hours = 48 hours
16:30:42 <narindergupta> odl_l2, nosdn, onos
16:30:43 <fdegir> then we have almost that much left before you leave for MWC
16:30:56 <jose_lausuch> fdegir: assuming that adding SSDs solves the time problem...
16:31:11 <fdegir> I'm not assuming anything
16:31:15 <fdegir> just doing the math
16:31:18 <jose_lausuch> :)
16:31:34 <jmorgan1> trevor_intel1: i would have to drop everything else on my todo list to go investigate this
16:31:36 <fdegir> and 48 hours is for if everything goes totally fine
16:31:54 <narindergupta> trevor_intel it seems time-wise it's really tight
16:31:54 <fdegir> and narindergupta needs to do some documentation I suppose
16:31:57 <trevor_intel1> Fatih: So if we declare Orange POD production stable ... they are already done?
16:32:08 <fdegir> trevor_intel1: don't know the details
16:32:25 <fdegir> just pointing the obvious
16:32:28 <narindergupta> fdegir: yeah i started looking into it. unfortunately i can not replicate it myself :(
16:32:35 <jose_lausuch> trevor_intel1 I think we need to enable more scenarios; only odl-l2 is running as far as I've seen
16:32:39 <jose_lausuch> but maybe I'm wrong
16:32:48 <trevor_intel1> jose_lausuch: So if we declare Orange POD production stable ... Joid is already good to release?
16:32:53 <narindergupta> jose_lausuch: we have nosdn ran once
16:33:04 <jose_lausuch> but looking at this: https://build.opnfv.org/ci/job/joid-deploy-orange-pod2-daily-master/
16:33:06 <narindergupta> jose_lausuch: today onos is also running
16:33:06 <trevor_intel1> jose_lausuch: ok I see
16:33:14 <jose_lausuch> trevor_intel1 not yet
16:35:28 <trevor_intel1> narindergupta: Fatih's math seems to indicate there is no slack (not even 1 day)?
16:35:53 <narindergupta> trevor_intel yeah, especially since i am leaving for MWC this Saturday
16:37:10 <narindergupta> trevor_intel i did not do any math
16:37:27 <trevor_intel1> Fatih: seems this is a releng decision?
16:37:35 <fdegir> :)
16:37:43 <trevor_intel1> Fatih: glad to see a smile!
16:37:52 <fdegir> it should be joint decision between pharos and releng
16:38:32 <trevor_intel1> I think we all agree it's a bad precedent to just use "what works" for convenience.
16:38:39 <jose_lausuch> what I can say is that orange-pod2 provides good results but only for 1 scenario for the moment
16:38:49 <fdegir> narindergupta: how confident are you that at least 1 of the scenarios will pass on orange pod?
16:39:13 <fdegir> narindergupta: or in other words
16:39:15 <narindergupta> fdegir: 100%, as i have tested it once in the last couple of days
16:39:22 <jose_lausuch> what is decided is up to you guys :), im just a "functester" providing feedback about results
16:39:26 <fdegir> narindergupta: does moving to orange pod guarantee it?
16:39:27 <trevor_intel1> However ... if we all agree that Orange POD 2 meets the production criteria (is Pharos compliant) then it's a judgment call given the circumstances
16:39:47 <narindergupta> fdegir: yes, as we are seeing good results and all three scenarios pass
16:39:49 <fdegir> trevor_intel1: I had and still have some doubts about orange pod if I must be open
16:40:02 <fdegir> and David didn't clarify them before he left
16:40:04 <jose_lausuch> narindergupta: all 3 scenarios?
16:40:13 <fdegir> when orange pod was added to jenkins and jobs created for it
16:40:20 <jose_lausuch> narindergupta: I see only odl-l2 here https://build.opnfv.org/ci/job/joid-deploy-orange-pod2-daily-master/
16:40:21 <narindergupta> jose_lausuch: onos, nosdn and odl_l2
16:40:24 <fdegir> I remember there was some stuff locally available on the pod, hardcoded
16:40:27 <fdegir> so
16:40:42 <fdegir> this is really tricky
16:40:51 <fdegir> and I vote 0 - sorry Trevor
16:41:19 <fdegir> we have the reality as you say
16:41:31 <fdegir> we can just close our eyes and accept it as it is
16:41:41 <fdegir> but how long can we do this
16:41:46 <fdegir> and based on what we told the rest
16:41:49 <fdegir> for example juniper
16:41:58 <narindergupta> jose_lausuch: https://build.opnfv.org/ci/view/joid/job/joid-os-nosdn-nofeature-ha-orange-pod2-daily-master/
16:41:58 <fdegir> I personally said no to them
16:42:10 <narindergupta> jose_lausuch: https://build.opnfv.org/ci/view/joid/job/joid-os-onos-nofeature-ha-orange-pod2-daily-master/
16:42:54 <jose_lausuch> only 1 run and the other still running
16:42:55 <fdegir> so I don't say yes to this
16:43:07 <jose_lausuch> so as I Said, we still dont have numbers for those scenarios (yet)
16:43:14 <narindergupta> jose_lausuch: yeah, one run, but what i am saying is it passed at least once
16:43:20 <jose_lausuch> ok
16:43:32 <narindergupta> jose_lausuch: i am planning to run more on the pod though
16:44:40 <jose_lausuch> narindergupta: that's fine, but it seems we don't have consensus to bring that pod in as the CI pod
16:44:41 <trevor_intel1> narindergupta: does it make sense to keep working in parallel (on both PODs) or are you out of bandwidth?
16:45:56 <narindergupta> trevor_intel i can keep working on both, but i am the lone person working everywhere. I can try my best. But we should keep the option open to make joid part of the B release
16:47:11 <trevor_intel1> narindergupta: agree, this is a difficult situation
16:47:53 <fdegir> I say we keep jobs on Intel POD5 & POD6 as they are
16:48:08 <fdegir> increase the run frequency on Orange POD
16:48:14 <fdegir> and run it against stable
16:48:15 <trevor_intel1> so on the Intel side we need to add the SSDs ASAP (today)
16:48:16 <narindergupta> fdegir: orange pod2 is integrated with jenkins. I would like to see what was hardcoded so we can work with the orange team to rectify that as well
16:48:20 <fdegir> as trevor_intel1 says
16:48:27 <fdegir> we monitor Intel PODs
16:50:44 <trevor_intel1> jmorgan1: Please can you drop your other tasks and help to debug the Joid issue (i.e. next step add 2x SSD to each of the 2 compute nodes in POD 5)?
16:50:47 <fdegir> narindergupta: I really don't remember
16:50:52 <fdegir> looking at patches
16:51:26 <narindergupta> fdegir: thanks
16:52:00 <narindergupta> trevor_intel once we have that then i can rerun the test pretty quickly.
16:52:16 <trevor_intel1> narindergupta: ok
16:52:29 <narindergupta> jose_lausuch: regarding the icmp issue in the lab, how do we resolve it?
16:52:51 <jmorgan1> trevor_intel1: i need a JIRA task that says specifically which nodes need ssd assuming I can find them today
16:53:11 <trevor_intel1> jmorgan1: ok great!
16:53:48 <trevor_intel1> narindergupta: need to be clear about which 2 nodes ... how will Jack know? I can enter a Jira task
16:53:53 <jose_lausuch> narindergupta: I removed 2 rules:  sudo iptables -D FORWARD -i virbr0 -j REJECT --reject-with icmp-port-unreachable   and also with "-o"
16:54:06 <jose_lausuch> we had to do the same in lf-pod1, but I dont know if it causes the same issue..
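(For reference, a hedged way to check whether the same libvirt-generated REJECT rules are present on a jumphost before deleting them — rule details can differ per host:)
    sudo iptables -L FORWARD -n -v --line-numbers | grep -i reject
    virsh net-dumpxml default   # libvirt re-adds these rules with the virbr0 default network, so the network definition may need a look too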
16:54:23 <narindergupta> trevor_intel yeah, i am looking into it and will give you the RMM ip addresses
16:54:40 <trevor_intel1> narindergupta: ok i will add to the Jira task
16:55:33 <narindergupta> trevor_intel it is 10.4.7.5 and 10.4.7.4
16:55:54 <narindergupta> which are node 4 and node 5
16:56:21 <trevor_intel1> narindergupta: thanks and what is the smallest capacity for each drive?
16:56:30 <trevor_intel1> narindergupta: 100GB?
16:56:36 <narindergupta> trevor_intel yes, will do
16:56:59 <narindergupta> trevor_intel i do not think the tests are using that
16:57:05 <narindergupta> much storage capacity
16:57:27 <narindergupta> trevor_intel so 100gb should be ok
16:58:10 <trevor_intel1> narindergupta: ok ... so we have a way forward that you can live with (at least for today)?
16:58:21 <jmorgan1> im not sure why you are asking, i think any size ssd will be fine for troubleshooting this issue
16:59:28 <narindergupta> trevor_intel yes thanks
17:00:18 <trevor_intel1> lets get on ... thanks everybody ... report back in the release stand-up meeting tomorrow?
17:00:37 <narindergupta> trevor_intel sounds good
17:00:41 <trevor_intel1> #endmeeting