15:58:57 #startmeeting OPNFV Pharos
15:58:57 Meeting started Wed Feb 17 15:58:57 2016 UTC. The chair is trevor_intel1. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:57 Useful Commands: #action #agreed #help #info #idea #link #topic.
15:58:57 The meeting name has been set to 'opnfv_pharos'
15:59:07 #info Trevor Cooper
15:59:54 #topic roll call
16:01:33 jose_lausuch: ping
16:01:42 narindergupta: ping
16:01:49 #info Narinder Gupta
16:01:51 trevor_intel1: pong
16:02:18 Hey guys ... can we get going?
16:02:36 #topic JOID testing
16:02:57 #info Jack Morgan
16:03:00 is there a gotomeeting?
16:03:04 trevor_intel we are seeing some performance issues in pod5 and pod6
16:03:04 or irc only
16:03:16 jose_lausuch: IRC only
16:03:18 ok
16:03:23 ya, as narindergupta said
16:03:28 so there was a proposal to make orange pod2 a CI pod
16:03:32 something is wrong there
16:03:41 jose_lausuch: are the test failures consistent between POD 5 and 6?
16:03:43 until we figure out the issue in the lab
16:03:43 I need help to detect what causes the problems
16:04:08 trevor_intel1 the problem we detected is that the jjob was timing out
16:04:09 a bit different, but they take almost the same time
16:04:17 then we increased the functest jjob timeout to 400 min
16:04:25 and the job finished, but it took 5 hrs
16:04:38 I checked and rally was taking too much time to execute cinder tests
16:04:53 narindergupta: I saw this comment ... Narinder suggested that it could be because on intel pod5 and pod6 we do not have an extra hard disk so we are using the OS disk
16:04:57 narindergupta thinks it could be due to lack of disks in those servers
16:05:01 yes
16:05:05 that could be a reason
16:05:21 but also, the authentication tests (keystone) took much longer than expected
16:05:23 jose_lausuch: But this is a standard POD
16:05:31 trevor_intel i am concerned about tempest failures as well
16:05:48 we got many more tempest failures than for orange-pod2 (almost nothing there)
16:05:49 so
16:06:02 as we are seeing only 20% of test cases pass, but for orange pod 98% pass
16:06:05 jmorgan1: how much storage on POD 5 and 6?
16:06:07 we could check again whether we have some network issues or something
16:06:09 with the same joid code
16:06:27 that is not morgan_orange :)
16:06:36 morgan is on vacation this week
16:06:40 jose_lausuch: oops :(
16:06:49 trevor_intel1: they are pharos compliant minus the ssd
16:07:03 jmorgan1: how much storage on POD 5 and 6?
16:07:22 trevor_intel1: 2TB
16:07:53 jmorgan1: ok
16:08:06 jose_lausuch: how much storage on the Orange POD?
16:08:11 jmorgan1: are the NICs 1 or 10Gb?
16:08:19 jose_lausuch: Does the Orange POD use SSD?
16:08:19 jose_lausuch: both
16:08:22 trevor_intel1: don't know, as I don't have access to that pod
16:08:41 trevor_intel yes, they have an additional ssd per machine
16:08:54 and we are using that for ceph block storage
16:09:21 could that also cause the tempest problems? there are some related to cinder as well
16:09:22 jose_lausuch: trevor_intel see my comment
16:09:42 jmorgan1: Is it easy to add an SSD, i.e. readily available and easy to configure?
16:10:02 trevor_intel1: i don't know, would have to look
16:10:08 at least on pod5
16:10:25 I don't know 100% if that would solve the problem, but worth a try...
16:10:28 but i don't understand, if a pod doesn't have an ssd your project fails?
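[Editor's note: the "OS disk only" theory discussed above could be checked with ordinary I/O statistics while functest/rally runs. The snippet below is only a sketch; the sysstat tooling is an assumption and was not mentioned in the meeting.]

```bash
# Hedged sketch: watch disk utilisation on a pod5/pod6 compute node while
# the rally cinder scenarios run. Assumes the sysstat package is available.
sudo apt-get install -y sysstat

# Extended per-device stats every 5 seconds; a single near-100% %util disk
# with high await during the cinder tests would support the "no extra disk,
# everything on the OS disk" theory.
iostat -dx 5
```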
16:10:46 jose_lausuch: it's a lot of money to just guess
16:10:52 yep
16:11:00 storage issues
16:11:08 jose_lausuch: I am reluctant to just jump to a different POD without any attempt to debug; IMO that sets a bad precedent
16:11:11 jose_lausuch: what about the performance issue? how is it related to networking?
16:11:42 trevor_intel1: +1 if we were not 1 week before releasing :)
16:11:44 jose_lausuch: for example, between nodes or between nodes and an external system on the internet?
16:11:45 trevor_intel we will continue debugging, as the release is next week
16:12:08 jmorgan1: I don't know yet
16:12:09 so that we can consider another option until B release.
16:12:37 narindergupta: not sure I understand
16:12:40 trevor_intel1: please look at the 2 last tables https://wiki.opnfv.org/functextnexttaks#brahmaputra_summary
16:12:52 big differences between joid on different pods
16:12:54 my question is
16:12:57 what do we do here?
16:13:07 troubleshoot, of course
16:13:18 but...
16:13:53 trevor_intel what i meant is we have the release next week and with that pod we are not even close to 90% on the intel pods. we can keep debugging the issue, but if we could consider the orange pod just for the release that would be fantastic.
16:14:17 +1
16:14:18 and we will keep troubleshooting the intel pods for the performance issues
16:14:35 narindergupta: but there has been no debugging, or did I misunderstand?
16:14:47 trevor_intel1: I started yesterday
16:14:53 jose_lausuch was working with me on debugging it
16:15:28 as found out, increasing the timeout passes more tests but tempest tests are still failing for some reason and we saw strange behavior
16:17:21 jose_lausuch: If we were able to add an SSD today, how easy/quick is it to tell if that fixes it?
16:17:24 trevor_intel and we are testing on a daily basis and a few issues were resolved during the process as well. so debugging continued
16:17:51 trevor_intel: narindergupta would have to configure MAAS to use them somehow, so a deployment+functest
16:18:12 trevor_intel correct, we would know within a day
16:18:45 narindergupta: do you have a day?
16:18:45 narindergupta: given the circumstances
16:18:45 jmorgan1: can you check the jumphost on intel pod 5?
16:18:48 we have iptables rules rejecting icmp...
16:19:17 we are running against a very tight timeline but one day is ok
16:19:57 jose_lausuch: but we need to decide fast as next week i will be out and i want to finish the decision asap.
16:20:12 jose_lausuch: is it something new you found?
16:20:36 jmorgan1: by when can you find out re. SSD?
16:20:41 narindergupta: not sure if it's relevant, but we had issues with wrong iptables rules in lf-pod1 for apex..
16:21:16 jose_lausuch: i'm not sure what you expect me to check? we install the base OS and the pod users take over from there
16:21:27 trevor_intel1: I'm not sure what you are talking about
16:21:41 jmorgan1: adding an SSD
16:21:59 trevor_intel1: so what do you want me to find out?
16:22:47 narindergupta: is SSD needed on all nodes?
16:23:21 trevor_intel no, only on compute nodes
16:23:32 narindergupta: so 2 of the nodes?
16:24:01 and we use at least two replicas, so 2 per compute node and 4 total across the two nodes
16:24:12 correct
16:24:44 narindergupta: is that what is in the Orange POD?
16:25:01 orange pod all nodes have ssd
16:25:10 narindergupta: in the spec we say 1x 100GB SSD
16:25:14 attached, including the controller nodes
16:25:22 narindergupta: but x2 in each node?
16:26:20 replicas are two in ceph and the orange pod has two in computes and one in control.
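[Editor's note: a minimal sketch, assuming admin access to the JOID-deployed Ceph cluster, of how the replica count and OSD placement discussed above could be confirmed. The "cinder-ceph" pool name is only an example, not taken from the meeting.]

```bash
# Hedged sketch: confirm Ceph replication and where the OSDs actually live.
# Run on a node with a working ceph admin keyring.

# Cluster health and OSD layout: shows whether OSDs sit on the compute
# nodes' extra SSDs or share the OS disk.
sudo ceph -s
sudo ceph osd tree

# Replica count per pool; "size 2" would match the two-replica setup
# described for the Orange pod (pool name below is hypothetical).
sudo ceph osd lspools
sudo ceph osd pool get cinder-ceph size
```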
16:26:40 but the requirement is only on compute as per my bundle
16:27:00 narindergupta: ok, so it seems we need 2 SSDs in each compute node for a total of 4 (as you said earlier)
16:27:21 correct
16:27:53 jmorgan1: the question is how quickly can you find and attach 4 SSDs to POD 5?
16:28:22 jmorgan1: This is a debugging exercise so they could be borrowed
16:28:37 jmorgan1: correct; if not today then we do not have time to look into it further, as i am traveling to Barcelona for MWC 2016 on Sat.
16:28:59 narindergupta: how many scenarios do you plan to release next week
16:29:05 3?
16:30:08 it seems it takes up to 4 hours per run on the orange pod
16:30:22 narindergupta: so if we add the ssd today, you won't be able to properly troubleshoot this then?
16:30:26 fdegir: 3 yes
16:30:28 4 runs x 3 scenarios x 4 hours = 48 hours
16:30:42 odl_l2, nosdn, onos
16:30:43 then we have almost that much left before you leave for MWC
16:30:56 fdegir: assuming that adding SSDs solves the time problem...
16:31:11 I'm not assuming anything
16:31:15 just doing the math
16:31:18 :)
16:31:34 trevor_intel1: i would have to drop everything else on my todo list to go investigate this
16:31:36 and 48 hours is for if everything goes totally fine
16:31:54 trevor_intel it seems time-wise it's really tight
16:31:54 and narindergupta needs to do some documentation I suppose
16:31:57 Fatih: So if we declare the Orange POD production stable ... they are already done?
16:32:08 trevor_intel1: don't know the details
16:32:25 just pointing out the obvious
16:32:28 fdegir: yeah i started looking into it. unfortunately i cannot replicate myself :(
16:32:35 trevor_intel1 I think we need to enable more scenarios, I've only seen odl-l2 running
16:32:39 but maybe I'm wrong
16:32:48 jose_lausuch: So if we declare the Orange POD production stable ... Joid is already good to release?
16:32:53 jose_lausuch: nosdn has run once
16:33:04 but looking at this: https://build.opnfv.org/ci/job/joid-deploy-orange-pod2-daily-master/
16:33:06 jose_lausuch: today onos is also running
16:33:06 jose_lausuch: ok I see
16:33:14 trevor_intel1 not yet
16:35:28 narindergupta: Fatih's math seems to show there is no slack (not even 1 day)?
16:35:53 trevor_intel yeah, especially as i am leaving for MWC this Saturday
16:37:10 trevor_intel i did not do any math
16:37:27 Fatih: seems this is a releng decision?
16:37:35 :)
16:37:43 Fatih: glad to see a smile!
16:37:52 it should be a joint decision between pharos and releng
16:38:32 I think we all agree it's a bad precedent to just use "what works" for convenience.
16:38:39 what I can say is that orange-pod2 provides good results, but only for 1 scenario for the moment
16:38:49 narindergupta: how confident are you that at least 1 of the scenarios will pass on the orange pod?
16:39:13 narindergupta: or in other words
16:39:15 fdegir: 100% as i have tested it once in the last couple of days
16:39:22 what is decided is up to you guys :), i'm just a "functester" providing feedback about results
16:39:26 narindergupta: does moving to the orange pod guarantee it?
16:39:27 However ... if we all agree that Orange POD 2 meets the production criteria (is Pharos compliant) then it's a call given the circumstances
16:39:47 fdegir: yes, as we are seeing good results and all three scenarios pass
16:39:49 trevor_intel1: I had and still have some doubts about the orange pod if I must be open
16:40:02 and David didn't clarify them before he left
16:40:04 narindergupta: all 3 scenarios?
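[Editor's note: the scenario status quoted above comes from the Jenkins job pages linked in the discussion. A quick, hedged way to check the latest run without opening the UI is the standard Jenkins JSON API; the jq tool used below is an assumption.]

```bash
# Hedged sketch: query the last build of the orange-pod2 daily deploy job
# (URL taken from the meeting) via the Jenkins JSON API.
JOB="https://build.opnfv.org/ci/job/joid-deploy-orange-pod2-daily-master"
curl -s "${JOB}/lastBuild/api/json" | jq '{number, result, duration}'
```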
16:40:13 when the orange pod was added to jenkins and jobs were created for it
16:40:20 narindergupta: I see only odl-l2 here https://build.opnfv.org/ci/job/joid-deploy-orange-pod2-daily-master/
16:40:21 jose_lausuch: onos, nosdn and odl_l2
16:40:24 I remember there was some stuff locally available on the pod, hardcoded
16:40:27 so
16:40:42 this is really tricky
16:40:51 and I vote 0 - sorry Trevor
16:41:19 we have the reality as you say
16:41:31 we can just close our eyes and accept it as it is
16:41:41 but how long can we do this
16:41:46 and based on what we told the rest
16:41:49 for example juniper
16:41:58 jose_lausuch: https://build.opnfv.org/ci/view/joid/job/joid-os-nosdn-nofeature-ha-orange-pod2-daily-master/
16:41:58 I personally said no to them
16:42:10 jose_lausuch: https://build.opnfv.org/ci/view/joid/job/joid-os-onos-nofeature-ha-orange-pod2-daily-master/
16:42:54 only 1 run and the other is still running
16:42:55 so I don't say yes to this
16:43:07 so as I said, we still don't have numbers for those scenarios (yet)
16:43:14 jose_lausuch: yeah, one run, but what i am saying is it passed at least once
16:43:20 ok
16:43:32 jose_lausuch: i am planning to run more on that pod though
16:44:40 narindergupta: that's fine, but it seems we don't have consensus to bring that pod in as CI
16:44:41 narindergupta: does it make sense to keep working in parallel (both PODs) or are you out of bandwidth?
16:45:56 trevor_intel i can keep working on both but i am a lone person working everywhere. I can try my best. But we should keep the option open to make joid part of the B release
16:47:11 narindergupta: agree, this is a difficult situation
16:47:53 I say we keep jobs on Intel POD5 & POD6 as they are
16:48:08 increase the run frequency on the Orange POD
16:48:14 and run it against stable
16:48:15 so on the Intel side we need to add the SSDs ASAP (today)
16:48:16 fdegir: orange pod2 is integrated with jenkins. Would like to see what was hardcoded so we can work with the orange team to rectify that as well
16:48:20 as trevor_intel1 says
16:48:27 we monitor the Intel PODs
16:50:44 jmorgan1: Please can you drop your other tasks and help to debug the Joid issue (i.e. next step add 2x SSD to each of the 2 compute nodes in POD 5)?
16:50:47 narindergupta: I really don't remember
16:50:52 looking at patches
16:51:26 fdegir: thanks
16:52:00 trevor_intel once we have that then i can rerun the test pretty quickly.
16:52:16 narindergupta: ok
16:52:29 jose_lausuch: regarding the icmp issue in the lab, how do we resolve it?
16:52:51 trevor_intel1: i need a JIRA task that says specifically which nodes need ssd, assuming I can find them today
16:53:11 jmorgan1: ok great!
16:53:48 narindergupta: need to be clear about which 2 nodes ... how will Jack know? I can enter a Jira task
16:53:53 narindergupta: I removed 2 rules: sudo iptables -D FORWARD -i virbr0 -j REJECT --reject-with icmp-port-unreachable and also with "-o"
16:54:06 we had to do the same in lf-pod1, but I don't know if it causes the same issue..
16:54:23 trevor_intel yeah, i am looking into it and will give you the RMM IP addresses
16:54:40 narindergupta: ok, i will add them to the Jira task
16:55:33 trevor_intel it is 10.4.7.5 and 10.4.7.4
16:55:54 which are node4 and node5
16:56:21 narindergupta: thanks, and what is the smallest capacity for each drive?
16:56:30 narindergupta: 100GB?
16:56:36 trevor_intel yes, will do
16:56:59 trevor_intel i do not think so, the tests are not exercising that
16:57:05 much storage capacity
16:57:27 trevor_intel so 100gb should be ok
16:58:10 narindergupta: ok ... so we have a way forward that you can live with (at least for today)?
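[Editor's note: the commands below are based directly on the rules quoted at 16:53:53; the listing step is added only as a sketch of how to verify the rules are present on the jumphost before deleting them.]

```bash
# List the FORWARD chain on the jumphost to see the virbr0 REJECT rules
# that block ICMP.
sudo iptables -L FORWARD -v -n --line-numbers

# Remove the inbound and outbound REJECT rules (as quoted in the meeting);
# only run these if the exact rules are actually present.
sudo iptables -D FORWARD -i virbr0 -j REJECT --reject-with icmp-port-unreachable
sudo iptables -D FORWARD -o virbr0 -j REJECT --reject-with icmp-port-unreachable
```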
16:58:21 i'm not sure why you are asking, i think any size ssd will be fine for troubleshooting this issue
16:59:28 trevor_intel yes, thanks
17:00:18 let's get on ... thanks everybody ... report back in the release stand-up meeting tomorrow?
17:00:37 trevor_intel sounds good
17:00:41 #endmeeting