07:04:48 <joehuang> #startmeeting multisite
07:04:48 <collabot> Meeting started Thu Aug 25 07:04:48 2016 UTC. The chair is joehuang. Information about MeetBot at http://wiki.debian.org/MeetBot.
07:04:48 <collabot> Useful Commands: #action #agreed #help #info #idea #link #topic.
07:04:48 <collabot> The meeting name has been set to 'multisite'
07:05:08 <joehuang> #topic rollcall
07:05:11 <joehuang> #info joehuang
07:05:31 <SAshish> #info Ashish
07:05:47 <May-meimei> #info meimei
07:06:11 <joehuang> #topic Unstable CI job running
07:06:37 <joehuang> hello, the CI job for functest is not stable
07:06:49 <sorantis> #info dimitri
07:07:10 <joehuang> do you have any proposal? functest thought it may be an issue with the SUT
07:07:37 <joehuang> Meimei mentioned a similar issue in compass; could you share the experience?
07:07:52 <May-meimei> ext-network | restart api server
07:08:11 <joehuang> what's ext-network?
07:08:53 <May-meimei> joehuang: yes, you can see we reset all the api services
07:08:54 <May-meimei> https://build.opnfv.org/ci/job/compass-deploy-baremetal-daily-master/466/console
07:09:17 <joehuang> may dimitri help to check whether all services in the controllers (two sites) work normally
07:09:18 <May-meimei> because of the unstable api
07:09:40 <SAshish> yes, it works properly.
He replied to that mail
07:09:43 <joehuang> you mean the API server reboots now and then, or doesn't work?
07:10:38 <sorantis> functest assumes that the SUT is a freshly installed system that will be removed after deployment
07:10:41 <joehuang> I saw the mail, but if you try openstack endpoint list many times, sometimes it doesn't work
07:10:46 <sorantis> this is not the case for multisite
07:11:00 <sorantis> as a result, functest is not careful with resource allocation/deallocation
07:11:45 <sorantis> I'm cleaning up the opnfv images now
07:11:57 <joehuang> if you try openstack endpoint list several times, the error "No service with a type, name or ID of '054f26bc26a949e1aeaccf0a2b932903' exists" will occur
07:12:41 <joehuang> it seems there is some unrecognized service_type for some registered endpoint
07:13:12 <joehuang> how to log on to the controller node?
07:13:26 <SAshish> can you login to the jumphost?
07:13:33 <sorantis> I've tried it multiple times and it works
07:13:37 <joehuang> yes, I can now
07:13:39 <sorantis> there's no such UUID in the list
07:13:51 <SAshish> Every 2.0s: openstack endpoint list    Thu Aug 25 08:13:23 2016
07:14:01 <joehuang> strange, this occurred this morning
07:14:08 <SAshish> No service with a type, name or ID of 'cc89788680214876a070c1fce9703650' exists.
07:14:21 <SAshish> this has occurred now
07:14:25 <SAshish> I had kept watch
07:14:29 <SAshish> watch openstack endpoint list
07:14:39 <SAshish> and got this response for one of the runs
07:15:54 <SAshish> maybe this is due to multiple registrations of the kb service
07:15:58 <joehuang> you mean you also met this error?
07:16:02 <SAshish> yes
07:16:08 <SAshish> I also met it just now
07:16:29 <joehuang> how many times for running openstack endpoint list?
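[Editor's note: the manual check above with `watch openstack endpoint list` could be automated to measure how often the command fails. A minimal sketch; the function name, run count, and silent output are illustrative assumptions, not part of functest.]

```shell
#!/bin/sh
# Run a command repeatedly and report how many runs failed, mimicking
# the "watch openstack endpoint list" check done by hand in the log.
count_failures() {
    runs=$1; shift
    failures=0
    n=0
    while [ "$n" -lt "$runs" ]; do
        # Discard output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        n=$((n + 1))
    done
    echo "$failures/$runs runs failed"
}

# Usage against the SUT, e.g.:
#   count_failures 10 openstack endpoint list
```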
07:16:29 <SAshish> anyhow, only if the service is not there do we try to register it
07:16:52 <SAshish> I had kept watch on it; maybe after 5-6 times
07:17:23 <joehuang> seems regular, every 5-6 times
07:18:27 <joehuang> if the request goes to one of the API servers behind haproxy, then the issue occurs
07:19:45 <joehuang> maybe there are some dead records in haproxy that lead to forwarding the request to a bad or non-existent api server
07:22:08 <May-meimei> SAshish: you can try restarting the api service; I am not sure it will work on fuel
07:22:50 <SAshish> and service unavailable with nova list also
07:22:52 <SAshish> okay
07:23:33 <SAshish> keystone?
07:23:44 <joehuang> you mean sometimes nova list also failed?
07:23:53 <SAshish> yes
07:25:04 <joehuang> you can check the records configured in haproxy, and make sure the APIs that need to be load balanced are alive
07:26:14 <joehuang> a better way is to use nova --debug list, so that we know which endpoint failed to respond
07:26:40 <SAshish> yeah, kept the same to check
07:26:51 <joehuang> or openstack --debug endpoint list to check which endpoint failed
07:26:59 <joehuang> great
07:27:22 <joehuang> can you tell me how to access the controller node? I can't ssh into the controller node
07:27:51 <SAshish> ssh root@10.20.0.2
07:27:59 <SAshish> from the jumphost, login to fuel
07:28:01 <SAshish> ssh root@10.20.0.2
07:28:10 <SAshish> password => r00tme
07:28:29 <SAshish> once you are on the fuel node
07:28:32 <SAshish> login to the controller
07:28:34 <SAshish> ssh 10.20.0.3
07:29:05 <SAshish> it doesn't need a password; you will land on the first controller
07:29:32 <joehuang> hi, found one
07:29:36 <joehuang> "Openstack Cloudformation Service", "name": "heat-cfn"}, {"id": "cc89788680214876a070c1fce9703650", "enabled": true, "type": "object-store", "description": "Openstack Object-Store Service", "name": "swift"}, {"id": "e07966459621474ab19231cc369e685a", "enabled": true, "type": "image", "description": "OpenStack Image Service", "name":
"glance"}]} 07:29:45 <joehuang> No service with a type, name or ID of 'e07966459621474ab19231cc369e685a' exists. 07:29:56 <joehuang> it's swift 07:30:23 <joehuang> http://192.168.0.2:35357/v2.0/ 07:30:39 <joehuang> the server give feedback from http://192.168.0.2:35357/v2.0/ 07:31:14 <joehuang> http://192.168.0.2:35357/v2.0/OS-KSADM/services 07:31:52 <joehuang> sorry not swift, but glance {"id": 07:31:52 <joehuang> "e07966459621474ab19231cc369e685a", "enabled": true, "type": "image", "description": "OpenStack Image Service", 07:31:55 <joehuang> "name": "glance"}]} 07:34:07 <joehuang> May we re-install the environment, I think it runs several week, may be not as clean as we had 07:34:54 <sorantis> this will also mean that the whole multi-region setup has to be reconfigured 07:35:37 <joehuang> your suggestion 07:36:12 <sorantis> use retry in the long run 07:38:09 <SAshish> has compass done some workaround? 07:38:11 <joehuang> retry should be a mechanism for all commands to SUT in functest 07:38:32 <sorantis> but only this one is causing delay 07:38:54 <sorantis> functest also checke separately nova, neutron, cinder 07:38:58 <sorantis> the commands pass 07:40:06 <joehuang> sometimes in glance image-list 07:40:22 <SAshish> I have noticed with nova list also once 07:40:29 <joehuang> yes 07:41:01 <joehuang> there are so many commands to SUT(system under test) more than that in check_os.sh 07:41:26 <joehuang> or we can skip the check_os and health check job 07:41:59 <sorantis> I’ve restarted cinder and nova api 07:42:00 <joehuang> to Meimei, can we disable check_os and health check? 07:44:18 <joehuang> to Dimitri, restard cinder/nova in both controller nodes or one? 
07:44:49 <sorantis> restarting rabbit
07:44:50 <sorantis> yes
07:44:57 <sorantis> on both
07:45:22 <joehuang> ok
07:46:55 <sorantis> feels much faster now that i restarted rabbit
07:48:32 <joehuang> this issue is still there: openstack endpoint list
07:48:33 <joehuang> No service with a type, name or ID of '87d59a31161b40aeb966a02e03beaf6d' exists.
07:52:36 <joehuang> need more time to find out why; let's work together to fix it
07:53:25 <SAshish> okay, so Joe, how did the release go?
07:53:27 <sorantis> restarted keystone
07:54:00 <sorantis> I checked the openstack bugs. apparently keystone responds slowly on the 'openstack' commands
07:55:15 <joehuang> release will be on Sept 22 for Colorado 1.0
07:55:33 <joehuang> we need to have a stable job running
07:55:49 <joehuang> it also helps our new feature development
07:56:54 <joehuang> even bug fixes need to make sure all test cases can pass in the daily job
07:57:26 <joehuang> ok, time is up; let's work offline to fix it
07:57:56 <SAshish> some bugs are there which are targeted for the next release
07:58:13 <SAshish> which should not have any effect on the current release
07:58:39 <sorantis> we have a new set of jenkins jobs
07:58:49 <sorantis> which one is in development focus?
08:00:14 <joehuang> what's the new set of jenkins jobs? you mean in OpenStack?
08:00:18 <joehuang> or OPNFV
08:00:37 <sorantis> opnfv
08:01:04 <joehuang> don't understand
08:01:47 <sorantis> we have now this
08:01:50 <joehuang> after keystone restart
08:01:56 <sorantis> multisite-kingbird-daily-colorado
08:02:08 <sorantis> multisite-kingbird-daily-master
08:02:28 <joehuang> o, both failed
08:02:45 <sorantis> they run in parallel
08:02:52 <sorantis> and I guess they've blocked each other
08:03:05 <sorantis> since they've been running for over 6hrs already
08:03:30 <joehuang> disable the colorado job; we only need to maintain the master daily job
08:03:55 <SAshish> then we need to have two deploy scripts?
08:04:05 <SAshish> parameterized deploy script
08:04:05 <joehuang> no, only the master one
08:04:26 <SAshish> then what about the colorado job
08:05:11 <joehuang> the colorado job is launched from the releng colorado branch, but we don't need that to run
08:05:21 <joehuang> the master one can verify the code
08:06:50 <sorantis> i cannot modify the jobs
08:07:00 <sorantis> ok, i have to go now
08:07:05 <sorantis> talk offline
08:07:05 <joehuang> may need help from Meimei
08:07:09 <joehuang> ok
08:07:14 <joehuang> #endmeeting
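[Editor's note: the dead-record suspicion about haproxy raised earlier in the meeting could be verified on a controller via haproxy's admin socket. A sketch; the helper name is invented, and the socket path in the usage comment is an assumption that varies per deployment (check the "stats socket" line in haproxy.cfg).]

```shell
#!/bin/sh
# Read haproxy "show stat" CSV on stdin and print any backend server
# whose status is not UP. In haproxy's CSV format, field 1 is the proxy
# name, field 2 the server name, and field 18 the status.
list_down_backends() {
    awk -F, 'NR > 1 && $2 !~ /FRONTEND|BACKEND/ && $18 != "UP" \
        { print $1 "/" $2 ": " $18 }'
}

# Usage on a controller (socket path is an assumption):
#   echo "show stat" | socat stdio /var/lib/haproxy/stats | list_down_backends
```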