07:04:48 <joehuang> #startmeeting multisite
07:04:48 <collabot> Meeting started Thu Aug 25 07:04:48 2016 UTC.  The chair is joehuang. Information about MeetBot at http://wiki.debian.org/MeetBot.
07:04:48 <collabot> Useful Commands: #action #agreed #help #info #idea #link #topic.
07:04:48 <collabot> The meeting name has been set to 'multisite'
07:05:08 <joehuang> #topic rollcall
07:05:11 <joehuang> #info joehuang
07:05:31 <SAshish> #info Ashish
07:05:47 <May-meimei> #info meimei
07:06:11 <joehuang> #topic Unstable CI job running
07:06:37 <joehuang> hello, the CI job for functest is not stable
07:06:49 <sorantis> #info dimitri
07:07:10 <joehuang> do you have any proposal? functest thought it may be an issue with the SUT
07:07:37 <joehuang> Meimei mentioned a similar issue in Compass, could you share the experience?
07:07:52 <May-meimei> ext-network | restart api server
07:08:11 <joehuang> what's ext-network?
07:08:53 <May-meimei> joehuang: yes, you can see we reset all the api services
07:08:54 <May-meimei> https://build.opnfv.org/ci/job/compass-deploy-baremetal-daily-master/466/console
07:09:17 <joehuang> could Dimitri help check whether all services on the controllers (both sites) work normally?
07:09:18 <May-meimei> because the unstable api
07:09:40 <SAshish> yes, it works properly. He replied to that mail
07:09:43 <joehuang> you mean the API server reboots now and then, or doesn't work?
07:10:38 <sorantis> functest assumes that SUT is a freshly installed system that will be removed after deployment
07:10:41 <joehuang> I saw the mail, but if you try openstack endpoint list many times, sometimes it doesn't work
07:10:46 <sorantis> this is not the case for multisite
07:11:00 <sorantis> as a result, functest is not careful with resource allocation/deallocation
07:11:45 <sorantis> I’m cleaning up the opnfv images now
07:11:57 <joehuang> if you try several times for openstack endpoint list, the error "No service with a type, name or ID of '054f26bc26a949e1aeaccf0a2b932903' exists" will occur
07:12:41 <joehuang> it seems there is an unrecognized service_type for some registered endpoint
07:13:12 <joehuang> how to logon to the controller node?
07:13:26 <SAshish> can you login to Jumphost?
07:13:33 <sorantis> I’ve tried it multiple times and it works
07:13:37 <joehuang> yes, I can now
07:13:39 <sorantis> there’s no such UUID in the list
07:13:51 <SAshish> Every 2.0s: openstack endpoint list        Thu Aug 25 08:13:23 2016
07:14:01 <joehuang> strange, this occurred this morning
07:14:08 <SAshish> No service with a type, name or ID of 'cc89788680214876a070c1fce9703650' exists.
07:14:21 <SAshish> this has occurred just now
07:14:25 <SAshish> I had kept watch
07:14:29 <SAshish> watch openstack endpoint list
07:14:39 <SAshish> and got this response for one of the run
07:15:54 <SAshish> maybe this is due to multiple registration of the kb service
07:15:58 <joehuang> you mean you also met this error
07:16:02 <SAshish> yes
07:16:08 <SAshish> I also met it just now
07:16:29 <joehuang> how many times did you run openstack endpoint list?
07:16:29 <SAshish> anyhow, we only try to register the service if it is not there
07:16:52 <SAshish> I had kept watch on it, maybe after 5-6 times
07:17:23 <joehuang> seems regular, every 5-6 times
07:18:27 <joehuang> if the request goes to one of the API servers behind haproxy, then the issue occurs
07:19:45 <joehuang> maybe there are some dead records in HAProxy that forward the request to a bad or non-existent API server
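The "fails every 5-6 runs" pattern above can be quantified with a small POSIX-shell helper instead of eyeballing `watch`; this is a sketch, not anything from functest, and the helper name and run count are illustrative:

```shell
# Run a command N times and report how often it fails.
# Useful for measuring an intermittent failure rate, e.g. one dead
# API backend behind a round-robin haproxy frontend.
count_failures() {
    runs=$1; shift
    fails=0
    i=1
    while [ "$i" -le "$runs" ]; do
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails/$runs runs failed"
}

# e.g. count_failures 30 openstack endpoint list
```

With two backends behind the proxy and one of them bad, a failure rate near 1 in 2 (or 1 in N for N backends) would support the dead-record theory.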
07:22:08 <May-meimei> SAshish: you can try restarting the api service, I am not sure it will work on fuel
07:22:50 <SAshish> and service unavailable with nova list also
07:22:52 <SAshish> okay
07:23:33 <SAshish> keystone?
07:23:44 <joehuang> you mean sometimes nova list also fails?
07:23:53 <SAshish> yes
07:25:04 <joehuang> you can check the records configured in HAProxy, and make sure the APIs to be load balanced are alive
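That check can be sketched against HAProxy's stats socket on a controller node. The socket path below is an assumption (the actual path is set by the "stats socket" line in /etc/haproxy/haproxy.cfg on the deployed node):

```shell
# Filter HAProxy "show stat" CSV output down to backends that are not UP.
# In the CSV: column 1 = proxy name, 2 = server name, 18 = status.
haproxy_down_backends() {
    awk -F, 'NR > 1 && $2 != "FRONTEND" && $18 != "UP" && $18 != "OPEN" \
             {print $1, $2, $18}'
}

# On the controller (socket path is an assumption):
#   echo "show stat" | socat stdio /var/lib/haproxy/stats | haproxy_down_backends
```

Any line printed names a backend server that haproxy still has configured but considers down, i.e. a "dead record" that would intermittently swallow requests.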
07:26:14 <joehuang> a better way is to use nova --debug list, so that we know which endpoint failed to respond
07:26:40 <SAshish> yeah, kept the same to check
07:26:51 <joehuang> or openstack --debug endpoint list to check which endpoint failed
07:26:59 <joehuang> great
07:27:22 <joehuang> can you tell me how to access the controller node, I can't ssh into the controller node
07:27:51 <SAshish> ssh root@10.20.0.2
07:27:59 <SAshish> from jumphost login to fuel
07:28:01 <SAshish> ssh root@10.20.0.2
07:28:10 <SAshish> password => r00tme
07:28:29 <SAshish> once you are into fuel node
07:28:32 <SAshish> login to controller
07:28:34 <SAshish> ssh 10.20.0.3
07:29:05 <SAshish> it doesn't need a password, you will land on the first controller
07:29:32 <joehuang> hi, found one
07:29:36 <joehuang> "Openstack Cloudformation Service", "name": "heat-cfn"}, {"id": "cc89788680214876a070c1fce9703650", "enabled": true, "type": "object-store", "description": "Openstack Object-Store Service", "name": "swift"}, {"id": "e07966459621474ab19231cc369e685a", "enabled": true, "type": "image", "description": "OpenStack Image Service", "name": "glance"}]}
07:29:45 <joehuang> No service with a type, name or ID of 'e07966459621474ab19231cc369e685a' exists.
07:29:56 <joehuang> it's swift
07:30:23 <joehuang> http://192.168.0.2:35357/v2.0/
07:30:39 <joehuang> the server give feedback from http://192.168.0.2:35357/v2.0/
07:31:14 <joehuang> http://192.168.0.2:35357/v2.0/OS-KSADM/services
07:31:52 <joehuang> sorry not swift, but glance {"id":
07:31:52 <joehuang> "e07966459621474ab19231cc369e685a", "enabled": true, "type": "image", "description": "OpenStack Image Service",
07:31:55 <joehuang> "name": "glance"}]}
07:34:07 <joehuang> maybe we should re-install the environment; it has been running for several weeks and may not be as clean as it was
07:34:54 <sorantis> this will also mean that the whole multi-region setup has to be reconfigured
07:35:37 <joehuang> what's your suggestion?
07:36:12 <sorantis> use retry in the long run
07:38:09 <SAshish> has compass done some workaround?
07:38:11 <joehuang> retry should be a mechanism for all commands to SUT in functest
07:38:32 <sorantis> but only this one is causing delay
07:38:54 <sorantis> functest also checks nova, neutron, cinder separately
07:38:58 <sorantis> the commands pass
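A retry mechanism of the kind Joe proposes for functest's SUT commands could look like this shell sketch; the attempt count and delay are illustrative assumptions, not functest's actual settings:

```shell
# Retry a command up to a fixed number of attempts, pausing between tries.
# Masks transient failures such as a request landing on a dead backend,
# while still failing if the problem is persistent.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while true; do
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# e.g. retry 5 2 openstack endpoint list
```

The trade-off sorantis raises still applies: retries hide the symptom and add delay, so they belong around individual SUT calls, not as a substitute for fixing the unstable API.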
07:40:06 <joehuang> sometimes it happens with glance image-list
07:40:22 <SAshish> I have noticed with nova list also once
07:40:29 <joehuang> yes
07:41:01 <joehuang> there are many more commands to the SUT (system under test) than those in check_os.sh
07:41:26 <joehuang> or we can skip the check_os and health check job
07:41:59 <sorantis> I’ve restarted cinder and nova api
07:42:00 <joehuang> to Meimei, can we disable check_os and health check?
07:44:18 <joehuang> to Dimitri, restart cinder/nova on both controller nodes or just one?
07:44:49 <sorantis> restarting rabbit
07:44:50 <sorantis> yes
07:44:57 <sorantis> on both
07:45:22 <joehuang> ok
07:46:55 <sorantis> feels much faster now that I restarted rabbit
07:48:32 <joehuang> this issue is still there: openstack endpoint list
07:48:33 <joehuang> No service with a type, name or ID of '87d59a31161b40aeb966a02e03beaf6d' exists.
07:52:36 <joehuang> Need more time to find out why, let's work together to fix it
07:53:25 <SAshish> okay. So Joe, how did the release go?
07:53:27 <sorantis> restarted keystone
07:54:00 <sorantis> I checked the openstack bugs. Apparently keystone responds slowly to the 'openstack' commands
07:55:15 <joehuang> release will be on Sept 22 for Colorado 1.0
07:55:33 <joehuang> we need to have a stable job running
07:55:49 <joehuang> it also helps our new feature development
07:56:54 <joehuang> even bug fixes need all test cases to pass in the daily job
07:57:26 <joehuang> ok, time is up, let's work offline to fix it
07:57:56 <SAshish> some bugs are there which are targeted for the next release
07:58:13 <SAshish> which should not have any effect on current release
07:58:39 <sorantis> we have a new set of jenkins jobs
07:58:49 <sorantis> which one is in development focus?
08:00:14 <joehuang> what's the new set of jenkins job, you mean in OpenStack?
08:00:18 <joehuang> or OPNFV
08:00:37 <sorantis> opnfv
08:01:04 <joehuang> don't understand
08:01:47 <sorantis> we have now this
08:01:50 <joehuang> after keystone restart
08:01:56 <sorantis> multisite-kingbird-daily-colorado
08:02:08 <sorantis> multisite-kingbird-daily-master
08:02:28 <joehuang> oh, both failed
08:02:45 <sorantis> they run in parallel
08:02:52 <sorantis> and I guess, they’ve blocked each other
08:03:05 <sorantis> since they’re running for over 6hrs already
08:03:30 <joehuang> disable the colorado job, we only need to maintain the master daily job
08:03:55 <SAshish> then we need to have two deploy scripts?
08:04:05 <SAshish> a parameterized deploy script
08:04:05 <joehuang> no, only master one
08:04:26 <SAshish> then what about colorado job
08:05:11 <joehuang> the colorado job is launched from the releng colorado branch, but we don't need it to run
08:05:21 <joehuang> the master one can verify the codes
08:06:50 <sorantis> i cannot modify the jobs
08:07:00 <sorantis> ok, i have to go now
08:07:05 <sorantis> talk offline
08:07:05 <joehuang> may need help from Meimei
08:07:09 <joehuang> ok
08:07:14 <joehuang> #endmeeting