14:04:34 #startmeeting FD.io CSIT project meetings
14:04:34 Meeting started Wed Oct 9 14:04:34 2019 UTC. The chair is mackonstan. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:04:34 Useful Commands: #action #agreed #help #info #idea #link #topic.
14:04:34 The meeting name has been set to 'fd_io_csit_project_meetings'
14:04:41 #chair
14:04:41 Current chairs: mackonstan
14:04:53 #info Jan Gelety
14:05:23 #topic Agenda bashing
14:06:24 #topic FD.io CSIT physical labs
14:07:22 #info Juraj: re 2 new ThunderX servers for vpp_device - in contact with LFN IT + Vexxhost re physical install and onboarding
14:07:54 #info Juraj: will update the testbed_specifications.md in the rep
14:08:49 #info Dave Wallace
14:10:38 #info Ed: 1ru CLX servers (with 8280) install, never got IP addresses from LF IT/Vexxhost, re-asked, waiting for response. Once received will update testbed_specifications.md in the CSIT repo.
14:12:34 #info Maciek: we had a ticket open for 2ru CLX servers and now it got closed.
14:13:17 #info Maciek: Ed pls open a separate ticket for the three 1ru CLX servers for CI/CD infra and backend work.
14:17:47 #info Ed: having ongoing issues with vpp_device machines going "flaky" after Jenkins "adventures" (crashes, unplanned downtime). Can we use 3 new CLX servers (originally destined for data processing backend plotlydash, s5ci proto) to help here?
14:18:12 #topic Inputs from LFN and FD.io projects
14:21:22 #info VPP - Dave: no updates on vpp v19.08.2.
14:22:08 #info VPP - Dave: vpp v20.01 rls milestones published
14:22:15 #link https://wiki.fd.io/view/Projects/vpp/Release_Plans/Release_Plan_20.01
14:22:53 #info TSC - Vratko: last meeting finished quickly, nothing CSIT related
14:23:39 #topic Releases - CSIT-1908.1 report
14:24:18 #info Maciek: CSIT-1908.1 report published but not announced, need to review data and compare across 19.08, then send announce email
14:25:31 #info Maciek/Vratko/Peter: CSIT-1908.1 - all tests have been finished. No more open points.
14:26:27 #info Jan: confirmed all 1908.1 jobs are finished. Need to summarise all resources taken by 1908.1 maintenance rls.
14:27:50 #topic CSIT-2001
14:29:00 #info Vratko: improving VPP API change process to make it more reliable and reduce the false positive.
14:30:42 #info Vratko: complete VAT to PAPI migration - address the API execution efficiency for scale tests.
14:32:05 #info Jan: Python 2.7 to 3x migration, .md analysis and migration plan coming to gerrit shortly.
14:33:51 #info Vratko: job for bisecting performance regressions (leveraging per patch perf test work).
14:35:21 #info Maciek/Tibor/Peter: a standalone test data processing backend - datastore, analytics/query engine. Stop relying on Nexus as results file store.
14:36:19 #info Vratko/Tibor/Peter: Making use of HDRhistogram in TRex, and higher resolution of latency data for performance tests.
14:37:51 #info Vratko/Maciek: reconf tests methodology - see if we can apply b2b-frame methodology described in ietf bmwg draft.
14:40:44 #info Peter/Maciek: per vpp node efficiency - today storing elog capturing thread barriers - for perfmon we are missing an API to catch two values for the run, we would need to check if this got resolved.
14:42:10 #info Peter: start with a new telemetry approach - per packet path analysis, similarly how it's done in NFVbench, see how this could be applied to NFV density tests and actually all other tests.
14:43:38 #info Maciek/Tibor: trending regressions - add announce emails to csit-report.
14:45:25 #info Vratko: anomaly detection - still seeing some noise, more data doesn't seem to be helping, no pattern. Need more inside knowledge, white-box, need more telemetry data from tests to see if any correlation can be found. Affects trending anomaly detection, per patch perf, perf bisecting.
14:47:24 #info Peter/Maciek: vhost/memif - adding vpp-in-container with ipsec.
14:48:43 #info Peter: seeing the new tests being pushed for Load-Balancer, baseline tests
14:49:25 #info last LB is for Maglev
14:49:30 #info Peter: seeing new tests for "NAT44 L3 DSR"
14:50:11 #info Vratko: improve suite generator for heat-map graphed tests e.g. NFV density tests
14:50:43 #info Maciek: any other work in services and L47 space?
14:54:13 #info Juraj: testbeds - Arm - adding more ThunderX machines for vpp_device to run csit-vpp and vpp-csit device tests
14:54:48 #info Juraj: productize per VPP patch (with voting?) vpp-csit device tests for Arm.
14:56:54 #info Goal: add more vpp_device tests for better VPP API coverage, as those are executed per vpp patch and per csit patch
14:57:32 #topic Operational status
15:01:40 #info Ed: situation right now - stabilized back to normal - root cause not known. Some issues with vpp_device machines (Peter handling). A simple Registry app "stopped intelligently responding", redundancy didn't kick in. On Registry recovery, all queued jobs kicked off, and overloaded Jenkins with ~160 jobs in the queue (LFN ONAP can handle many more). Jenkins tipped over handling number of requests to Nomad cluster (Nomad can handle
15:01:40 many many more). Took a while to recover from Jenkins crash. Suspecting some other factor in DC network that impacted the recovery.
15:03:01 #info Ed: adding more healthchecks to prevent Registry app HA failure.
15:06:54 #info Peter: 10 servers lost mgmt IP addresses, configured as static without DHCP. Unclear how it happened. Root cause analysis in progress. (Some external system interference??)(ONAP servers experienced similar situation this week).
15:11:10 #endmeeting