14:04:34 <mackonstan> #startmeeting FD.io CSIT project meetings
14:04:34 <collabot_> Meeting started Wed Oct  9 14:04:34 2019 UTC.  The chair is mackonstan. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:04:34 <collabot_> Useful Commands: #action #agreed #help #info #idea #link #topic.
14:04:34 <collabot_> The meeting name has been set to 'fd_io_csit_project_meetings'
14:04:41 <mackonstan> #chair
14:04:41 <collabot_> Current chairs: mackonstan
14:04:53 <jgelety> #info Jan Gelety
14:05:23 <mackonstan> #topic Agenda bashing
14:06:24 <mackonstan> #topic FD.io CSIT physical labs
14:07:22 <mackonstan> #info Juraj: re 2 new ThunderX servers for vpp_device - in contact with LFN IT + Vexxhost re physical install and onboarding
14:07:54 <mackonstan> #info Juraj: will update the testbed_specifications.md in the repo
14:08:49 <dwallacelf> #info Dave Wallace
14:10:38 <mackonstan> #info Ed: 1ru CLX servers (with 8280) installed; never got IP addresses from LF IT/Vexxhost, re-asked, waiting for response. Once received, will update testbed_specifications.md in the CSIT repo.
14:12:34 <mackonstan> #info Maciek: we had a ticket open for 2ru CLX servers and it has now been closed.
14:13:17 <mackonstan> #info Maciek: Ed pls open a separate ticket for the three 1ru CLX servers for CI/CD infra and backend work.
14:17:47 <mackonstan> #info Ed: having ongoing issues with vpp_device machines going "flaky" after Jenkins "adventures" (crashes, unplanned downtime). Can we use 3 new CLX servers (originally destined for data processing backend plotlydash, s5ci proto) to help here?
14:18:12 <mackonstan> #topic Inputs from LFN and FD.io projects
14:21:22 <mackonstan> #info VPP - Dave: no updates on vpp v19.08.2.
14:22:08 <mackonstan> #info VPP - Dave: vpp v20.01 rls milestones published
14:22:15 <mackonstan> #link https://wiki.fd.io/view/Projects/vpp/Release_Plans/Release_Plan_20.01
14:22:53 <mackonstan> #info TSC - Vratko: last meeting finished quickly, nothing CSIT related
14:23:39 <mackonstan> #topic Releases - CSIT-1908.1 report
14:24:18 <mackonstan> #info Maciek: CSIT-1908.1 report published but not announced, need to review data and compare across 19.08, then send announce email
14:25:31 <mackonstan> #info Maciek/Vratko/Peter: CSIT-1908.1 - all tests have been finished. No more open points.
14:26:27 <mackonstan> #info Jan: confirmed all 1908.1 jobs are finished. Need to summarise all resources taken by 1908.1 maintenance rls.
14:27:50 <mackonstan> #topic CSIT-2001
14:29:00 <mackonstan> #info Vratko: improving VPP API change process to make it more reliable and reduce false positives.
14:30:42 <mackonstan> #info Vratko: complete VAT to PAPI migration - address the API execution efficiency for scale tests.
14:32:05 <mackonstan> #info Jan: Python 2.7 to 3.x migration, .md analysis and migration plan coming to gerrit shortly.
14:33:51 <mackonstan> #info Vratko: job for bisecting performance regressions (leveraging per patch perf test work).
14:35:21 <mackonstan> #info Maciek/Tibor/Peter: a standalone test data processing backend - datastore, analytics/query engine. Stop relying on Nexus as results file store.
14:36:19 <mackonstan> #info Vratko/Tibor/Peter: Making use of HDRhistogram in TRex, and higher resolution of latency data for performance tests.
14:37:51 <mackonstan> #info Vratko/Maciek: reconf tests methodology - see if we can apply b2b-frame methodology described in ietf bmwg draft.
14:40:44 <mackonstan> #info Peter/Maciek: per vpp node efficiency - today storing elog captures of thread barriers - for perfmon we are missing an API to capture two values per run; we would need to check whether this got resolved.
14:42:10 <mackonstan> #info Peter: start with a new telemetry approach - per packet path analysis, similar to how it's done in NFVbench; see how this could be applied to NFV density tests and, in fact, all other tests.
14:43:38 <mackonstan> #info Maciek/Tibor: trending regressions - add announce emails to csit-report.
14:45:25 <mackonstan> #info Vratko: anomaly detection - still seeing some noise, more data doesn't seem to be helping, no pattern. Need more inside knowledge, white-box, need more telemetry data from tests to see if any correlation can be found. Affects trending anomaly detection, per patch perf, perf bisecting.
14:47:24 <mackonstan> #info Peter/Maciek: vhost/memif - adding vpp-in-container with ipsec.
14:48:43 <mackonstan> #info Peter: seeing new tests being pushed for Load-Balancer baseline tests
14:49:25 <mackonstan> #info last LB is for Maglev
14:49:30 <mackonstan> #info Peter: seeing new tests for "NAT44 L3 DSR"
14:50:11 <mackonstan> #info Vratko: improve suite generator for heat-map graphed tests e.g. NFV density tests
14:50:43 <mackonstan> #info Maciek: any other work in services and L47 space?
14:54:13 <mackonstan> #info Juraj: testbeds - Arm - adding more ThunderX machines for vpp_device to run csit-vpp and vpp-csit device tests
14:54:48 <mackonstan> #info Juraj: productize per-VPP-patch (with voting?) vpp-csit device tests for Arm.
14:56:54 <mackonstan> #info Goal: add more vpp_device tests for better VPP API coverage, as those are executed per vpp patch and per csit patch
14:57:32 <mackonstan> #topic Operational status
15:01:40 <mackonstan> #info Ed: situation right now - stabilized back to normal - root cause not known. Some issues with vpp_device machines (Peter handling). A simple Registry app "stopped intelligently responding", redundancy didn't kick in. On Registry recovery, all queued jobs kicked off and overloaded Jenkins with ~160 jobs in the queue (LFN ONAP can handle many more). Jenkins tipped over handling the number of requests to the Nomad cluster (Nomad can handle
15:01:40 <mackonstan> many, many more). Took a while to recover from the Jenkins crash. Suspecting some other factor in the DC network that impacted the recovery.
15:03:01 <mackonstan> #info Ed: adding more healthchecks to prevent Registry app HA failure.
15:06:54 <mackonstan> #info Peter: 10 servers lost mgmt IP addresses; configured as static without DHCP. Unclear how it happened. Root cause analysis in progress. (Some external system interference??) (ONAP servers experienced a similar situation this week).
15:11:10 <mackonstan> #endmeeting