#fdio-csit: FD.io CSIT project meetings

Meeting started by mackonstan at 14:04:34 UTC.

Meeting summary

    1. Jan Gelety (jgelety, 14:04:53)

  1. Agenda bashing (mackonstan, 14:05:23)
  2. FD.io CSIT physical labs (mackonstan, 14:06:24)
    1. Juraj: re 2 new ThunderX servers for vpp_device - in contact with LFN IT + Vexxhost re physical install and onboarding (mackonstan, 14:07:22)
    2. Juraj: will update the testbed_specifications.md in the repo (mackonstan, 14:07:54)
    3. Dave Wallace (dwallacelf, 14:08:49)
    4. Ed: 1ru CLX servers (with 8280) install, never got IP addresses from LF IT/Vexxhost, re-asked, waiting for response. Once received will update testbed_specifications.md in the CSIT repo. (mackonstan, 14:10:38)
    5. Maciek: we had a ticket open for 2ru CLX servers and now it got closed. (mackonstan, 14:12:34)
    6. Maciek: Ed pls open a separate ticket for the three 1ru CLX servers for CI/CD infra and backend work. (mackonstan, 14:13:17)
    7. Ed: having ongoing issues with vpp_device machines going "flaky" after Jenkins "adventures" (crashes, unplanned downtime). Can we use 3 new CLX servers (originally destined for data processing backend plotlydash, s5ci proto) to help here? (mackonstan, 14:17:47)

  3. Inputs from LFN and FD.io projects (mackonstan, 14:18:12)
    1. VPP - Dave: no updates on vpp v19.08.2. (mackonstan, 14:21:22)
    2. VPP - Dave: vpp v20.01 rls milestones published (mackonstan, 14:22:08)
    3. https://wiki.fd.io/view/Projects/vpp/Release_Plans/Release_Plan_20.01 (mackonstan, 14:22:15)
    4. TSC - Vratko: last meeting finished quickly, nothing CSIT related (mackonstan, 14:22:53)

  4. Releases - CSIT-1908.1 report (mackonstan, 14:23:39)
    1. Maciek: CSIT-1908.1 report published but not announced, need to review data and compare across 19.08, then send announce email (mackonstan, 14:24:18)
    2. Maciek/Vratko/Peter: CSIT-1908.1 - all tests have been finished. No more open points. (mackonstan, 14:25:31)
    3. Jan: confirmed all 1908.1 jobs are finished. Need to summarise all resources taken by 1908.1 maintenance rls. (mackonstan, 14:26:27)

  5. CSIT-2001 (mackonstan, 14:27:50)
    1. Vratko: improving VPP API change process to make it more reliable and reduce false positives. (mackonstan, 14:29:00)
    2. Vratko: complete VAT to PAPI migration - address the API execution efficiency for scale tests. (mackonstan, 14:30:42)
    3. Jan: Python 2.7 to 3.x migration, .md analysis and migration plan coming to gerrit shortly. (mackonstan, 14:32:05)
    4. Vratko: job for bisecting performance regressions (leveraging per patch perf test work). (mackonstan, 14:33:51)
    5. Maciek/Tibor/Peter: a standalone test data processing backend - datastore, analytics/query engine. Stop relying on Nexus as results file store. (mackonstan, 14:35:21)
    6. Vratko/Tibor/Peter: Making use of HDRhistogram in TRex, and higher resolution of latency data for performance tests. (mackonstan, 14:36:19)
    7. Vratko/Maciek: reconf tests methodology - see if we can apply b2b-frame methodology described in ietf bmwg draft. (mackonstan, 14:37:51)
    8. Peter/Maciek: per vpp node efficiency - today storing elog capturing thread barriers - for perfmon we are missing an API to catch two values for the run, we would need to check if this got resolved. (mackonstan, 14:40:44)
    9. Peter: start with a new telemetry approach - per packet path analysis, similar to how it's done in NFVbench, see how this could be applied to NFV density tests and actually all other tests. (mackonstan, 14:42:10)
    10. Maciek/Tibor: trending regressions - add announce emails to csit-report. (mackonstan, 14:43:38)
    11. Vratko: anomaly detection - still seeing some noise, more data doesn't seem to be helping, no pattern. Need more inside knowledge, white-box, need more telemetry data from tests to see if any correlation can be found. Affects trending anomaly detection, per patch perf, perf bisecting. (mackonstan, 14:45:25)
    12. Peter/Maciek: vhost/memif - adding vpp-in-container with ipsec. (mackonstan, 14:47:24)
    13. Peter: seeing the new tests being pushed for Load-Balancer, baseline tests (mackonstan, 14:48:43)
    14. last LB is for Maglev (mackonstan, 14:49:25)
    15. Peter: seeing new tests for "NAT44 L3 DSR" (mackonstan, 14:49:30)
    16. Vratko: improve suite generator for heat-map graphed tests e.g. NFV density tests (mackonstan, 14:50:11)
    17. Maciek: any other work in services and L47 space? (mackonstan, 14:50:43)
    18. Juraj: testbeds - Arm - adding more ThunderX machines for vpp_device to run csit-vpp and vpp-csit device tests (mackonstan, 14:54:13)
    19. Juraj: productize per VPP patch (with voting?) vpp-csit device tests for Arm. (mackonstan, 14:54:48)
    20. Goal: add more vpp_device tests for better VPP API coverage, as those are executed per vpp patch and per csit patch (mackonstan, 14:56:54)

  6. Operational status (mackonstan, 14:57:32)
    1. Ed: situation right now - stabilized back to normal - root cause not known. Some issues with vpp_device machines (Peter handling). A simple Registry app "stopped intelligently responding", redundancy didn't kick in. On Registry recovery, all queued jobs kicked off, and overloaded Jenkins with ~160 jobs in the queue (LFN ONAP can handle many more). Jenkins tipped over handling number of requests to Nomad cluster (Nomad can handle (mackonstan, 15:01:40)
    2. Ed: adding more healthchecks to prevent Registry app HA failure. (mackonstan, 15:03:01)
    3. Peter: 10 servers lost mgmt IP addresses, configured as static without DHCP. Unclear how it happened. Root cause analysis in progress. (Some external system interference?) (ONAP servers experienced a similar situation this week.) (mackonstan, 15:06:54)


Meeting ended at 15:11:10 UTC.

Action items

  1. (none)


People present (lines said)

  1. mackonstan (46)
  2. collabot_ (4)
  3. jgelety (1)
  4. dwallacelf (1)


Generated by MeetBot 0.1.4.