Platform Incident

Between 19:08 CET and 23:04 CET on November 22, 2023, and from 09:38 CET to 10:25 CET on November 23, 2023, customers connected to the Amsterdam infrastructure experienced active flapping of LACP and BGP sessions. At the peak, the total traffic passing through the AMS-IX platform dropped to 2.1 Tb/s. BGP sessions were reduced from 885 to 550 (IPv4) and 800 to 450 (IPv6).

Below is a comprehensive overview of the incident, including the root cause and the measures taken to resolve it. We are also sharing follow-up actions aimed at reducing the likelihood of a similar event in the future.

Incident Summary

Root Cause

The LACP leakage originated from a Juniper PE switch (stub-eq5-247), which actively propagated LACP packets received from customer-owned equipment.

Consequences

These LACP packets triggered the teardown of LACP LAGs for other customers on both Juniper and Extreme (SLX) switches. The resulting flapping caused resource starvation and full buffers, leading to RSVP timeout errors. Impacted Juniper PEs aggressively sent RSVP Path Error messages, which introduced further issues on the Extreme SLXs.

Why Did This Happen

Mitigation techniques were in place, such as LACP access control lists applied on all LACP links. However, the LACP-generating customer equipment was connected to a non-LACP link on a Juniper switch, where no such ACL was applied.

The Juniper out-LACP ACL was not fully operational, resulting in flapping sessions affecting other customers on the Juniper PE. The investigation also found that the out-LACP ACL on the Extreme SLXs did not perform as anticipated. AMS-IX engineers confirmed that the ACL had worked in the past, but it remains unclear whether this deviation is due to a bug or a change in syntax since the last SLX version upgrade.
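To illustrate what an LACP filter has to match, the following is a minimal Python sketch, not the ACL syntax used on the Juniper or Extreme SLX switches. It relies on the IEEE 802.3 Slow Protocols conventions (destination MAC 01:80:C2:00:00:02, EtherType 0x8809, subtype 1 for LACP) and assumes untagged frames. The function names and the `port_is_lacp_member` flag are hypothetical and exist only for illustration.

```python
import struct

# IEEE 802.3 Slow Protocols constants.
SLOW_PROTOCOLS_MAC = bytes.fromhex("0180c2000002")
SLOW_PROTOCOLS_ETHERTYPE = 0x8809
LACP_SUBTYPE = 0x01


def is_lacp_frame(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame carries an LACPDU."""
    # Need 6 (dst) + 6 (src) + 2 (EtherType) + 1 (subtype) bytes.
    if len(frame) < 15:
        return False
    dst = frame[0:6]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    subtype = frame[14]
    return (
        dst == SLOW_PROTOCOLS_MAC
        and ethertype == SLOW_PROTOCOLS_ETHERTYPE
        and subtype == LACP_SUBTYPE
    )


def should_drop(frame: bytes, port_is_lacp_member: bool) -> bool:
    """Hypothetical policy: drop LACP frames seen on ports not configured for LACP."""
    return is_lacp_frame(frame) and not port_is_lacp_member
```

Under this sketch, an LACPDU arriving on a non-LACP customer port would be dropped, while the same frame on a configured LACP link would be accepted.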

Follow-up Actions

  • During the incident, an ACL was applied to non-LACP links, blocking LACP frames originating from static LAGs and non-LAG customers.
  • A detailed postmortem was conducted by AMS-IX on November 24, 2023.
  • The LACP ACL creation logic was enhanced in the provisioning stack to ensure newly created links have the ACL applied.
  • AMS-IX will review and update the out-LACP ACL on Juniper and Extreme SLX. Engineers will confirm that the ACL performs as designed.
  • Investigation is underway to implement alerts for observing Slow Protocol BPDUs on the platform.
  • The tech-l communication policy will be revised, incorporating rules on the frequency of updates from AMS-IX based on incident severity.
  • Additional follow-up actions have been defined to improve AMS-IX’s internal communication and processes during an incident.
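
The alerting idea in the list above could look roughly like the following Python sketch. It is an assumption about one possible approach, not a description of the AMS-IX monitoring stack: the `SlowProtocolEvent` type, the `ports_to_alert` function, the threshold, and the port names are all hypothetical. It flags ports that are not configured for LACP but on which Slow Protocol frames have been observed.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Set


class SlowProtocolEvent(NamedTuple):
    """One observed Slow Protocol frame (hypothetical record format)."""
    port: str
    subtype: int  # IEEE 802.3 subtype: 1 = LACP, 2 = Marker, 3 = OAM


def ports_to_alert(
    events: Iterable[SlowProtocolEvent],
    lacp_ports: Set[str],
    threshold: int = 1,
) -> List[str]:
    """Return ports outside lacp_ports with at least `threshold` observed frames."""
    counts = Counter(e.port for e in events if e.port not in lacp_ports)
    return sorted(port for port, seen in counts.items() if seen >= threshold)


# Example with made-up port names.
observed = [
    SlowProtocolEvent("xe-0/0/1", 1),
    SlowProtocolEvent("xe-0/0/1", 1),
    SlowProtocolEvent("xe-0/0/2", 1),
]
alerts = ports_to_alert(observed, lacp_ports={"xe-0/0/2"})
```

Here `alerts` contains only `xe-0/0/1`, because `xe-0/0/2` is a configured LACP port where such frames are expected.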

We apologize for any inconvenience caused by this outage.
