Between 19:08 CET and 23:04 CET on November 22, 2023, and from 09:38 CET to 10:25 CET on November 23, 2023, customers connected to the Amsterdam infrastructure experienced active flapping of LACP and BGP sessions. At the peak, the total traffic passing through the AMS-IX platform dropped to 2.1 Tb/s. BGP sessions were reduced from 885 to 550 (IPv4) and 800 to 450 (IPv6).
Below is a comprehensive overview of the incident, including the root cause and the measures taken to resolve it. We are also sharing follow-up actions aimed at limiting the possibility of a similar event in the future.
The LACP leakage originated from a Juniper PE switch (stub-eq5-247), which actively propagated LACP packets from customer-owned equipment.
These LACP packets triggered the teardown of LACP LAGs for other customers on both Juniper and Extreme (SLX) switches. The resulting flapping induced resource starvation and full buffers, leading to RSVP timeout errors. Impacted Juniper PEs aggressively sent RSVP Path Error messages, introducing further issues on the Extreme SLXs.
Why did this happen?
Mitigation techniques, such as LACP access control lists on all LACP links, were in place. However, the customer equipment generating the LACP packets was connected to a non-LACP link on a Juniper switch.
The Juniper out-LACP ACL was not fully operational, resulting in flapping sessions affecting other customers on the Juniper PE. The investigation also uncovered that the out-LACP ACL on the Extreme SLXs did not perform as anticipated. AMS-IX engineers confirmed that the ACL had worked in the past, but it remains unclear whether this deviation is due to a bug or a change in syntax since the last SLX version upgrade.
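To make concrete what an LACP ACL has to match, the following is a minimal Python sketch of the classification step. It is not the AMS-IX, Juniper, or SLX configuration. It relies only on the standard identifiers for LACP frames: the Slow Protocols multicast address 01:80:C2:00:00:02, EtherType 0x8809, and subtype 0x01. The function names `is_lacp_frame` and `should_forward` are hypothetical, and the sketch assumes untagged frames.

```python
import struct

# Standard identifiers for LACP (IEEE Slow Protocols).
SLOW_PROTOCOLS_MAC = bytes.fromhex("0180c2000002")
SLOW_PROTOCOLS_ETHERTYPE = 0x8809
LACP_SUBTYPE = 0x01


def is_lacp_frame(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame carries an LACPDU.

    Assumption: no VLAN tag is present, so the EtherType sits at
    bytes 12-13 and the Slow Protocols subtype at byte 14.
    """
    if len(frame) < 15:
        return False
    dst_mac = frame[0:6]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    subtype = frame[14]
    return (
        dst_mac == SLOW_PROTOCOLS_MAC
        and ethertype == SLOW_PROTOCOLS_ETHERTYPE
        and subtype == LACP_SUBTYPE
    )


def should_forward(frame: bytes) -> bool:
    """Hypothetical filter decision: never forward LACPDUs.

    The check is applied to every frame regardless of whether the
    ingress port is configured as a LAG member, so LACP traffic from
    a non-LACP link is dropped as well.
    """
    return not is_lacp_frame(frame)
```

The design point illustrated here is that the match depends only on the frame contents, not on how the port is configured, so the same rule covers both LACP and non-LACP links.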