
An AMS-IX Story
Garri Djavadyan
Senior Network Engineer
Recently, we completed an upgrade of our management network after an extended period of preparation. While the upgrade was necessary to maintain support coverage, it also gave us the opportunity to reflect on operational lessons learned from the previous network. This article provides a brief overview of the original design, its limitations, the decisions made to address them, and the migration process.
The story begins in 2022 when, after a major upgrade of our Pluribus Netvisor-based data centre fabric to the latest supported version, we learned that new major Netvisor versions would no longer support our Dell S4048(T)-ON white-box switches, which are based on the Broadcom Trident 2/2+ platform. Although this was disappointing, particularly after a year spent preparing the old fabric for the upgrade, we had to accept the situation and move forward.
Initially, we determined that the most cost-effective and fastest way to obtain a fully supported fabric would be to reuse our Dell switches by installing Dell’s own network operating system (NOS), OS10, on them. Our lab test results were promising. However, it turned out that our hardware support period was tied to the purchase date, irrespective of the general end of hardware support (EoS) date, limiting our support coverage to mid-2023, which was not very useful.
At that point, it became clear that we would need to go through the full-fledged product selection process. Although this realisation was painful, it ultimately gave us the freedom to select a solution best aligned with our new requirements and capable of supporting us for at least seven years.
While defining the requirements for the new management fabric, it was unanimously agreed that the new solution must be based on standards to overcome the limitations we had experienced while operating the Pluribus Netvisor-based fabric. This section discusses some of the most prominent of those limitations.
When Pluribus Netvisor was introduced at AMS-IX, it was a revolutionary solution offering an unconventional approach to managing VXLAN fabrics: the controller function was distributed to every node of the fabric. This meant that the whole fabric could be monitored and managed from any node, without the need for an external controller or orchestrator. For example, one could simply check from which switch and port a specific IP or MAC address was learned by entering a command like “port-show ip 2001:db8::1” or “port-show mac 02:11:22:33:44:55”.
Similarly, service provisioning operations were simplified too. Even though VXLAN tunnels were static, the provisioning complexity was completely under the hood. For example, to extend a specific VXLAN to a new leaf by building tunnels from all the leaves already participating in it, a simple command like “vtep-vxlan-add name leaf-24 vxlan 100” could be used.
However, as is often the case, this convenience came at a cost, which is discussed below.
Because every Pluribus Netvisor node is a tightly coupled part of the fabric, every node must run the same NOS version. This requirement complicates upgrades significantly: the whole fabric has to be upgraded and rebooted in a single maintenance window, as doing it node by node would result in two or more isolated fabric instances that are not compatible with each other.
Because many things can go wrong when the whole network is rebooted into a new major version, a lot of effort must be put into validating the upgrade procedure and the switches’ readiness for the upgrade. For instance, together with the Pluribus Networks engineers, it took us several months to prepare our switches for the major version upgrade that took place in 2021.
Namely, we had to RMA several switches due to degraded SSD metrics. We also had to upgrade the SSD firmware from the ONIE environment on all the switches and physically power-cycle them for the changes to take effect. Similarly, some switches needed physical power-cycling due to signs of an unstable I2C bus. In addition, we had to re-install the NOS on several switches because some drives had been partitioned differently during the initial setup. All of this was needed to eliminate the potential risk of losing nodes during the major upgrade.
What made the upgrade maintenance less stressful was the fact that, in addition to the management network, we also maintain an independent network of terminal servers that allows us to manage networking devices via serial ports when the Ethernet network is unavailable. Therefore, during the Big Bang night, we had 23 serial sessions open to all management switches to monitor and, if necessary, control the process. As an additional precaution, several images of Isidore of Seville, the patron saint of the internet, were printed and placed on the NOC walls.
All our preparation efforts paid off, and the upgrade went smoothly. However, there was a consensus that the next generation of the management network must support less impactful and less complicated upgrades.
*Image: Saint Isidore of Seville, from the portal of the National Archaeological Museum of Spain, designed by architect Francisco Jareño and built from 1866 to 1892.
Adding new management nodes became increasingly difficult over the years. To join the fabric, a new node must replay all the transactions the fabric has seen in its lifetime. For example, a newly added switch had to create and delete fabric-wide objects, such as VXLANs and their tunnels, that had been in use, even temporarily, since the creation of the fabric. Replaying around 2,000 transactions (as of 2022) was extremely unreliable, and newly added nodes were crashing before they could complete the replay, making further expansion impractical.
As already mentioned, the distributed fabric functionality allowed NOC engineers to run commands in a fabric-wide context, which was extremely useful for read/get operations. Now consider the impact if an operator forgot to switch into the local context before disabling a port or VLAN on a specific switch. Yes, it happened at least once: the same port was disabled on all nodes at the same time, and this caused an internal incident. The fact that the CLI was in the fabric-wide context by default did not help at all.
Because the fabric configuration must be consistent across all nodes, fabric-wide write operations to the configuration database are blocked while any node is offline. For example, if node C is temporarily offline, Netvisor will not allow new VXLAN tunnels to be added between nodes A and B, even if there is no intention to provision any new tunnels on node C. Therefore, every node added to the fabric was effectively a critical member of it, irrespective of how important the node’s role actually was.
MLAGs offer a way to elegantly multihome end devices to multiple leaves. Traditionally, vendors have had their own proprietary implementations, but the idea is that from the end devices’ perspective, they have a normal LACP LAG of two or more ports that are connected to different switches.
In most cases, switches offering multihoming services to edge devices must synchronise state with each other, so a dedicated cluster link/LAG is normally used for that, and Pluribus Netvisor is no exception. Not only do such links consume extra ports and cables, but they also become critical points of the whole fabric. If the cluster link goes down for some reason, the switches enter a “split-brain” situation in which each assumes it is alone and starts forwarding traffic it might not be supposed to, introducing loops and storms.
On the other hand, the EVPN standard was designed with multihoming in mind, so MLAGs (or Ethernet Segments (ES) in EVPN parlance) are supported natively and without dedicated cluster links, as ES state is synchronised naturally through EVPN signalling. That was one of the major reasons to define EVPN as a hard requirement for the next generation of the fabric.
Historically, virtually all AMS-IX devices connected to the management network had used copper cables. Therefore, all leaves were based on Dell S4048T-ON switches with fixed RJ45 ports. Over time, devices needing fibre connections were added, and to onboard them we had to use the QSFP+ ports with QSFP+-to-SFP+ adapters (QSAs) and SFP+ transceivers. This did not scale, as those switches had only six QSFP+ ports, two of which were already used for spine connectivity. Moreover, the QSAs became potential points of failure.
Therefore, it was decided to use leaf switches equipped exclusively with (Q)SFP+/28 ports, supporting both copper and fibre transceivers, to provide maximum flexibility in the next generation of the management fabric.
There is little to note here, except that because the fabric was managed by a proprietary distributed controller, it could not be extended with devices running other NOSes. It is fair to mention that EVPN support was added in Netvisor 6.1.0, but version 6 was not supported by our hardware.
After a long selection process, we chose the Nokia SR Linux NOS running on 7220 IXR-D2L switches as the platform for our new management network. Among its many useful features, we found the following particularly relevant:
Our new management fabric is now based on standards: EVPN/VXLAN, namely RFC 7432, RFC 8365, and their updates. This is also our first production deployment of EVPN, providing an excellent opportunity to gain familiarity with the technology and hands-on experience with its operational aspects before we start migrating our IXP platforms from VPLS to EVPN in the future.
Even though it might seem like a downgrade after having a (distributed) controller-based fabric, EVPN simplifies the operations we usually depend on. For instance, all learned MAC addresses are now present in BGP tables, so we can still easily track them fabric-wide. Similarly, with the help of the EVPN ARP- and ND-proxy functionality, we can track IPv4 and IPv6 addresses. EVPN also takes care of establishing VXLAN tunnels, so only one device needs configuration when a VXLAN is extended to a new leaf. Moreover, considering the flexibility of the NOS, the CLI can be extended with new commands, or the entire fabric can be managed externally using either commercial or homegrown controllers and orchestrators.
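As a rough illustration of what such fabric-wide tracking can look like without a proprietary controller, the sketch below pulls the learned MAC entries from one leaf over gNMI using the pygnmi library. The hostname, credentials, and YANG path are assumptions for illustration only and may differ per SR Linux release; this is not our actual tooling.

```python
from pygnmi.client import gNMIclient

# Hypothetical leaf and credentials; the YANG path below is an assumption and
# should be checked against the SR Linux release in use.
TARGET = ("leaf-01.mgmt.example.net", 57400)
MAC_TABLE_PATH = "/network-instance[name=*]/bridge-table/mac-table/mac"

with gNMIclient(target=TARGET, username="admin", password="admin",
                skip_verify=True) as gc:
    # Request operational state only; each entry carries the MAC address and
    # the interface or VXLAN destination it was learned from, including
    # entries learned remotely via EVPN.
    result = gc.get(path=[MAC_TABLE_PATH], datatype="state")

print(result)
```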
EVPN alone addresses most of the limitations of our old fabric mentioned above. Namely, upgrades can now be done node by node, without the need to run the same NOS version fabric-wide. This allows us to upgrade devices faster and with less effort, letting us react more quickly to discovered bugs and vulnerabilities. It also allows us to upgrade the fabric in several stages, from less critical units to more critical ones, making sure that new software is stable enough before it lands on the most critical devices.
Adding new devices is not a challenge anymore. EVPN/VXLAN has already demonstrated its scalability in the data centre community.
As already mentioned, EVPN handles multihoming natively. As a result, we no longer need to explicitly form clusters from specific leaves and maintain cluster links between them. In addition, we can now offer our internal teams not only active-active but also all-active LAGs connected to more than two arbitrary leaves.
Finally, the completed migration to EVPN means that the fabric is no longer vendor-locked and, if necessary, can be extended with compliant products from other vendors.
Even though the spine links we use on the new switches are 10G-based, the switches have 8 x 100G (QSFP28) ports. Therefore, if at some point we need 100G uplinks, we can replace our 7220 IXR-D2L spines with 7220 IXR-D3L models while keeping the same leaves.
The service ports now support speeds up to 25G (SFP28), which is far more than we need today. However, it gives us an option to upgrade our server links using relatively cheap 25G transceivers in the future.
To our surprise, the migration itself went smoother and faster than we expected.
No manual configuration was required during the migration. We modelled the fabric-wide Pluribus Netvisor configuration as a single JSON file generated by a script from our migration tools. In fact, we did not even need to enable the API server on any of the Netvisor devices: delimited non-interactive CLI output requested with “parsable-delim ;” was enough to export all the needed fabric data in a machine-friendly way.
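As a hedged sketch of this export step, the snippet below turns “;”-delimited Netvisor show output into a single JSON document. The SSH host, the exact show commands, and the field lists are illustrative assumptions; the real export covered many more object types.

```python
import json
import subprocess

# Hypothetical show commands and field lists; the real migration tooling
# exported many more fabric objects than shown here.
COMMANDS = {
    "vlans": ("vlan-show format id,scope,description, parsable-delim ;",
              ["id", "scope", "description"]),
    "vxlans": ("vxlan-show format name,vxlan,vlan, parsable-delim ;",
               ["name", "vxlan", "vlan"]),
}

def run_cli(command: str) -> str:
    """Run a non-interactive Netvisor CLI command over SSH (hypothetical host)."""
    return subprocess.run(["ssh", "admin@mgmt-switch-01", command],
                          capture_output=True, text=True, check=True).stdout

def parse_delimited(output: str, fields: list[str]) -> list[dict]:
    """Turn ';'-delimited rows into dictionaries keyed by the requested fields."""
    return [dict(zip(fields, line.split(";")))
            for line in output.splitlines() if line.strip()]

fabric = {name: parse_delimited(run_cli(cmd), fields)
          for name, (cmd, fields) in COMMANDS.items()}

with open("fabric.json", "w") as fh:
    json.dump(fabric, fh, indent=2)
```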
Then, another script from our migration tools prepared a target JSON-based configuration for each switch, using the previously exported Netvisor data as its input. Later, using the zero-touch provisioning (ZTP) functionality of the new switches, we created a pipeline that could handle four to five switches at a time, each upgrading itself to our golden NOS version and applying its own target configuration based on its serial number.
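A minimal sketch of that second step could look as follows; the inventory mapping, file layout, and configuration structure are illustrative assumptions rather than our actual tooling. The per-serial files are what a ZTP web server would hand out to booting switches.

```python
import json
from pathlib import Path

# Exported Netvisor fabric data from the previous step.
fabric = json.loads(Path("fabric.json").read_text())

# Hypothetical inventory: each new switch's serial number mapped to its identity.
INVENTORY = {
    "SN0001": {"hostname": "leaf-01", "role": "leaf"},
    "SN0002": {"hostname": "leaf-02", "role": "leaf"},
}

def build_config(meta: dict, fabric: dict) -> dict:
    """Assemble an illustrative JSON-based target configuration for one switch."""
    return {
        "system": {"name": {"host-name": meta["hostname"]}},
        # Only the fabric objects relevant to this switch's role would be
        # rendered here; the full mapping is omitted for brevity.
        "vlans": fabric.get("vlans", []),
    }

out_dir = Path("ztp")
out_dir.mkdir(exist_ok=True)
for serial, meta in INVENTORY.items():
    # The ZTP server serves each file by serial number, so a booting switch
    # fetches exactly its own configuration.
    (out_dir / f"{serial}.json").write_text(
        json.dumps(build_config(meta, fabric), indent=2))
```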
Fun fact: the migration took place while we were moving to a new office, so our usual ZTP infrastructure was no longer available in the old office. However, we “exploited” the flexibility of the SR Linux switches and repurposed one of them into a normal Debian server running the DHCP and HTTP daemons needed for the ZTP process.
To perform the migration seamlessly, with minimal disruptions, we linked both fabrics at the overlay layer, meaning that from each fabric’s perspective, the other fabric was just a normal Q-tag-aware device residing in all fabric VLANs.
Each migration day began by moving the links of the scheduled leaves from one of the old spines to one of the new spines. This temporarily reduced redundancy on the old leaves but freed links for installing new leaves, keeping both old and new leaves online. The migration then proceeded cable by cable from old to new leaves. At the end of the day, the remaining spine links of the migrated leaves were moved to the new spine, restoring full redundancy. The picture below shows the idea.
Our monitoring system was ready to process performance counters from the new devices as soon as they came online. This proved very helpful: at one site, we identified and fixed a packet-loss issue caused by a dirty fibre patch cord just before leaving the site.
Changing a network design and platform is rarely the most exciting task, especially in a production environment. Yet sometimes there is no alternative. Large-scale changes like this also provide a valuable opportunity to address the shortcomings of previous designs and to prepare for the challenges of a rapidly evolving technological landscape.