An AMS-IX Story
Manager Delivery and Operations
June 28, 2021
Back in 2019, we were forced to make the decision to migrate one of our core nodes to a new location, and it goes without saying that this would be a migration of a magnitude we had never seen before. Sure, we have migrated from 10G to 100G backbones and installed newer switches and DCI. We even put our core network on a diet, shrinking it down from 6 nodes to only 2. But now we were talking about actually taking a core node, including all darkfibers attached, and moving it several kilometers away. That is a completely different ball game from the other projects. Nonetheless, it needed to happen, so we started by setting up a small “core team” to identify the major steps involved in this project. (Get it? Core Move/Team 😉)
The first question that was raised was of course where we would migrate the core node to, which turned out to be a more challenging question than originally anticipated. There were of course commercial and financial preferences, but from a technical side there were some hard requirements that needed to be met and some additional preferences that would be nice to have.
Based on these requirements, we quickly shortened our list to 3 possible locations, of which 1 received the highest preference: Digital Realty AMS17, the data tower in the Amsterdam Science Park. Financially and commercially they were simply the better choice, but technically they came out on top as well. They would supply all the space and power we need for both the core and the customer PEs, with even higher security than we requested. Fiber redundancy inside the building is no issue whatsoever, and they have more than enough routes and fiber suppliers to guarantee options on that front as well.
Technically, they met all the requirements, and they also scored some additional points on the “nice to have” list. If we set aside the nice view over the Amsterdam skyline when you get out of the elevator, the nice coffee in the canteen area and the fact that it is basically across the street from where AMS-IX was once born, there was one more large benefit to choosing this location: the possibility to connect Nikhef and Interxion AMS9 directly by campus fibers, eliminating the need for DCI or any other active equipment on these links and making the network a little more stable than it already is.
After making the decision and getting the contracts up and running, it was time to sort out the connections between all the PoP locations and the new core node. This job always takes an enormous amount of time and focus from both our engineers and the potential supplier, as we need to make sure the connections to the two core nodes are not crossed anywhere in the network; otherwise we would end up with a single point of failure. Just as we started out with multiple candidates for the location, we also ended up in talks with multiple suppliers for the darkfiber network and shortened the list very fast. It still took months to find routes that would provide the redundancy we needed, and only a single supplier was able to provide it: Relined.
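To make the “not crossed anywhere” requirement concrete, here is a minimal sketch of such a check. It is only an illustration, not the tooling used in the project: the location names, the segment representation and the function name are all hypothetical.

```python
def shared_segments(path_a, path_b):
    """Return the duct segments that appear in both paths, ignoring direction."""
    def normalise(path):
        # A segment A-B is the same physical duct as B-A.
        return {frozenset(segment) for segment in path}
    return normalise(path_a) & normalise(path_b)


# Hypothetical routes from one PoP to the two core nodes.
# Each segment is a pair of made-up location names.
to_core_1 = [("PoP-X", "Duct-A"), ("Duct-A", "Core-1")]
to_core_2 = [("PoP-X", "Duct-B"), ("Duct-B", "Core-2")]

# A bad alternative: the second route reuses the first duct out of the PoP.
crossed = [("PoP-X", "Duct-A"), ("Duct-A", "Core-2")]

diverse = shared_segments(to_core_1, to_core_2)
overlap = shared_segments(to_core_1, crossed)

print("shared segments (good design):", diverse)
print("shared segments (bad design):", overlap)
```

Any non-empty result means a single cut in that segment would take down both paths from the PoP at once, which is exactly the single point of failure described above.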
On top of the redundancy, Relined also made sure that they would splice every single fiber wherever possible, taking the maximum number of connectors out of the network, which resulted in a serious drop in light losses and in the need for amplification on the darkfibers. Again, less active equipment that could fail, making the network more resilient.
Once the darkfibers were ordered and the first ones started to get delivered, the next challenge started: cross-connects. Ordering the required cross-connects sounds like a simple task, but somehow it always ends up as a large amount of work and additional tasks, as positions are still occupied by previous customers or the cross-connects get delivered with tx/rx issues. All this caused many delays in our planning, but a bigger challenge was just waiting to happen, one that nobody could have seen coming and that would cause the biggest delay.
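A rough loss-budget calculation shows why replacing connectors with splices matters. This is a sketch under assumed figures only: the per-kilometer, per-connector and per-splice losses below are generic planning values, and the link length and counts are invented for the example, not measurements from the AMS-IX network.

```python
def link_loss_db(length_km, connectors, splices,
                 fiber_db_per_km=0.25, connector_db=0.5, splice_db=0.1):
    """Estimate end-to-end optical loss of a darkfiber in dB.

    All default loss figures are assumptions for illustration.
    """
    return (length_km * fiber_db_per_km
            + connectors * connector_db
            + splices * splice_db)


# Hypothetical 20 km route, once mostly patched and once mostly spliced.
patched = link_loss_db(20, connectors=8, splices=2)
spliced = link_loss_db(20, connectors=2, splices=8)
saving = patched - spliced

print(f"patched: {patched:.1f} dB, spliced: {spliced:.1f} dB, saving: {saving:.1f} dB")
```

Under these assumed figures the spliced route saves about 2.4 dB, which on a real link can be the difference between needing an amplifier and running the fiber passively.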
While we were dealing with the challenges of getting the darkfibers and cross-connects in place, Covid-19 hit and, soon after, the lockdowns started to be put in place. It goes without saying that this had a serious effect on the migration, and nearly every part of it had to be delayed. Suppliers were having issues sending out engineers to perform their jobs, and the increases in traffic made them busier than ever before. That same increase in traffic caused many additional backbone upgrades for the AMS-IX engineers as well, taking away resources that were planned for the migration itself. A curfew was put in place, limiting our ability to travel at night and basically making the migration impossible. On top of that, the different datacenters introduced a maximum number of people allowed inside, limiting our options even more. As you can imagine, a core switch is quite heavy and large, so it takes multiple people to put it in place and cable it up. We also saw an increase in sick leave, with people taking additional precautions when showing signs of Covid and staying away from their coworkers because of it. All in all, Covid was the main driver to postpone the migration itself until a later stage.
Another challenge we ran into was the announcement by our DCI supplier that the systems we use were no longer available for ordering. We knew they would go out of sale at some point, but due to miscommunication, we never received the news that the end-of-sale date had been brought forward until we placed our last order. This became a challenge because we had planned to build a 4 Tbit link between the old and the new core, so we could make sure that all the PoP locations had a redundant connection to our core network at all times.
For this we had increased the number of modules we had in storage, but as traffic increased at an amazing rate, we used a lot more DCI modules to increase the backbone capacity, up to the point where we also needed to take equipment from that additional stock. In the end, we did not have enough equipment left to build the link between the core nodes. This basically meant we needed to create new plans, each of them resulting in either lower redundancy during the migrations or a much longer timeline, which would take a toll on our resources.
We hate lower redundancy, so we decided to extend the timelines for the migration. But just as we were about to accept this new plan, two of our suppliers came together and saved the day. Nokia and Tallgrass heard about our issue and, within a short period of time, arranged a system capable of handling the 4 Tbit, including an offer to install and maintain the system during the migrations. Needless to say, this was a very welcome surprise.
On June 26th, during the MORE-IP event, we received word that Nokia and Tallgrass had finalised the installation and testing of the link between the two core nodes. Except for some minor outstanding tasks, we were ready to start the migrations. The new core was prepared, the first cabling was put in place as preparation and the configuration files were ready to go. We decided to start small, with a single PoP location, to test the link and configurations prior to migrating all other locations. In the same cage as the new core node, we also have the PE switches for customers inside the Digital Realty data tower, so this would be the simplest migration, with local links only, and therefore the perfect place to start. On Monday, June 7th at 00:00 the maintenance started.
NOC diverted all traffic between the Digital Realty PE switches and the EUNetworks core over to the backup path via the Global Switch core node, enabling the onsite engineers to physically disconnect the backbones towards the EUNetworks core and connect the local fibers between the PEs and the new core at Digital Realty. Once that was done, NOC enabled the connections one by one, just to make sure all links were up and running. Then came the moment of truth. Would the traffic start flowing from Digital Realty, via the interlink to EUNetworks, and from there to the rest of the network?
NOC started typing the commands needed to make traffic flow over the newly installed links, and the moment they pressed return, there it was. The first bits started flowing and the utilisation graphs started to rise. Additional checks were performed and we kept monitoring the links for a while to see if there was any strange behaviour, packet loss or errors, but none appeared. This first migration was a success!
To make sure all was good prior to starting the migration of the full network, we decided to leave this setup running for a week and to monitor it closely.
During the week, we kept on monitoring closely and no strange behaviour was noticed. No errors, no packet loss or flapping links. In short, the next migration would be a GO. During that same week of monitoring, the onsite engineers started to prepare for the next migrations. They reused the DCI equipment that was released from the first migration and started to prepare as many connections as possible for the next migration. This way they minimised the amount of work needed during the nights, giving us the ability to migrate multiple sites per night.
On top of this, it was already known that the next sites to be migrated would be Nikhef and Interxion Science Park, both of which would be connected with point-to-point connections via the campus fibers. This would release even more DCI equipment, giving us the ability to prepare even more connections for the later migrations. It was for this reason that we added one night without any migrations after we were done with Nikhef and Interxion.
On Monday, June 14th at 00:00, we started the migrations of Nikhef and Interxion Science Park. These two locations were to be connected via campus fibers, for which we arranged a total of 56 fibers and cross-connects. During the preparation and testing of these connections we came across a lot of issues and weird mappings that needed to be resolved, and even though we were sure we had it all sorted out, this still created some additional worries during the migrations. Luckily, all went as expected, and because of the countless hours of testing and preparation, there were no issues with the connections at all.
The amount of DCI equipment released by the three migrations that were now finalised was nearly enough to prepare all the migrations to come, and the onsite engineers working the day shift used the extra day we had in our planning to install and cable everything they could.
From this point forward, the physical parts of the nightly migrations would be a lot faster, but at the same time, any potential troubleshooting would be more complicated. We prepared the network in such a way that only the darkfiber connected to the multiplexer at the PoP location, which was currently connected to EUNetworks, would need to be swapped for the new darkfiber going to Digital Realty. All the cabling was prepared and tested, and the new connection was ready to be used, but given that it is a new darkfiber with a different length, these simple steps could easily turn into a headache if troubleshooting was needed.
The third night of migrations arrived and even though nobody said it out loud, you could clearly feel we were all a little bit more on edge. This night would be the first in which we would migrate to the new darkfibers and we all knew that this could end up becoming a very long night.
The first two out of three migrations this night went smoothly, with no big issues other than some tx/rx swaps, but when we started the migration of Interxion AMS5, we did encounter some delays due to unexpected behaviour on one of the installed multiplexers.
The previous darkfiber had higher losses, and therefore we had installed an active multiplexer with an amplifier module. With the new fiber this would no longer be needed, so we prepared to bypass the amplifier and connect directly to the multiplexer.
We expected the active multiplexer to act like a passive multiplexer if we did not connect it to the amplification module, but this was not the case. In short, we ended up connecting the amplifier as well and decided to investigate at a later stage, after the migrations were done.
These findings from the night were communicated to the engineers who performed the preparations during the day, and they adjusted the cabling for the other 2 locations which also had active multiplexers installed.
The 4th night of migrations went ultra smoothly, with just over 2 hours of work needed to migrate 4 locations, including transit times between the locations, and after this night we felt good. One more night to go to migrate the remaining 3 locations, and then we could start our well-deserved weekend, after which the cleanup could start.
Unfortunately, the 5th and last night did not go as smoothly. The 1st location of the night caused many issues with modules or links that did not want to come online, and it took over 4 hours to get this 1st location migrated. This meant that we still had 2 locations to do in less than 3 hours, knowing that 1 more of these locations might prove troublesome as it had some additional connections to be migrated.
Luckily, the next location went ultra smoothly again, leaving us with about 2.5 hours for the last migration. This last one did give us some minor tx/rx issues and an optic that did not feel like working anymore, but at 05:16 AM we received the best message in our chat, “All wdm lights here are green !!!!!”, marking the end of the migrations.
The migrations were done… The heaviest part of the project was successful… And in all honesty, it went smoother than anyone expected it would go, especially because of the many issues we encountered during the preparations. I think I speak for us all when I say that it was at that moment that we realised we had just hit a major milestone in one of the biggest migrations in AMS-IX history, a project with so many headaches and setbacks during preparations, but so smooth in execution. At this stage it became clear that the dedication, expertise and perfectly aligned teamwork paid off.
Of course, the project itself was in no way finished yet. We still had all the edge switches that needed to be connected back to the core node and a full management network to connect, and then we had to start the cleanup, but all of these tasks were done in the week after the migrations, during normal working hours.
Again, we did have some issues here and there, like the need for more or less attenuation due to the new darkfibers and a management network that went into chaos mode and swapped over to its backup path, but they were handled fast and adequately by our team of professional and experienced engineers!
It goes without saying that this accomplishment demanded a little celebration with a nice chocolate cake by my wife Tina. Too bad Kostas could only join virtually and needed to watch how we ate this really nice cake. (See team pictures below)