AMS-IX Port Configuration Hints

(version 1.28, 2008-07-09)

AMS-IX NOC

Amsterdam Internet Exchange, B.V.

This article gives some pointers on setting up your device when connecting to the AMS-IX. AMS-IX rules restrict the type of traffic and the number of source MAC addresses that any member is allowed to send to the exchange. The AMS-IX platform is built around photonic cross-connects (Layer 1 switches), which introduce short link flaps for customers with 10GE connections.

This article describes how to prevent those flaps from affecting your sessions and how to configure your interface so that it sends only allowed traffic towards the exchange.


Table of Contents
1. Introduction
1.1. Definition of Terms
2. The AMS-IX Topology
2.1. General 10GE Specifics
3. General Configuration Recommendations
3.1. IPv4 ARP / IPv6 Neighbor Timeout
3.2. Peering LAN Prefix
3.3. BGP Routing
4. Allowed Traffic Types and Configurations
4.1. Physical L2 Topology
4.2. Commonly Seen Illegal Traffic and Setup
5. Cisco Configuration Hints
5.1. Global Config
5.2. Interface Config
5.3. Layer 2 Config
5.4. Cisco Aggregated Links
5.5. Cisco 10GE Specifics
5.6. IPv6 Config
6. Extreme Networks Configuration Hints
6.1. L2 Configuration
6.2. L3 Configuration
7. Force10 Configuration Hints
7.1. Force10 10GE Specifics
8. Foundry Configuration Hints
8.1. Foundry Aggregated Links
8.2. Foundry 10GE Specifics
9. Juniper Configuration Hints
9.1. Unicast BGP Configuration
9.2. IPv4 ARP Cache Timeout
9.3. Juniper Aggregated Links
9.4. Juniper 10GE Specifics
10. Linux Configuration Hints
10.1. ARP Filtering and Source Routing
10.2. IPv4 ARP Cache Timeout
10.3. IPv6 Neighbor Cache Timeout
10.4. RP Filter Setting
10.5. Running the “sysctl” Commands at Boot
10.6. Linux Aggregated Links
11. Riverstone Configuration Hints
12. Acknowledgements

1. Introduction

The Amsterdam Internet Exchange operates as a shared Layer 2 (L2) Ethernet infrastructure. Large Ethernet LANs require that more or less everyone plays by the same set of rules. In other words, they can be quite sensitive to misbehaviour.

In order to improve the stability of the Exchange, AMS-IX has defined a set of rules to which every member's connection must adhere: the Technical Specifications.

Not everybody immediately grasps the subtleties of configuring equipment to adhere to the rules, so this document tries to fill in some blanks and provide examples and hints for the most common equipment.


2. The AMS-IX Topology

The AMS-IX network is built as a redundant hub & spoke topology using Glimmerglass photonic cross-connects and Foundry Networks switches. A hub & spoke topology has one core switch and multiple access switches connected to it (see figure below).

Naturally the core and each connection between access and core switch are single points of failure. Therefore AMS-IX implemented a redundant hub & spoke topology (see figure below) which is built around two sets of core switches: one at Global Switch and one at EUNetworks. Only one set of switches is active at a time (either the red or the blue connections), which allows us to swap between the topologies in case of maintenance or a failure on one of the sides.

Customers up to 1GE are directly connected to Foundry Networks edge switches, available at each location. One can connect with 100Mb via UTP or 1GE via multimode or singlemode fiber. Fiber connections are supported using SX or LX optics, and in some cases also LH-A or LH-B.

10GE customers connect to the AMS-IX platform via Glimmerglass photonic cross-connects (Layer 1 switches). These Layer 1 switches connect the customer to one of the two available Ethernet switches; which one depends on the part of the topology that is active at that time. The 10G Ethernet access switches are locally available at each location and one can connect with either ER or LR optics.


3. General Configuration Recommendations


4. Allowed Traffic Types and Configurations

The Technical Specifications state the following:

  1. There are only three ethertypes allowed:

    1. 0x0800 - IPv4

    2. 0x0806 - ARP

    3. 0x86dd - IPv6

    This implies IEEE 802.3 compliance, not 802.2, so no LLC encapsulation!

  2. Only one MAC address allowed on a port, i.e. all frames sent towards the AMS-IX should have exactly one unique MAC address.

  3. The only non-unicast traffic allowed is:

    • Broadcast ARP.

    • Multicast ICMPv6 Neighbour Discovery (ND) packets. (NOTE: this does not include Router Advertisement (ND-RA) packets!)

  4. AMS-IX member equipment should only reply to ARP queries for IP addresses of their directly connected AMS-IX interface. In other words, proxy ARP is not allowed.

  5. Traffic for link-local protocols is not allowed, except for ARP and IPv6 ND (see above).

  6. IP packets addressed to AMS-IX peering LAN's directed broadcast address shall not be automatically forwarded to AMS-IX ports.

  7. The speed and duplex setting of 10baseT and 100baseTX ports must be statically configured, i.e. auto-negotiation should be disabled.
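
As an illustration of rules 1 and 3 above, the following Python sketch classifies a single frame by its ethertype and destination MAC address. It is not how the AMS-IX switches enforce the rules, and it ignores the other rules (source MAC count, proxy ARP, link-local protocols). The function names are hypothetical, and the choice to accept only Neighbour Solicitation (type 135) and Neighbour Advertisement (type 136) as multicast ND is our assumption; the rules above only state that Router Advertisements are excluded.

```python
# Sketch only: check a frame against the ethertype and non-unicast rules.
ALLOWED_ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

# Assumption: the multicast ND messages a router legitimately sends are
# Neighbour Solicitation (135) and Neighbour Advertisement (136).
# Router Advertisement (134) is explicitly not allowed.
ALLOWED_MULTICAST_ND_TYPES = {135, 136}


def is_group_address(mac):
    """True for broadcast/multicast MACs (group bit set in the first octet)."""
    return int(mac.split(":")[0], 16) & 1 == 1


def frame_allowed(ethertype, dst_mac, icmpv6_type=None):
    """Return True if the frame passes the ethertype and non-unicast rules."""
    dst = dst_mac.lower()
    if ethertype not in ALLOWED_ETHERTYPES:
        return False                      # e.g. LLDP, CDP, STP, MOP
    if not is_group_address(dst):
        return True                       # unicast IPv4, ARP or IPv6
    if dst == BROADCAST_MAC:
        return ethertype == 0x0806        # only ARP may be broadcast
    # Any other group address is multicast: only ICMPv6 ND is accepted.
    return ethertype == 0x86DD and icmpv6_type in ALLOWED_MULTICAST_ND_TYPES
```

For example, frame_allowed(0x0806, "ff:ff:ff:ff:ff:ff") accepts a broadcast ARP request, while an IPv4 multicast frame such as an OSPF hello to 01:00:5e:00:00:05 is rejected.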


4.1. Physical L2 Topology

The AMS-IX rules dictate that only one MAC address is allowed behind a port. This means that you have to be extremely careful when connecting a device that can act as a L2 device. In general, we do not recommend using L2 devices between a member's router and the AMS-IX switch, except when used as a media converter.

The reason for allowing only one MAC address is that we want no additional L2 network behind the AMS-IX ports. Extended L2 networks are not under the control of the AMS-IX, but instabilities in a L2 network behind the AMS-IX switches can and typically do have a negative impact on the whole exchange. Forwarding loops and spanning tree topology changes are good examples of this. By enforcing the one-MAC-address-per-port rule, we effectively prevent forwarding loops and STP traffic from intermediate L2 devices.

In short, an intermediate L2 device may only bridge frames from the member's router to the AMS-IX port (so we see only one MAC address) and should otherwise be completely invisible. No connected device should bridge frames from other devices onto the AMS-IX, or talk STP on its AMS-IX interface.
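
If you capture traffic on your own side of the link, a quick way to verify the one-MAC-address rule is to list every source MAC address other than that of your router. The Python sketch below assumes you have already extracted the source addresses from a capture; the function names are hypothetical. It accepts colon, dash and Cisco-style dotted notation, since captures and router output tend to mix these.

```python
import re


def normalize_mac(mac):
    """Convert any common MAC notation to lower-case aa:bb:cc:dd:ee:ff."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def extra_source_macs(router_mac, observed_macs):
    """Return the observed source MACs that are not the router's own."""
    expected = normalize_mac(router_mac)
    seen = {normalize_mac(mac) for mac in observed_macs}
    return sorted(seen - {expected})
```

An empty result means only your router's address was seen; anything else is a frame that an intermediate L2 device (or a misbehaving router) put on the wire.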


4.2. Commonly Seen Illegal Traffic and Setup

Any traffic other than the types mentioned in the previous section is deemed to be illegal traffic. In this section we will list some of the more common types of violations we see at the AMS-IX and give some arguments as to why it is considered unwanted.


5. Cisco Configuration Hints

Cisco's philosophy seems to be similar to that of some PC OS vendors: enable as many protocols and features as possible by default, so the device works out-of-the-box in most situations. Unfortunately, this means that a lot of unnecessary features are turned on that, while harmless in LAN or corporate environments, can cause undesired traffic on an Internet exchange.

Typical things that need to be disabled are: autoconfiguration protocols (DHCP, BOOTP, TFTP config download over the AMS-IX interface), CDP, DEC MOP, IP redirects, IP directed broadcasts, proxy ARP, IPv6 Router Advertisements, keepalive.

Intermediate switches or hybrid devices will also need to disable VTP, STP, etc.


5.1. Global Config

! Do not run a DHCP server/relay agent
no service dhcp

! Older IOS versions require this instead of the above.
no ip bootp server

! Do not download configs through TFTP
no service config

! Do not run CDP
no cdp run

5.2. Interface Config

! Don't do redirects -- if they don't know 
! how to route properly, tough luck!
no ip redirects

! Don't run proxy ARP on your AMS-IX interface
no ip proxy-arp

! Don't run CDP on your AMS-IX interface
no cdp enable

! Directed broadcasts are evil.
no ip directed-broadcast

! Disable the DEC drek if you haven't done so globally yet.
no mop enable

! For (Fast)Ethernet: no auto-negotiation on your connection.
! no negotiation auto
! duplex half
duplex full

! L2 keepalives are useless on the AMS-IX
no keepalive

5.3. Layer 2 Config

It is difficult to give a complete guide for Cisco products, because of the many different types of devices and (IOS) software versions. When in doubt, consult your documentation.


5.3.1. 29xx and 35xx Series

If you use a Cisco Layer 2 device (such as the 2900 and 3500 series), you have to turn off VTP (VLAN Trunking Protocol), DTP (Dynamic Trunking Protocol), LLDP, and UDLD.

In global config mode:

vtp mode transparent
!
no spanning-tree vlan 1200
! If you don't need LLDP, disable globally
no lldp run
! If you don't need CDP, disable globally
no cdp run
!
vlan 1200
 name AMS-IX
!
interface IfIdent
 description Interface to AMS-IX
 switchport access vlan 1200
 switchport mode access
 switchport nonegotiate
 no keepalive
 speed nonegotiate
 no udld enable
 ! If CDP has not been disabled globally:
 no cdp enable
 ! If LLDP has not been disabled globally:
 no lldp receive
 no lldp transmit
 ! If you do not want to shut off STP:
 spanning-tree bpdufilter enable
end

5.3.2. Catalyst 6500 Series

CatOS and IOS are different beasts, so for Catalyst switches, the following applies:

set vtp mode off
set port name IfIdent My AMS-IX Port
set cdp disable IfIdent
set udld disable IfIdent
set trunk IfIdent off dot1q
set spantree bpdu-filter IfIdent enable
set vlan 1200 name My_AMS-IX_Vlan
set vlan 1200 IfIdent

If, for some reason, you cannot afford to turn off VTP globally, the only way to turn it off on individual ports seems to be by using l2pt:

set port l2protocol-tunnel IfIdent vtp enable

Depending on your CatOS platform, you may or may not be able to do this.


5.4. Cisco Aggregated Links

5.4.1. Catalyst 6500 Series

Configure the port-channel mode as on, not as negotiate or desirable, because the AMS-IX switches have neither LACP nor PAgP enabled.

Some modules do not support more than 1 Gbps of traffic under certain conditions across an aggregated link. Please see the Cisco documentation for more details.

Load-balancing over four ports may result in an unequal distribution due to bug CSCsg80948.

! Here is an example configuration:
interface GigabitEthernet1/1
 description AMS-IX Link 1
 no ip address
 no ip redirects
 no ip proxy-arp
 no keepalive
 no cdp enable
 channel-group 1 mode on
!
interface GigabitEthernet1/2
 description AMS-IX Link 2
 no ip address
 no ip redirects
 no ip proxy-arp
 no keepalive
 no cdp enable
 channel-group 1 mode on
!
interface Port-channel1
 description AMS-IX aggregated link
 ip address 195.69.14x.y 255.255.254.0
 no ip redirects
 no ip proxy-arp
 no keepalive
!

5.4.2. GSR Series

Do not set a static MAC address on the Port-channel interface. This causes CEF inconsistencies and other assorted failures.

Link aggregation and IPv6 do not seem to play well together. Cisco advises against trying this.

Some changes (such as reloading a linecard that contains the first port in the bundle) will likely result in a different MAC address being chosen for the aggregated link. Port security on the AMS-IX switches will then keep your ports non-functional, and you will have to contact the AMS-IX NOC to have this fixed.

Some restrictions apply to what features are supported on link bundles (e.g. sampled NetFlow only on ISE/Engine4+; no uRPF). Also not all line cards support link bundling, and if traffic towards AMS-IX comes in on such an interface you will experience suboptimal load-balancing. Please see the Cisco documentation for more details.

Support for link bundling on Engine 5 linecards will come in 12.0(33)S.

Cisco Engineering have a special train called "Phase 3" (lb-eft-ph3) that is purported to also provide functionality such as MAC address accounting for Port-Channel interfaces. This seems to have been integrated into 12.0(32)S, but IPv6 does not seem to be supported yet.

Below follows a list of Cisco Bug IDs (ddts) related to link aggregation that you need to consider when choosing an appropriate IOS image.

  • CSCee27396

    present in 12.0(26)S1; fixed in 12.0(26)S3, 12.0(27)S2, 12.0(28)S1, 12.0(30)S

    Symptoms: Over 90% CPU usage by CEF Scanner on all linecards and %TFIB-7-SCANSABORTED errors occur when configuring a link bundle. Also, the router sends traffic to MAC addresses taken from its ARP table seemingly at random, instead of to the appropriate next-hop's MAC address.

  • CSCef12828

    present in post-CSCee27396; fixed in 12.0(26)S4, 12.0(27)S3, 12.0(28)S1, 12.0(30)S

    Symptoms: When traffic passes through a router, the router blocks traffic for certain prefixes behind a port-channel link.

  • CSCdz33664

    present in 12.0(25)S3, 12.0(26)S1, 12.0(27)S2, 12.0(28)S; fixed in 12.0(25)S4

    Symptoms: An HSRP state change on any Engine2 interface causes a microcode bundle flap on all other Engine2 linecards, preventing load balancing from working because vanilla microcode gets loaded.

  • CSCee81071

    present in 12.0(26)S3, 12.0(27)S2, 12.0(29)S

    Symptoms: Router sends Ethernet frames with a source MAC address of beef.f00d.beef and destination MAC address f00d.beef.f00d (which is the pattern scribbled in unallocated memory in GSR linecards), with what looks to be a legitimate payload of transit traffic. This is one of the symptoms of CSCee27396.

  • CSCeb38014

    present in 12.0(26)S5; fixed in 12.0(26)S5, 12.0(27)S

    Symptoms: The BGP Router process flushes the BGP tables for each peer when you change one neighbor's description. This pegs the GRP CPU at 99% for quite a while.

  • CSCeg31951

    present in 12.0(31)S; fixed in 12.0(31)S2 (CSCei53226)

    IOS (at least in the PRP code) places each individual public peer in its own update-group if remove-private-as is configured on a peer. Needless to say, this scales badly for a router connected to an Internet exchange. (Try "show ip bgp replication".)

A collection of hearsay follows for recent IOS images for the GSR/PRP regarding link aggregation. AMS-IX does not run any GSRs. Please take this information with appropriately-sized grains of salt.

  • 12.0(24)S2 is not advisable (not many specifics known but they include CSCef89562 and CSCee33045)

  • 12.0(24)S6 boots but load-balancing is completely off

  • 12.0(25)S* until S3 have CSCdz33664

  • 12.0(26)S* until S4 have CSCef89562, where Engine4+ linecards can have continuously flapping interfaces, but is also somewhat required for Quadra linecards

  • 12.0(26)S3 has CSCee27396 integrated but not CSCef12828, which leads to traffic blackholing

  • 12.0(27)S* until S3 have CSCef89562 as well

  • 12.0(27)S1 has a problem where it sends traffic to random destinations

  • 12.0(27)S2 has CSCee27396 integrated but not CSCef12828

  • 12.0(27)S4 reportedly works reasonably well on PRP2s

  • 12.0(28)S1 has problems with Engine2 linecards (CSCef78098) and Engine4+ (CSCef89562)

  • 12.0(28)S2 reportedly works better but still sometimes emits beef.f00d.beef frames on normal ports with only an IPv6 address configured

  • 12.0(30)S has only been observed to exhibit CSCef12828-like symptoms in conjunction with broken hardware, and also to still sometimes emit frames from MAC beef.f00d.beef.

  • Routers occasionally still send out frames with beef.f00d.beef as MAC source address on interfaces with an IPv6 but no IPv4 address configured, even on regular links.

  • Due to the massive amount of feature requests there will be both a 12.0(32)S and a new 12.0(32)SY train.

You can check for incorrect next-hops by attaching to the linecard and executing show controllers rewrite and show adjacency internal and comparing the two rewrite strings for a certain peer's IPv4 address (suffix the commands with | begin 195.69.14a.b). The first six bytes of the returned long hex string should be the peer's MAC address, and equal for all three occurrences.
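
The comparison described above can be sketched in Python as follows. We assume you have copied the rewrite strings from the command output by hand and that each is a plain hex string starting with the destination MAC address; the exact output format differs per IOS release, so treat this as an illustration only. The function names are hypothetical.

```python
import re


def rewrite_dst_mac(rewrite_hex):
    """Return the first six bytes (12 hex digits) of a rewrite string."""
    digits = "".join(rewrite_hex.split()).lower()
    if len(digits) < 12:
        raise ValueError("rewrite string is shorter than six bytes")
    int(digits[:12], 16)  # raises ValueError if these are not hex digits
    return digits[:12]


def next_hop_consistent(peer_mac, rewrite_strings):
    """True if every rewrite string starts with the peer's MAC address."""
    want = re.sub(r"[^0-9a-fA-F]", "", peer_mac).lower()
    return all(rewrite_dst_mac(s) == want for s in rewrite_strings)
```

A result of False for a peer means at least one of the rewrite strings points at a different MAC address than the one in your ARP table.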

! An example configuration follows:
!
interface Port-channel1
 description AMS-IX Aggregated Link
 ip address 195.69.14x.y 255.255.254.0
 no ip redirects
 no ip directed-broadcast
 no ip proxy-arp
 channel-group minimum active 1
 no channel-group bandwidth control-propagation
 hold-queue 150 in
!
interface GigabitEthernet1/2/1
 no keepalive
 no negotiation auto
 channel-group 1
 no cdp enable
!
interface GigabitEthernet1/2/2
 no keepalive
 no negotiation auto
 channel-group 1
 no cdp enable
!

Specifying a hold-queue value is optional, but setting it to the number of ports in the aggregated link multiplied by 75 is advised (hence 150 for the two-port bundle in the example above).

show interfaces Port-channel 1 will display keepalives enabled even though they are not; also, the BIA (burnt-in address, shown as 0000.0000.0000) can be ignored.

Please contact the AMS-IX NOC if you disable autonegotiation on Gigabit Ethernet ports as we may have to explicitly configure our switch for this.


5.5. Cisco 10GE Specifics

IOS supports no bgp fast-external-fallover and event dampening. The no bgp fast-external-fallover command tells the device not to act immediately on link flaps but to wait for the BGP hold timer to expire before resetting sessions.

Newer versions of Cisco IOS even support ip bgp fast-external-fallover deny in a per-interface context.

Note that in practice we have found that the previously advised carrier-delay does not work as expected on Cisco equipment. We suggest you disable fast-external-fallover instead.


5.6. IPv6 Config

Responses to ICMPv6 multicast listener queries result in bursts of ICMPv6 multicast listener reports. To prevent this, configure no ipv6 mld router in interface context. Some other per-interface commands we recommend on a Cisco device:

! disable ICMPv6 multicast listener reports
no ipv6 mld router

! disable IPv6 multicast forwarding
no ipv6 mfib forwarding

! v6 ND-RA is unnecessary and undesired
ipv6 nd suppress-ra

! disable PIM on a specified interface
no ipv6 pim 

6. Extreme Networks Configuration Hints

Caution: Updating Firmware in an EAPS Environment

When updating firmware in an Extreme Networks EAPS environment, be sure to temporarily disable your AMS-IX port(s). TFTP file transfers may cause EAPS instabilities resulting in bogus traffic. This is likely to trip the port security on the AMS-IX switches, which may result in 10 minutes of downtime.

Most people who use Extreme equipment do not have problems with their AMS-IX connections, but some do. We would appreciate feedback from people running Extreme equipment on how they configure their AMS-IX facing side.


6.1. L2 Configuration

The configuration fragment below shows how to configure an intermediate L2 switch, which is also part of an EAPS ring. Port 1 is connected to the AMS-IX switch. Ports 2 and 3 are in the ring. The router is somewhere in that ring, in the “amsix” VLAN.

create vlan "ring" 
configure vlan "ring" tag 1200     # VLAN-ID=0x4b0  Global Tag 3
configure vlan "ring" qosprofile "QP8" 
configure vlan "ring" add port 2 tagged
configure vlan "ring" add port 3 tagged

create vlan "amsix" 
configure vlan "amsix" tag 1700     # VLAN-ID=0x6a4  Global Tag 9
configure vlan "amsix" add port 1 untagged
configure vlan "amsix" add port 2 tagged
configure vlan "amsix" add port 3 tagged

configure port 1 auto off speed 1000 duplex full
configure port 2 auto off speed 1000 duplex full
configure port 3 auto off speed 1000 duplex full

disable edp port 1
disable igmp snooping 
disable igmp snooping with-proxy

create eaps "ring-eaps"
configure eaps "ring-eaps" mode transit
configure eaps "ring-eaps" primary port 2
configure eaps "ring-eaps" secondary port 3
configure eaps "ring-eaps" add control vlan "ring"
configure eaps "ring-eaps" add protect vlan "amsix"
enable eaps "ring-eaps"

6.2. L3 Configuration

The configuration fragment below shows the relevant configuration information for a L3-only device. As in the previous example, port 1 is connected to the AMS-IX and is configured in the “amsix” VLAN (untagged).

#
# Config information for VLAN amsix.
#
create vlan "amsix" 
configure vlan "amsix" tag 1200     
configure vlan "amsix" protocol "IP"
configure vlan "amsix" ipaddress 195.69.14X.Y 255.255.254.0 
configure vlan "amsix" add port 1 untagged
#
configure port 1 display-string "AMS-IX"
disable edp port 1
#
enable ipforwarding vlan "amsix"
disable ipforwarding broadcast vlan "amsix"
disable ipforwarding fast-direct-broadcast vlan "amsix"
disable ipforwarding ignore-broadcast vlan "amsix"
disable ipforwarding lpm-routing vlan "amsix"
disable isq vlan "amsix"
disable irdp vlan "amsix"
disable icmp unreachable vlan "amsix"
disable icmp redirects vlan "amsix"
disable icmp port-unreachables vlan "amsix"
disable icmp time-exceeded vlan "amsix"
disable icmp parameter-problem vlan "amsix"
disable icmp timestamp vlan "amsix"
disable icmp address-mask vlan "amsix"
disable subvlan-proxy-arp "amsix"
configure ip-mtu 1500 vlan "amsix"
#
# IP Route Configuration
#
configure iproute add blackhole default
disable icmpforwarding vlan "amsix" 
disable igmp vlan "amsix"

7. Force10 Configuration Hints

There isn't much to configure on Force10 routers. The Network Operations Guide and various pages in the Team Cymru Document Collection provide useful information on Force10 router configuration and management.

! Disable proxy ARP on your AMS-IX interface
Force10(conf)#interface tengigabitethernet 0/0
Force10(conf-if-te-0/0)#no ip proxy-arp

!  Disable IPv6 ND RAs
Force10(conf-if-te-0/0)#ipv6 nd suppress-ra

! The default ARP timeout is 4 hours, but can be changed with this command
Force10(conf)#interface tengigabitethernet 0/0
Force10(conf-if-te-0/0)#arp timeout minutes

8. Foundry Configuration Hints

The following fragment of configuration gives an idea of how to configure a Foundry (BigIron) device. Depending on the actual role of the device (router or switch between router and AMS-IX) and the type of code loaded into the device you may need to mix and match a little here.

! Define a single-port VLAN for the AMS-IX port
vlan number name "AMS-IX" by port
no spanning-tree
untagged ethernet i/f

! Configure the AMS-IX interface
interface ethernet i/f
 port-name "AMS-IX"

! Behave as a router.
 route-only
 no spanning-tree

! Don't do IPv6 ND-RA (Router Advertisements)
 ipv6 nd suppress-ra

! No weird discovery proto, please.
 no vlan-dynamic-discovery

! IP address
 ip address 195.69.14X.Y 255.255.254.0

! No redirects
 no ip redirect
 no ipv6 redirect

! AMS-IX recommends 2 hour ARP timeouts
 ip arp-age 120

! For fast-ethernet: no autoconfig.
 speed-duplex 100-full

9. Juniper Configuration Hints

For Juniper routers, there isn't much to disable. The Juniper Documents from qOrbit Technologies contain useful hints on how to set up your Juniper router.

Caution: IGMP Bug (PR/20343) in JunOS versions up to 5.3R4

There's a bug in JunOS versions up to 5.3R4 that will cause a Juniper router to emit IGMP packets on all its interfaces, even when IGMP is disabled. The only way to stop your router from transmitting IGMP is to configure outgoing packet filters on your AMS-IX interface(s).


9.1. Unicast BGP Configuration

Make sure to exchange only unicast routes in the unicast ISP peering LAN by explicitly adding the following statement to all neighbors, groups and prefix-limits:

set family inet unicast

Caution: Be thorough with family inet unicast

If even one of the neighbors, groups or prefix-limits is defined with a family inet “any”, you'll enable multicast and turn on MBGP.


9.2. IPv4 ARP Cache Timeout

Juniper's default ARP cache timeout is 20 minutes (by comparison: Cisco's default ARP cache timeout is 4 hours, which fits AMS-IX's relatively static environment much better).

To reduce the amount of unnecessary broadcast traffic, we recommend setting the ARP cache timeout on Juniper routers to 4 hours. A recipe for this follows:

> configure
Entering configuration mode

[edit]
you@juniper# edit system arp

[edit system arp]
you@juniper# set aging-timer 240

[edit system arp]
you@juniper# show | compare
[edit system arp]
+ aging-timer 240;

[edit system arp]
you@juniper# commit and-quit
commit complete
Exiting configuration mode

9.3. Juniper Aggregated Links

9.3.1. M-Series

We have encountered no issues with aggregated links and JunOS (M40, M160, T320). JUNOS releases prior to 6.0 required VLAN tagging on aggregated interfaces. This limitation has since been removed. An example configuration follows:

---
[edit]
niels@junix# show chassis
aggregated-devices {
    ethernet {
        device-count 1;
    }
}
---
[edit]
niels@junix# show interfaces ge-2/1/0
gigether-options {
    802.3ad ae0;
}

[edit]
niels@junix# show interfaces ge-3/1/0
gigether-options {
   802.3ad ae0;
}
---
[edit]
niels@junix# show interfaces ae0
description "AMS-IX";
unit 0 {
   family inet {
       filter {
           input AMSIX-in;
           output AMSIX-out;
       }
       address 195.69.14x.y/23;
   }
   family inet6 {
       address 2001:07F8:1::A50a:bcde:1/64;
   }
}
---

Additionally and optionally you can configure more granular load balancing:

---
routing-options {
    autonomous-system abcde;
    forwarding-table {
        export [ load-balance ];
    }
}
policy-options {
    policy-statement load-balance {
        then {
            load-balance per-packet;
        }
    }
}
forwarding-options {
    hash-key {
        family inet {
            layer-3;
            layer-4;
        }
    }
}
---

In case that is not granular enough, you can modify the hash-key algorithm with some undocumented options in JunOS 7.x and up:

---
hash-key {
    family inet {
        layer-3 {
            destination-address;
            protocol;
            source-address;
        }
        layer-4 {
            destination-port;
            source-port;
            type-of-service;
        }
    }
}
---

You can also set minimum-links on the aggregated interface so that the bundle goes down when the remaining links can no longer carry the amount of traffic you plan on sending across it. For example, for a 2-port aggregated link carrying a sustained 1.2 Gbps, drop the bundle when only one link remains:

---
aggregated-ether-options {
    minimum-links 2;
    link-speed 1g;
}
---
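
The minimum-links value in the fragment above follows from simple arithmetic: it is the smallest number of member links that can still carry your sustained traffic. A minimal Python sketch of that calculation (the function name is hypothetical, and it ignores the fact that real load-balancing is never perfectly even, so you may want some headroom):

```python
def minimum_links(sustained_mbps, link_speed_mbps):
    """Smallest number of links whose combined speed covers the traffic."""
    if sustained_mbps <= 0 or link_speed_mbps <= 0:
        raise ValueError("traffic and link speed must be positive")
    # Integer ceiling division: round up to whole links.
    return -(-sustained_mbps // link_speed_mbps)
```

With 1200 Mbps sustained over 1000 Mbps links this yields 2, matching the example: once only one link remains, the bundle is taken down instead of congesting the surviving link.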

In a situation with load-balancing over multiple IP interfaces (not AMS-IX), hashing on layer-4 information makes traceroute output more confusing to novices: because TCP/UDP port numbers and ICMP checksums are also included in the algorithm, packets may seem to "bounce" between interfaces.

On an IP1 load-balance per-packet really means per-packet; on an IP2 it actually works per flow, which is preferable.


10. Linux Configuration Hints

We are not aware of any major issues with Linux boxes used as routers, and they seem to be pretty rare on the Exchange. Having said that, there are a few parameters that can (and usually should) be tuned:

  1. ARP filtering & source routing

  2. ARP cache timeout

  3. Reverse Path (RP) filter

For more information on tuning your Linux system for routing, see the Linux Advanced Routing & Traffic Control HOWTO.


10.1. ARP Filtering and Source Routing

The Linux approach to IP addresses is that they belong to the system, not any single interface. As a result, Linux hosts have a default behaviour that is different from most other systems: interfaces semi-promiscuously answer for all IP addresses of all other interfaces. Example:

In this example, host tuxco is a Linux box with a peering connection on eth0 (192.168.1.1/24) and a backbone link on eth1 (10.0.0.1/24).

When host kannix (192.168.1.2) sends an ARP query for 10.0.0.1 it will get a reply from tuxco's eth0 interface!

In other words, a Linux host will answer to ARP queries coming in on any interface if the queried address is configured on any of its interfaces. The idea behind this is that an IP address belongs to the system, not just a single interface. Although this may work well for server or desktop systems, it is not desirable behaviour in a router system. One reason is that it is a limited version of proxy-arp, which is forbidden on the AMS-IX peering LAN. Another reason is that two separate routers could potentially answer ARP queries for the same RFC1918 address.


10.1.1. Fixing ARP

The ARP behaviour can be fixed by using arp_ignore and arp_announce on the WAN interface:

tuxco# sysctl -w net/ipv4/conf/eth0/arp_ignore=1
tuxco# sysctl -w net/ipv4/conf/eth0/arp_announce=1

10.2. IPv4 ARP Cache Timeout

The ARP cache timeout on Linux-based routers should be changed from the default, especially if you have a large number of peers. This parameter can be tuned by setting the appropriate procfs variable through the sysctl interface. The Linux arp(7) manual says:

[ … ]

SYSCTLS

ARP supports a sysctl interface to configure parameters on a global or per-interface basis. The sysctls can be accessed by reading or writing the /proc/sys/net/ipv4/neigh/*/* files or with the sysctl(2) interface. Each interface in the system has its own directory in /proc/sys/net/ipv4/neigh/. The setting in the ‘default’ directory is used for all newly created devices. Unless otherwise specified time related sysctls are specified in seconds.

[ … ]

base_reachable_time

Once a neighbour has been found, the entry is considered to be valid for at least a random value between base_reachable_time/2 and 3*base_reachable_time/2. An entry's validity will be extended if it receives positive feedback from higher level protocols. Defaults to 30 seconds.

This means that Linux systems keep ARP entries in their cache for some time between 15 and 45 seconds (and yes, the average works out to 30 seconds). This is not very high. In fact, it is lower than the typical BGP KEEPALIVE interval and may thus result in excessive ARPs.

We suggest a timeout of at least two hours for ARP entries on your AMS-IX interface. Because an entry's validity can be as short as base_reachable_time/2, you'd have to set base_reachable_time to 2 x 2 hours = 4 hours (14400 seconds).

tuxco1# sysctl net.ipv4.neigh.ifname.base_reachable_time
net.ipv4.neigh.ifname.base_reachable_time = 30

The above command tells you that the ARP cache timeout is 30 seconds average. To change it so it's between 2 and 6 hours, use the following command:

tuxco1# sysctl -w net.ipv4.neigh.ifname.base_reachable_time=14400
net.ipv4.neigh.ifname.base_reachable_time = 14400

Here ifname is the name of the interface that connects to AMS-IX. You can also use “default” here, but that may have undesired side-effects for your other interfaces.
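
The arithmetic behind these numbers can be written out as a short Python sketch, using the bounds quoted from arp(7) above (the function names are hypothetical):

```python
def reachable_window(base_reachable_time):
    """Validity range (min, max) in seconds for a given base_reachable_time."""
    return (base_reachable_time / 2, 3 * base_reachable_time / 2)


def base_for_minimum_timeout(minimum_seconds):
    """base_reachable_time needed so entries live at least minimum_seconds."""
    return 2 * minimum_seconds
```

The default of 30 gives a window of 15 to 45 seconds. A minimum of two hours (7200 seconds) requires a base of 14400, which in turn gives a window of 7200 to 21600 seconds, i.e. between 2 and 6 hours.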


10.3. IPv6 Neighbor Cache Timeout

As with the IPv4 ARP cache, Linux systems tend to set the lifetime of the IPv6 neighbor cache quite short as well. The lifetime is controlled in a similar way as for IPv4 ARP:

tuxco1# sysctl net.ipv6.neigh.ifname.base_reachable_time
net.ipv6.neigh.ifname.base_reachable_time = 30

tuxco1# sysctl -w net.ipv6.neigh.ifname.base_reachable_time=14400
net.ipv6.neigh.ifname.base_reachable_time = 14400

10.5. Running the “sysctl” Commands at Boot

The various system parameters discussed above can be set at boot time by adding them to a file such as /etc/sysctl.conf. The exact name, location and very existence of this file typically depend on the Linux distribution in use, but both Debian and Red Hat/Fedora use /etc/sysctl.conf:

# file: /etc/sysctl.conf
# These settings should be duplicated for all interfaces that are
# on a peering LAN.

### Typical stuff you really want on a router

# Fix the "promiscuous ARP" thing...
net/ipv4/conf/ifname/arp_ignore=1
net/ipv4/conf/ifname/arp_announce=1

# Turn off RP filtering to allow asymmetric routing:
net/ipv4/conf/ifname/rp_filter=0

# Multiple (non-aggregated) interfaces on the same peering LAN.
# READ THE MANUAL FIRST!
#net/ipv4/conf/ifname/arp_filter=1

### Keep the AMS-IX ARP Police happy. :-)

net/ipv4/neigh/ifname/base_reachable_time=14400
net/ipv6/neigh/ifname/base_reachable_time=14400
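
Since these settings have to be duplicated for every interface on a peering LAN, you may prefer to generate them. The Python sketch below renders the entries from the fragment above for a list of interface names; the function name and the interface names in the usage note are hypothetical, and the commented-out arp_filter entry is deliberately left out.

```python
# Per-interface settings taken from the sysctl.conf fragment above.
PEERING_LAN_SETTINGS = [
    ("net/ipv4/conf/{ifname}/arp_ignore", 1),
    ("net/ipv4/conf/{ifname}/arp_announce", 1),
    ("net/ipv4/conf/{ifname}/rp_filter", 0),
    ("net/ipv4/neigh/{ifname}/base_reachable_time", 14400),
    ("net/ipv6/neigh/{ifname}/base_reachable_time", 14400),
]


def render_sysctl_conf(ifnames):
    """Return sysctl.conf text with one block of settings per interface."""
    lines = []
    for ifname in ifnames:
        lines.append(f"# Peering LAN interface {ifname}")
        for key, value in PEERING_LAN_SETTINGS:
            lines.append(f"{key.format(ifname=ifname)}={value}")
    return "\n".join(lines) + "\n"
```

Calling print(render_sysctl_conf(["eth0", "eth2"])) produces two blocks of five settings each, which can be appended to /etc/sysctl.conf after review.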

Caution: Modules must be loaded before sysctl is executed

On Debian systems, kernel modules for some network interfaces (e.g. 10GE cards) are not loaded before the init process executes the script that runs the sysctl commands. In those cases, it is necessary to force the module to be loaded earlier. The same goes for the IPv6 settings; the ipv6 module is usually not loaded until the network interfaces are brought up, which is typically after the sysctl variables are set by the procps.sh script.

(On Red Hat/Fedora systems no action needs to be taken; the /etc/init.d/network script automatically (re-)sets the sysctl variables before and after bringing up the interfaces.)

There are a few ways around this:

  1. Re-run the sysctl directives after the interfaces are brought up (and the appropriate modules are loaded). This method is probably the only option available to you if your system does no autoloading of modules.

    On Debian-based systems, this can be done by creating a symbolic link in /etc/rc2.d to re-run procps.sh after the network is brought up:

    root@tuxco# ln -s ../init.d/procps.sh /etc/rc2.d/S20procps.sh

  2. Pre-load the appropriate modules before the sysctl settings are applied.

    On Debian-based systems, the necessary modules can be pre-loaded by listing the appropriate modules in /etc/modules. The module-init-tools script (or modutils on older systems) will load the modules before the sysctl.conf entries are executed:

    # file: /etc/modules
    # load the kernel module for "mycard".
    mycard
    # load the ipv6 stack
    ipv6

    (As a curiosity, on Red Hat/Fedora systems this would be accomplished by creating one or more executable scripts in /etc/sysconfig/modules with names ending in .modules. The scripts should be proper shell scripts executing the appropriate commands to load and initialise the modules).

  3. Modify /etc/modprobe.conf (or the appropriate file in /etc/modprobe.d) and use the install directive to execute the relevant sysctl directives after loading the module. Although this is possible, we recommend against it, as it is far easier and clearer to use one of the alternative methods above.


11. Riverstone Configuration Hints

On Riverstone equipment, proxy ARP seems to be enabled by default, so you will need to disable it:

ip disable proxy-arp interface ifname

Here, ifname refers to your interface towards AMS-IX, or the string “all


12. Acknowledgements

Various people contributed to this document. We received configuration info from:

Aaron Weintraub (Cogent Communications)    Martin Pels (Support Net)
Andree Toonk (SARA)                        Miquel van Smoorenburg (Cistron)
Bart Peirens (Belgacom)                    Najam Saquib (Mediaways)
Bas Haakman (Multikabel)                   Niels Raijer (Demon)
Blake Willis (Neo Telecoms)                Pierfrancesco Caci (Telecom Italia Sparkle)
Edward Henigin (Giganews)                  Richard A Steenbergen (nLayer)
Erik Bos (XS4ALL)                          Ronald Esveld (Equant)
Greg Hankins (Force10)                     Santi Mercado (SARENET)
Jesper Skriver (TDC)                       Scott Madley (Level 3 Communications)
Jon Nistor (Rogers/TorIX)                  Thijs Eilander (Cobweb)
Kevin Day (Your.org)                       Tom Scholl (SBC)
Lucas van Schouwen (Eweka)                 Vincent Bourgonjen (Open Peering)
Martijn Bakker (Support Net)

Thanks to all those who contributed.