Introducing Cisco Programmable Fabric (VXLAN/EVPN)

Introduction to VXLAN/EVPN

Introducing IP Fabric Overlays (VXLAN)

Motivation for an overlay

An overlay is a dynamic tunnel that transports frames between two endpoints. In a switch-based overlay, the tunnel endpoints reside on the switches themselves, and the architecture provides the following flexibility for spine and leaf switches:

  • Spine switch table sizes do not increase proportionately when end hosts (physical servers and VMs) are added to the leaf switches.

  • The number of networks/tenants that can be supported in the cluster can be increased by just adding more leaf switches.

How this is achieved is explained in detail later.


Note

For easier reference, some commonly used terms are explained below:
  • End host or server refers to a physical or virtual workload that is attached to a ToR switch.

  • A ToR switch is also referred to as a leaf switch. Since the VTEP functionality is implemented on the ToRs, a VTEP refers to a ToR or leaf switch enabled with the VTEP function. Note that the VTEP functionality is enabled on all leaf switches in the VXLAN fabric and on border leaf/spine switches.


VXLAN as the overlay technology for the Programmable Fabric solution

VXLAN is a MAC-in-IP/UDP overlay that allows layer-2 segments to be stretched across an IP core. All the benefits of layer-3 topologies are thereby available with VXLAN, including the popular layer-3 ECMP feature for efficient traffic spread across multiple available paths. The encapsulation and decapsulation of VXLAN headers is handled by functionality embedded in VXLAN Tunnel End Points (VTEPs). VTEPs themselves can be implemented in software or in a hardware form factor.

VXLAN natively operates on a flood-n-learn mechanism where BU (Broadcast, Unknown Unicast) traffic in a given VXLAN network is sent over the IP core to every VTEP that has membership in that network. There are two ways to send such traffic: (1) using IP multicast, or (2) using ingress replication (also called head-end replication). The receiving VTEPs decapsulate the packet and, based on the inner frame, perform layer-2 MAC learning. The inner SMAC is learnt against the outer source IP address (SIP) corresponding to the source VTEP. In this way, reverse traffic can be unicast toward the previously learnt end host.
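A minimal NX-OS-style sketch of how these two options are typically expressed on a VTEP is given below; the interface numbers, VNI values, and multicast group address are placeholders chosen for illustration, and the ingress replication variant shown relies on the BGP EVPN control plane described later in this chapter.

(config) #


interface nve1
  no shutdown
  source-interface loopback1
  ! Option 1: IP multicast carries BU traffic for this VNI
  member vni 30000
    mcast-group 239.1.1.1
  ! Option 2: ingress (head-end) replication carries BU traffic for this VNI
  member vni 30001
    ingress-replication protocol bgp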

Other motivations include:
  1. Scalability — VXLAN provides Layer-2 connectivity that allows the infrastructure to scale to 16 million tenant networks. It overcomes the 4094-segment limitation of VLANs. This is necessary to address today’s multi-tenant cloud requirements.
  2. Flexibility — VXLAN allows workloads to be placed anywhere, along with the traffic separation required in a multi-tenant environment. The traffic separation is done using network segmentation (segment IDs or virtual network identifiers [VNIs]).

    Workloads for a tenant can be distributed across different physical devices (since workloads are added as the need arises, into available server space) but the workloads are identified by the same layer 2 or layer 3 VNI as the case may be.

  3. Mobility — You can move VMs from one data center location to another without updating spine switch tables. This is because entities within the same tenant network in a VXLAN/EVPN fabric setup retain the same segment ID, regardless of their location.

Overlay example:

The example below shows how the VXLAN fabric overlay keeps spine switch tables lean; their sizes do not increase as end hosts are added.

VM A sends a message to VM B (they both belong to the same tenant network and have the same segment VNI). ToR1 recognizes that the source end host corresponds to segment x, searches and identifies that the target end host (VM B) belongs to segment x too, and that VM B is attached to ToR2. Note that typically the communication between VM A and VM B belonging to the same subnet would first entail ARP resolution.

ToR1 encapsulates the frame in a VXLAN packet, and sends it in the direction of ToR2.

The devices in the path between ToR1 and ToR2 are not aware of the original frame and route/switch the packet to ToR2.

ToR2 decapsulates the VXLAN packet addressed to it and does a lookup on the inner frame. Through its end host database, ToR2 recognizes that VM B is attached to it and belongs to segment x, and forwards the original frame to VM B.

Figure 1. VXLAN Overlay


  • VXLAN semantics are in operation from ToR1 to ToR2 through the encapsulation and decapsulation at source and destination VTEPs, respectively. The overlay operation ensures that the original frame/packet content is not exposed to the underlying IP network.

  • The IP network, which forwards packets from ToR1 to ToR2 based on the outer packet's source and destination addresses, forms the underlay operation. As per design, none of the spine switches need to learn the addresses of end hosts below the ToRs. So, learning of hundreds of thousands of end host IP addresses by the spine switches is avoided.

Learning of (hundreds of thousands of) end host IP and MAC addresses

One of the biggest limitations of VXLAN flood-n-learn is the inherent flooding that is required to ensure that learning happens at the VTEPs. In a traditional deployment, a layer-2 segment is represented by a VLAN that comprises a broadcast domain, which also scopes BU traffic. With VXLAN, the layer-2 segment now spans a much larger boundary across an IP core, where floods are translated to IP multicast (or HER). Consequently, the flood-n-learn based scheme presents serious scale challenges, especially as the number of end hosts goes up. This is addressed by using a control plane for the distribution of end host addresses. The control plane of choice is MP-BGP EVPN. By implementing MP-BGP EVPN with VXLAN, the following is made possible:
  • End hosts’ information is available to the attached ToR via First Hop Protocols such as ARP/ND/DHCP etc., when a new bare-metal server or VM is attached.

  • End host to ToR mapping information for each ToR is shared with every other ToR using BGP via a route reflector.

  • Specifically, within BGP, the EVPN address family is employed to carry MAC and IP address information of the end hosts along with other information such as the network and tenant (aka VRF) to which they belong. This allows optimal forwarding of both layer-2 and layer-3 traffic within the fabric.

  • VMs belonging to the same tenant might be many hops apart (though assigned the same segment ID/VNI), and there might be frequent movement and addition of end hosts. When a new VM comes up or is moved between ToRs, the information is instantly updated in BGP by the detecting ToR, thereby ensuring that the updated reachability information is also known to every other ToR.

  • In order to accurately route/switch packets between end hosts in the data center, each participating ToR in a VXLAN cluster must be aware of the end hosts attached to it and also the end hosts attached to other ToRs, in real time.

VXLAN-EVPN fabric — The overlay protocol is VXLAN and BGP uses the EVPN address family to communicate end host MAC and IP addresses; hence the fabric is referred to as a VXLAN-EVPN fabric.

More details on MP-BGP EVPN are provided in the Fabric Overlay Control-Plane (MP-BGP EVPN) section.

Realizing Layer-2 and Layer-3 Multi-Tenancy

Using segment IDs or VNIs for multi-tenancy in the VXLAN fabric

Typically, when a tenant is created, it is assigned a unique VNI referred to as the layer-3 VNI or layer-3 segment ID. This serves as a unique identifier for the tenant layer-3 context, also referred to as the tenant VRF. For each network created within the tenant, a unique identifier is assigned, referred to as the layer-2 VNI or layer-2 segment ID. The VNIs are all allocated from the same pool of 2^24 – 1 values represented by the 24-bit VNI field carried in the VXLAN header.

Figure 2. VXLAN Packet Format


Some Segment ID/VNI pointers are given below:

  • If a new VM or physical server for this tenant is added to the data center, it is associated with the same layer-3 VNI, regardless of the physical location. In addition, if it is part of a given tenant network, it is assigned the same layer-2 VNI that identifies that network.

  • By confining server and end host identification of a specific tenant to a unique VNI (or a few unique VNIs), segmentation and security are ensured.

  • By ensuring that the VNI-to-end host mapping information on each ToR is updated and shared through the route reflector, the latest information is available through the VXLAN setup.

  • Routing at the ToR/access layer facilitates a more scalable design, contains network failures, and enables transparent mobility.

Traffic between servers in the same tenant network that is confined to the same subnet is bridged. In this case, the VTEPs stamp the layer-2 VNI in the VXLAN header when the communication is between servers that are below different ToRs. The forwarding lookup is based on (L2-VNI, DMAC). For communication between servers that are part of the same tenant but belong to different networks, routing is employed. In this case, the layer-3 VNI is carried in the VXLAN header when communication is between servers below different ToRs. This approach is referred to as the symmetric IRB (Integrated Routing and Bridging) approach; the symmetry comes from the fact that VXLAN-encapsulated routed traffic in the fabric carries the same layer-3 VNI from source to destination and vice versa. This is shown in the figure below.

Figure 3. Inter Tenant Traffic Flow Using VRF VNI

In the above scenario, traffic from a server (with layer-2 VNI x) on VTEP V1 is sent to a server (with layer-2 VNI y) on VTEP V2. Since the VNIs are different, the layer-3 VNI (unique to the VRF) is used for communication over VXLAN between the servers.

Fabric Overlay Control-Plane (MP-BGP EVPN)

The main motivations for using BGP EVPN as the control plane are:

  • Standards based — The overlay (VXLAN) and the control plane (BGP) are standards based.

  • Implement control-plane MAC learning so that VMs/servers for each tenant have a unique identity across the fabric.

    In a VXLAN-EVPN based fabric, MAC learning occurs via the control plane [through multi-protocol (MP) BGP] instead of the data plane.

    When a new end host is attached to a VTEP (aka ToR), the VTEP advertises the MAC and IP address of the end host to a route reflector which in turn advertises it to the other VTEPs through MP-BGP (as shown in the image below). Since MP-BGP enables isolation of groups of interacting agents, VMs/servers that belong to the same tenant are logically isolated from other tenants.

Figure 4. End Host IP + MAC Address Distribution in a VXLAN Setup


The motivations for using BGP EVPN continue below:

  • Reduce flooding
    • Since the number of end hosts attached to VTEPs in a data center is huge, a mechanism is required to reduce flooding for discovery of end host location and resolution information. This is achieved via MAC/IP binding information distribution.

    • MAC address distribution eliminates (or reduces) unknown unicast flooding because MAC addresses are prepopulated.

    • MAC to IP binding information helps in local ARP suppression.

  • Distributed anycast gateway
    • For a given subnet, the same default gateway with the same IP and MAC address is realized simultaneously on appropriate ToR switches, thereby ensuring that the default gateway for the end hosts is always at its closest point, that is, its directly attached switch.

    • This ensures that routed traffic is also optimally forwarded within the fabric without going through any tromboning.

  • VM Mobility Support
    • The control plane supports transparent VM mobility within and across VXLAN BGP EVPN fabrics, and quickly updates reachability information to avoid hair-pinning of east-west traffic.

    • The distributed anycast gateway also aids in supporting transparent VM mobility since post VM move, the ARP cache entry for the default gateway is still valid.

  • Efficient bandwidth utilization and resiliency with Active-Active multipathing

    VXLAN is supported with virtual PortChannel (vPC). This allows resiliency in connectivity for servers attached to access switches with efficient utilization of available bandwidth. VXLAN with vPC is also supported for access to aggregation (leaf switch to spine switch) connectivity, promoting a highly available fabric.

  • Secure VTEPs

    In a VXLAN-EVPN fabric, traffic is only accepted from VTEPs whose information is learnt via the BGP-EVPN control plane. Any VXLAN-encapsulated traffic received from a VTEP that is not known via the control plane is dropped. This presents a secure fabric where traffic is only forwarded between VTEPs validated by the control plane, and it addresses a major security hole in data-plane-based VXLAN flood-n-learn environments, where a rogue VTEP has the potential of bringing down the overlay network.

  • BGP specific motivations
    • Increased flexibility— EVPN address family carries both Layer-2 and Layer-3 reachability information. So, you can build bridged overlays or routed overlays. While bridged overlays are simpler to deploy, routed overlays are easier to scale out.

    • Increased security— BGP authentication and security constructs provide more secure multi-tenancy.

    • Improved convergence time— BGP being a hard-state protocol is inherently non-chatty and only provides updates when there is a change. This greatly improves convergence time when network failures occur.

    • BGP Policies— Rich BGP policy constructs provide policy-based export and import of reachability information. It is possible to constrain route updates where they are not needed thereby realizing a more scalable fabric.

    • Advantages of route reflectors — Route reflectors increase scalability and remove the need for a full mesh of BGP sessions.

      A route reflector in an MP-BGP EVPN control plane acts as a central point for BGP sessions between VTEPs. Instead of each VTEP peering with every other VTEP, the VTEPs peer with a spine device designated as a route reflector. For redundancy purposes, an additional route reflector is designated.
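A minimal sketch of the corresponding spine-side route reflector configuration is shown below. The AS number 65535 matches the BGP example later in this chapter; the leaf neighbor address 10.1.1.1 is a placeholder, and one such neighbor statement would exist per VTEP.

(config) #


router bgp 65535
  ! each leaf/VTEP peers only with the route reflector(s)
  neighbor 10.1.1.1 remote-as 65535
    address-family l2vpn evpn
      send-community extended
      ! reflect EVPN routes received from this VTEP to all other VTEPs
      route-reflector-client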

End Host and Subnet Route Distribution

Some pointers about end host MAC and IP route distribution in a VXLAN EVPN fabric are given below:

  • When a new end host is attached to a VTEP (say Host A in the below scenario), the VTEP V1 learns the end host’s MAC and IP address. MP-BGP on the VTEP nodes enables advertising of the addresses (IP + MAC) to the route reflector.

    Figure 5. New End Host Attaches to a VTEP


  • MP-BGP also distributes subnet routes and external reachability information between VTEPs. When VTEPs obtain end host routes of remote end hosts attached to other VTEPs, they install the routes in their RIB and FIB.

Note that the end host route distribution is decoupled from the underlay protocol.

End host communication within a VNI and across VNIs

As we know, in a VXLAN EVPN fabric a unique Layer-2 VNI is assigned to each tenant network.

Recall, when two end hosts sharing the same layer-2 VNI communicate with each other, the traffic is bridged, and confined to a subnet. When two end hosts in different layer-2 VNIs communicate with each other, the traffic is routed and moves between subnets.

Furthermore, an end host retains its address and tenant association when it moves to another VTEP.

One tenant network, one Layer-2 VNI, and one default gateway IP and MAC address

Since end hosts in a tenant network might be attached to different VTEPs, the VTEPs are made to share a common gateway IP and MAC address for intra-tenant communication.

If an end host moves to a different VTEP, the gateway information remains the same and reachability information is available in the BGP control plane. 


Distributed IP anycast gateway

The gateway is referred to as a distributed IP anycast gateway, since it is distributed across all relevant VTEPs.

The gateway provides routing and bridging capabilities, and the mechanism is referred to as Integrated Routing and Bridging (IRB).

The distributed anycast gateway for routing is completely stateless and does not require the exchange of protocol signaling for election or failover decisions.

All VTEPs host active default gateways for their respective configured subnets, and First Hop Redundancy Protocols (FHRPs) such as HSRP and VRRP are not needed.

A sample distributed gateway for a setup, and the associated configurations are given below:

Figure 6. Distributed Gateway


Blue tenant network's configuration

Similar configuration needs to be implemented on leaf switch V1 and all other switches containing the blue tenant network’s end hosts. Since the VLAN configuration is local to each switch, the other switches can use the same VLAN ID or a different one.

VLAN to VNI mapping (MT-Lite)

(config) #


vlan 43 
    vn-segment 30000

The anycast gateway MAC, inherited by any interface (SVI) using “fabric forwarding”


(config) #


fabric forwarding anycast-gateway-mac 2020.0000.00aa  

Distributed IP anycast gateway (SVI) 


(config) #


interface vlan 43
  no shutdown
  vrf member VRF-A
     ip address 10.11.11.1/24 tag 12345
     fabric forwarding mode anycast-gateway

Red tenant network's configuration

VLAN to VNI mapping (MT-Lite)

(config) #


vlan 55 
  vn-segment 30001 

The anycast gateway MAC, inherited by any interface (SVI) using “fabric forwarding”

(config) #


fabric forwarding anycast-gateway-mac 2020.0000.00aa 

Distributed IP anycast gateway (SVI)

(config) #


interface vlan 55
  no shutdown
  vrf member VRF-A
     ip address 10.98.98.1/24 tag 12345
     fabric forwarding mode anycast-gateway

In the above example, a gateway is created for each of the two tenant networks (Blue – L2 VNI 30000 and Red – L2 VNI 30001). End host traffic within a VNI (say 30000) is bridged, and traffic between tenant networks is routed. The routing takes place through a Layer-3 VNI (say 50000), which typically has a one-to-one association with a VRF instance.

Forwarding between servers within a Layer-2 VNI

Figure 7. Packet Forwarding (Bridge)

The VNI of the source end host, Host A, and the target end host, Host B, is 30000.

  1. Host A sends traffic to the directly attached VTEP V1.

  2. V1 performs a lookup based on the destination MAC address in the packet header (for bridged communication, the DMAC field carries the target end host’s MAC address).

  3. VTEP V1 bridges the packets and sends them toward VTEP V2 with a VXLAN header stamped with the Layer-2 VNI 30000.

  4. VTEP V2 receives the packets and, after decapsulation and lookup, bridges them to Host B.

Packet forwarding between servers belonging to different Layer-2 VNIs

In the below example, the source and target end hosts (Host A and Host F) belong to different Layer-2 virtual networks (with VNIs 30000 and 30001). So, the traffic flow is between subnets and hence routed. The VRF VNI 50000 is used to route the traffic.

Figure 8. Packet Forwarding (Route)


A high level overview of the flow is given below:
  1. Host A sends traffic to its default gateway (post ARP resolution) which is configured on the directly attached VTEP V1.

  2. V1 performs a FIB lookup based on the destination IP address in the packet header.

  3. VTEP V1 routes the packets and sends them toward VTEP V2 with a VXLAN header stamped with the VRF (Layer-3) VNI 50000.

  4. VTEP V2 receives the packets and, after decapsulation, routing lookup, and rewrite, sends them to Host F.

Sample configurations for a setup with VNIs 30000 and 30001 are given below:

Configuration Example for VLAN, VNI, and VRF

VLAN to VNI mapping (MT-Lite)

(config) #


vlan 43 
 vn-segment 30000 

vlan 55
 vn-segment 30001

Allocate VLAN for VRF VNI

(config) #


vlan 500 
 vn-segment 50000

VRF configuration for "customer" VRF

(config) #


vrf context VRF-A
   vni 50000
   rd auto
   address-family ipv4 unicast
      route-target both auto evpn

Configuration Example for MP-BGP EVPN

BGP Configuration for VRF (routing)

(config) #


router bgp 65535
  address-family ipv4 unicast
  neighbor RR_IP remote-as 65535
    address-family ipv4 unicast
    address-family l2vpn evpn
      send-community extended

(config) #


vrf VRF-A
  address-family ipv4 unicast
    advertise l2vpn evpn

EVPN Configuration for VNI (bridging)

(config) #


evpn
  vni 30000 l2
    rd auto
    route-target both auto

Type Exit twice.

(config-evpn) #


  vni 30001 l2
   rd auto
   route-target both auto
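The VNIs configured above also need to be associated with the VTEP's NVE (overlay tunnel) interface. A minimal sketch is given below; the use of loopback1 as the source interface and of ingress replication for multi-destination traffic are assumptions for illustration, not part of the preceding example.

(config) #


interface nve1
  no shutdown
  ! host reachability is learnt via the BGP EVPN control plane
  host-reachability protocol bgp
  source-interface loopback1
  ! layer-2 VNIs for the two tenant networks
  member vni 30000
    ingress-replication protocol bgp
  member vni 30001
    ingress-replication protocol bgp
  ! layer-3 (VRF) VNI used for routed traffic
  member vni 50000 associate-vrf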

Routing at the VTEP - A high level view

Mandatory configurations

  1. A VLAN is configured for each segment: the sending segment, the VRF segment, and the receiving segment.

  2. BGP and EVPN configurations ensure redistribution of this information across the VXLAN setup.

Real time behavior

  1. The source VTEP receives traffic and makes the routing decision. It then stamps the packet with the associated VRF VNI while sending traffic to the destination VTEP, which in turn forwards traffic to the destination server.
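As a hedged illustration, a few representative NX-OS show commands that can be used to verify end host and subnet route distribution are listed below; command availability and output format vary by platform and release, and the VRF name VRF-A follows the earlier examples.

show nve peers
show l2route evpn mac all
show l2route evpn mac-ip all
show bgp l2vpn evpn
show ip route vrf VRF-A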

Communication between a VXLAN overlay and an external network

The data center interconnect (DCI) functionality is implemented on the border device (leaf or spine) of the VXLAN EVPN network. Depending on the type of hand-off to the outside network such as MPLS, LISP, layer-2, and so on, appropriate DCI configuration is required on the border device(s) and the connecting edge device(s) of the outside network.

The DCI functionality is detailed in the External Connectivity chapters/use cases.

ARP Suppression

The following section illustrates the ARP suppression functionality at VTEP V1 (refer to the ARP Suppression image below). ARP suppression is an enhanced function configured under the layer-2 VNI (using the suppress-arp command). Essentially, the IP-MAC bindings learnt locally via ARP, as well as those learnt over BGP-EVPN, are stored in a local ARP suppression cache at each ToR. An ARP request sent from an end host is trapped at the source ToR. A lookup is performed in the ARP suppression cache with the destination IP as the key. If there is a HIT, the ToR proxies the ARP response on behalf of the destination, using the destination's MAC address. This is the case depicted in the image below.

If the lookup results in a MISS, because the destination is unknown or is a silent end host, the ToR re-injects the ARP request received from the requesting end host and broadcasts it within the layer-2 VNI. This entails sending the ARP request out locally over the server-facing ports as well as sending a VXLAN-encapsulated packet with the layer-2 VNI over the IP core. The VXLAN-encapsulated packet is decapsulated by every receiving VTEP that has membership in the same layer-2 VNI. These receiving VTEPs then forward the inner ARP frame toward their server-facing ports. Assuming that the destination is alive, the ARP request reaches the destination, which in turn sends an ARP response toward the sender. Because the ARP suppression feature is enabled, the receiving ToR traps the ARP response even though it is a unicast packet directed to the source VM. The ToR learns the destination IP/MAC and in turn advertises it over BGP-EVPN to all the other ToRs. In addition, the ToR re-injects the ARP response packet into the network (VXLAN-encapsulating it toward the IP core, since the original requester was remote) so that it reaches the original requester.

Figure 9. ARP Suppression
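ARP suppression is enabled per layer-2 VNI under the NVE interface. A minimal sketch using VNI 30000 from the earlier examples is shown below; the interface number is a placeholder.

(config) #


interface nve1
  member vni 30000
    ! proxy ARP responses from the local ARP suppression cache
    suppress-arp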

Unknown unicast (packet) suppression

Typically, an unknown unicast scenario arises when an end host has already resolved ARP for a destination, but the destination's MAC address is not available or updated in the switch.

Unknown unicast traffic from an end host is by default flooded in the VLAN. It is possible to avoid flooding this traffic to the overlay network without affecting the flooding of this traffic on local host/server ports attached to the ToR switch. Use the suppress-unknown-unicast command to enable this behavior.

The suppress unknown unicast function is supported on ToRs/VTEPs in a VXLAN EVPN fabric. This function allows flooding of traffic within the attached switch by including local host/server ports attached to the ToR switch in the output interface index flood list (OIFL) and excluding overlay Layer-3 ports in the hardware.
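As a sketch only, the configuration below shows unknown unicast suppression enabled for VNI 30000 alongside ARP suppression; the exact configuration mode and support for this command vary by platform and release, so treat the placement shown here as an assumption.

(config) #


interface nve1
  member vni 30000
    suppress-arp
    ! flood unknown unicast only on local server ports, not into the overlay
    suppress-unknown-unicast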

Performing End Host Detection, Deletion and Move

End host detection by a VTEP device

Figure 10. End Host Detection


When a new end host (Host A) is attached to VTEP V1, the following actions occur:

  1. VTEP V1 learns Host A's MAC and IP address (MAC_A and IP_A).

  2. V1 advertises MAC_A and IP_A to the other VTEPs V2 and V3 through the route reflector.

  3. The choice of encapsulation (VXLAN) is also advertised.

A sample depiction of Host A related information is given below:

Table 1. Host A - Address Distribution Parameters

MAC, IP          L2VNI    L3VNI    Next-Hop    Encapsulation    Sequence
MAC_A, IP_A      30000    50000    IP_V1       VXLAN            0

Figure 11. Host move from V1 to V3


If Host A moves from VTEP V1 to V3, the following actions occur:

  1. V3 detects Host A and advertises it with Sequence 1 (updating the previous instance of the sequence, 0). The next-hop IP address is reassigned to that of VTEP V3.

    Table 2. Host A – Updated Parameters

    MAC, IP          L2VNI    L3VNI    Next-Hop    Encapsulation    Sequence
    MAC_A, IP_A      30000    50000    IP_V3       VXLAN            1

  2. VTEP V1 detects a more recent route and withdraws its advertisement.

Mobility in a vPC scenario

In a vPC scenario where two ToR switches are vPC peers, whether the end host is attached to an orphan port or has a dual-homed connection, the VIP address is advertised in the control plane and data plane, and the VIP address is carried in the (outer) source IP address field of the VXLAN packet.


Note

VIP is a common, virtual VTEP IP address that is used for (unicast and multi destination) communication to and from the two switches in the vPC setup. The VIP address represents the two switches in the vPC setup, and is designated as the next hop address (for end hosts in the vPC domain) for reachability purposes.
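As a hedged sketch of how the VIP is commonly realized, both vPC peers configure the same secondary IP address on the loopback used as the NVE source interface; the addresses below are placeholders (the primary address is unique per switch, the secondary address is the shared VIP).

(config) #


interface loopback1
  ! unique per-switch VTEP address
  ip address 10.10.10.11/32
  ! shared virtual VTEP address (VIP), identical on both vPC peers
  ip address 10.10.10.100/32 secondary

interface nve1
  source-interface loopback1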


Mobility across fabrics

To move end hosts (VMs) between VXLAN BGP EVPN fabrics, the fabrics should be connected through a DCI functionality. The various DCI functions are detailed in the External Connectivity chapters/use cases.

Multi-Destination Traffic

There are two options to transport tenant multi-destination traffic in the Programmable Fabric:

  1. Through a shared multicast tree using PIM (ASM, SSM, or BiDir).

  2. Through ingress replication (Available for Cisco Nexus 9000 Series switches only).

Refer to the table below for the mapping of Nexus switch type to the supported BUM traffic options:

If you are using this Nexus switch:       Use this option for BUM traffic:

Cisco Nexus 9000 Series                   Ingress replication or PIM ASM/SSM/BiDir
                                          (Note: Beginning in Cisco NX-OS Release 9.2(1),
                                          PIM BiDir is supported.)

Cisco Nexus 7000 and 7700 Series (F3)     PIM ASM/SSM or PIM BiDir

Cisco Nexus 5600 Series                   PIM BiDir
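Where the shared multicast tree option is used, the underlay must run PIM, and the per-VNI multicast group (the mcast-group assignment under the NVE interface shown earlier) must fall within the configured group range. A minimal PIM ASM sketch is given below; the RP address, group range, and interface are placeholders, and BiDir or SSM would require additional, platform-specific configuration.

(config) #


feature pim
! RP for the range of groups used by the VXLAN VNIs
ip pim rp-address 10.254.254.1 group-list 239.1.1.0/25

interface Ethernet1/1
  ! enable PIM on the underlay fabric links
  ip pim sparse-mode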