Firmware Development Kit Version 1.x

warn.gif

Warningblank.gifThis version of the FDK is being superseded by version 2.x, however this page will be left here for historical reference. The latest version of the FDK that this page is applicable to is 1.3.5.


Please refer this page for current FDK documentation.

Overview

The Cisco Nexus SmartNIC (formerly ExaNIC) FPGA development kit unlocks the FPGA technology within the SmartNIC, allowing customers to develop applications that run directly within the network card firmware. This allows for a number of interesting applications, some of which are demonstrated in examples provided with the development kit. The following examples come with the development kit, including the requisite source code for each:

  • A 'fast responder' example shows how to pre-load the card with a reply ahead of time, and send it based on a simple mask/pattern match over received frames.
  • A 'ping example' demonstrates various functionality, including sending frames directly from the card, making use of hardware timestamping, and using custom frames to communicate with software.
  • A 'steering example' demonstrates how to perform user-defined flow steering. A simple destination IP based flow steering example is provided, which can easily be modified to perform steering based on application layer information.
  • A 'bridging example' demonstrates how to bridge two ports together, such that traffic received on one port is transmitted out of another.
  • A 'loopback example' allows users to benchmark the MAC latency of the SmartNIC. This example simply sends a response packet on receipt of the first byte off the wire.

Installation

Prior to using the development kit customers must install Xilinx Vivado 2016.4 or later, which can be obtained from the Xilinx website.

Xilinx added support for the XCKU035 FPGA used in the Cisco Nexus SmartNIC K35-S (formerly X10) and Cisco Nexus SmartNIC K35-Q (formerly X40) to the free WebPACK license, so a paid Vivado license is not required to use the FDK on these cards.

The SmartNIC development kit ships as a tar file that contains a project directory structure. Untar the project directory structure to a convenient location.

Build System

The SmartNIC development kit ships with a build system for five fully functional example applications. The build system consists of a Makefile and a Vivado TCL script (compile.tcl). The Makefile launches Vivado and instructs it to run the TCL script. All project source files and the Vivado build process are defined within the TCL script.

The Makefile defines a number of targets and hardware platforms, which can be set by passing variables into the make command like so:

$ make TARGET=trigger PLATFORM=x10

The valid targets are trigger, ping, steer, bridging, loopback and the two hardware platforms are x10 and x40. In addition to the targets and platforms, there are also two build options:

  • NOREBOOT=1 will disable FPGA reload when the PCIe reset line is asserted. This is useful if you want to load a bitstream onto the FPGA via JTAG, then perform a system reboot.
  • NOTANDEM=1 will disable Xilinx tandem boot mode. Tandem has occasionally shown issues that result in systems failing power on self test for some builds. If you encounter any such issues building with this option may assist.

The build system will generate a number of files in the outputs/ directory, including a standard SmartNIC firmware image with a .fw extension that be can be flashed to a SmartNIC with the exanic-fwupdate utility.

Running exanic-config after flashing & rebooting will result in something similar to the following:

$ exanic-config
Device exanic0:
 Hardware type: ExaNIC X10
 Board ID: 0x00
 Temperature: 70.6 C   VCCint: 0.94 V   VCCaux: 1.86 V
 Function: customer application
 Firmware date: 20170106 (Fri Jan  6 01:30:05 2017)
 Customer version: 1485732321 (588f3d31)

The Firmware date listed above is the date this FDK was built by Cisco. The Customer version is the date/time the customer built this image. The date command can be used to convert this number to a human readable form if required:

$ date -d @1485732321
Mon Jan 30 10:25:21 AEDT 2017

Transmit Buffer Size

The "base" logic provided with the FDK (eg PCS/MAC) uses a fixed portion of the FPGA's resources. For example, the trigger_example built for the SmartNIC K35-S (X10) uses approximately 10% of total BRAMs, and when built for the SmartNIC K35-Q (X40) this increases to approximately 20%.

These BRAMs are used, in part, as transmit buffers for host software where packets are staged in the FPGA prior to transmission onto the network.

The default (per port) transmit buffer size is as follows:

  • SmartNIC K35-S (X10) stock firmware: 128 kByte
  • SmartNIC K35-Q (X40) stock firmware: 64 kByte
  • SmartNIC K35-S (X10) FDK: 32 kByte
  • SmartNIC K35-Q (X40) FDK: 32 kByte

The transmit buffer size is reported by exanic-config when passing the verbose (-v) flag:

$ exanic-config exanic0 -v
Device exanic0:
  Hardware type: ExaNIC X10
  Board ID: 0x00
  Temperature: 50.0 C   VCCint: 0.95 V   VCCaux: 1.85 V
  Function: network interface
  Firmware date: 20170116 (Mon Jan 16 22:01:01 2017)
  PPS out: disabled
  Port 0:
    Interface: enp1s0
    Port speed: 10000 Mbps
    Port status: enabled, no SFP, no signal, no link
    MAC filters: 64  IP filters: 128
    TX buffer size: 128kB
    MAC address: 64:3f:5f:01:29:32
    RX packets: 31019943  ignored: 0  error: 0  dropped: 0
    TX packets: 3000605026

These transmit buffer sizes are not scalable by customers using the FDK, however Cisco can rebuild the FDK with smaller/larger sizes on request. For example, if you needed more BRAMs for your custom logic and are prepared to have smaller TX buffers available for host software, we can shrink the buffers down to 16 kByte (per port). For architectural reasons it's not possible to reduce them down beyond 16k.

Licensing

The SmartNIC development kit can ship as a time limited demonstration version. After two hours of operation, features of the SmartNIC will progressively shut down and stop working. After this time the host can be rebooted/power cycled to reset the two hour timer. Should you wish to purchase a full license and have this time limitation removed, contact your Cisco sales representative to discuss licensing options.

Interfaces

The SmartNIC development kit provides full access to all network transmit and receive datapaths, as well as a register and memory space that can be accessed by the user's software application. At the top level of the design hierarchy exanic_devkit.v wraps both the SmartNIC IP core netlist and the custom user application. This top level wrapper provides connections between the SmartNIC netlist and the user's application. The example designs provide these connections and can be used as a starting point for adding further functionality.

459813.png

The user interface has the following clocking reset and status signals as inputs:

  • clk_net (1 bit), a 161 MHz clock generated from the Ethernet transmit clock. All signals with the _net suffix are synchronous to this clock.
  • clk_host (1 bit), a 250 MHz clock generated from the PCIE bus clock. All signals with the _host suffix are synchronous to this clock.
  • rst_n (1 bit), an enable line asserted soon after the clock is valid and present.
  • hw_time_net (32 bit), a counter that is shared with the timestamp counter for received packets, having 6.2 ns resolution. This counter is synchronous to the clk_net clock.
  • hw_time_host (32 bit), this is simply hw_time_net crossed into the host clock domain (clk_host).
  • devkit_version (32 bit) provides a UNIX timestamp indicating the time at which the SmartNIC netlist was built at the factory.

The user register interface allows the user application to implement up to 2048 readable and/or writeable 32 bit registers. On this interface, reads and writes happen a full 32 bit word at a time, with no individual byte enables. All signals in this section are synchronous to clk_host. This interface is implemented using the following signals:

  • reg_w_en (1 bit), asserted on the same cycle as reg_w_addr and reg_w_data to indicate a register write request from the host.
  • reg_w_data (32 bit), the write data from the host.
  • reg_w_addr (11 bit), the address of the register the host wants to write to. This address increments for each 32 bit word, and is not a byte offset.
  • reg_r_addr (11 bit), the address of the register the host wishes to read. As with the write address, this address increments for each 32 bit word.
  • reg_r_en (1 bit), a read enable signal asserted with reg_r_addr that indicates the address is valid.
  • reg_r_data (32 bit) is the data for the register selected by reg_r_addr. Data must be provided when reg_r_ack is asserted.
  • reg_r_ack (1 bit) must be asserted when reg_r_data is valid in response to a read. The user logic has 16 cycles in which to assert reg_r_ack in response to reg_r_en before the read times out. The PCIe logic will reply with an unsupported request TLP on timeout.

The user memory interface allows the application to implement a write-only (for the host) memory space. Reading back of this memory by the host is not supported. This can be useful for the implementation of transmit buffers and maps well to block memories. All writes are performed synchronous to clk_host. This interface is implemented using the following signals:

  • mem_w_en (32 bit), 32 bit write byte enable, asserted for each byte offset from mem_w_addr that the host wishes to write to. The LSB (bit 0) of the write enable signal refers to the byte at offset 0 from the write address.
  • mem_w_addr (19 bit), the memory offset at which the host intends to write to. This is the DWORD offset (32 bit) from the development kit region in BAR2.
  • mem_w_data (256 bit), up to 32 bytes of data (selected by the write enables) that the host wishes to write.

Note that the memory interface is always 'address aligned'. This means that in practice, mem_w_addr[2:0] is always zero, and the byte enables must be used to determine which bytes will be written to.

The user application can monitor all received frames and tag, modify, or drop them prior to them being transferred to the host. All received data is provided synchronous to clk_net. The first byte seen by the user application will be the first byte of the destination MAC address - the SFD and preamble are not provided. The receive data is provided via the following signals:

  • rx_data_net (64 bits per port), 8 data bytes as received obtained from the wire. Byte 0 (bits 7 to 0) is the first to be received off the wire.
  • rx_sof_net (1 bit per port), asserted on the same cycle as the first 8 bytes received from the wire. The 64 bits of data will contain the 6 byte destination MAC address and the first 2 bytes of the source MAC address.
  • rx_eof_net (1 bit per port), asserted on the very last cycle of a received frame. The last bytes seen will include the four byte received CRC.
  • rx_len_net (3 bit per port), asserted on the same cycle as the EOF, indicates how many bytes in the final data signal are valid. As an example, if this reads 1, then only the bottom byte of data (bits 7 to 0) is valid.
  • rx_vld_net (1 bit per port), asserted to indicate that receive data is valid. Due to the overhead of the 64b/66b encoding used in 10G ethernet, there may be cycles intra-frame that do not contain valid data.
  • rx_err_net (1 bit per port), asserted to indicate an abnormal frame termination condition. This can occur when the sender aborts the frame early, or if the link is lost in the middle of a frame. If a frame is terminated with rx_err_net, there will be no rx_eof_net for that frame.
  • rx_crc_fail_net (1 bit per port), will be asserted on the same cycle as EOF to indicate that the CRC does not match the received data.
  • rx_timestamp_net (32 bit per port), is a counter that serves as the timestamp for the first byte of the received frame with 6.2ns of resolution.

Note that the width of each of the above signals scales with the number of ports. To select the set of signals for a given port, use bit slicing. For example, RX data for port 0 will occupy rx_data_net[63:0], and RX data for port 1 will occupy rx_data_net[127:64]. The example designs show how to perform this bit slicing or indexing for each of the signals on this bus. This note applies to all of the ethernet frame interfaces in the FPGA development kit.

note.gif

Noteblank.gifA minimum of 1 spare cycle is required between EOF and SOF being asserted. Normally frames coming off the wire will have at least this (even at full line rate, due to ethernets minimum Inter-Frame Gap). However if the user application is generating frames to send up to the host in addition to those coming off the wire then this requirement must be observed by the user logic.


The receive net signals described above are all inputs, but the user application has a duplicate set of signals for the receive path that are outputs. These signals mirror the above signal names, but are suffixed with _host instead of _net. These host signals are synchronous to clk_host. These outputs map directly to the ethernet frame DMA engines for each port. To allow the port to function as a regular ethernet interface, the user can simply connect the net interface to its corresponding host interface using the provided asynchronous FIFO. There are two additional host signals:

  • rx_match_host (8 bit per port), allows the user application to tag frames with an 8 bit code with application specific meaning. This code will be provided in the information section of each chunk of the frame that is transferred to the host.
  • rx_buffer_host (6 bit per port), allows the user application to steer frames to different userspace buffers on the host system. For more information on custom flow steering, see the flow steering example design section of this document.

The example in the figure below shows a frame as it traverses the card frame interfaces. This frame has a destination MAC of 64:3F:5F:01:00:00 and a source MAC of 00:0F:53:01:01:01. Also, the ethertype value, 0x0800, indicates that it is an IP frame. Notice that the SOF flag is asserted only on the first beat, and the EOF and LEN flags are set on the last beat of the frame. The LEN field indicates that 3 bytes of the frame are valid on the last beat.

459810.png

An important change relative to the X2 and X4 FPGA development kits is the interpretation of len = 0 during the end of frame. This condition now means that no bytes are valid. This is different from the X2 and X4 kits, where this condition was used to signal that 8 bytes were valid.

The host transmit interface is also exposed to the SmartNIC development kit application. The user application can monitor and modify frames that are being transmitted by the host, as well as transmit frames of its own. Ethernet frames transmitted by the user application must start with the first byte of the destination MAC address, and end at the last byte of the payload. Logic within the SmartNIC automatically calculates, appends and transmits the CRC. The FPGA application has the following signals which connect through to the ethernet transmission logic, all of which are synchronous to clk_net:

  • tx_data_net (64 bits per port), the data to be transmitted, 8 bytes at a time. The first byte that will be placed on the wire is located at bits 7 to 0.
  • tx_sof_net (1 bit per port), to be asserted on the same cycle as the first 64 bits of data.
  • tx_eof_net (1 bit per port), to be asserted on the same cycle as the last 64 bits of transmitted data.
  • tx_len_net (3 bits per port), to be set on the same cycle as EOF, indicating the number of bytes of data valid in the last cycle.
  • tx_vld_net (1 bit per port), asserted by the user application when the transmit data output is valid. Once tx_sof_net is asserted, this signal must remain high until the end of frame.
  • tx_ack_net (1 bit per port), acknowledge signal provided to the user application. The SmartNIC can be considered to have read transmit data from the application for any rising clock edge during which 'ack' and 'vld' are simultaneously high.
  • tx_err_net (1 bit per port), is an output for the user application. This signal allows the user application to send a frame with an incorrect FCS. If this signal and tx_ack_net are high for one clock cycle while transmitting a frame, the frame will have an incorrect FCS.

A corresponding set of signals provide frames transmitted by the host. These have the same names as the above signals, but with _net replaced with _host. The host transmit signals are driven by the SmartNIC transmit engine, and provide access to frames sent from software. The host transmit signals are synchronous to clk_host. A port will act as a normal ethernet interface if the net signals are directly connected to the corresponding host signals using the provided asynchronous FIFO.

Normally, the SmartNIC transmit engine will pad frames sent down from host software that are below the minimum frame size (<64 bytes). The user application can elect to disable this padding on a per port and per frame basis by asserting the disable_tx_padding flag. The flag is sampled at each SOF.

The example design provides code that shows how to multiplex FPGA generated frames with the host data path. It also provides code for a state machine that conforms to this interface, reading frames from block RAM.

Software Integration

Low Level

The user application can interface with software via its address space, as well as via modifying and tagging received packets prior to them being transferred to the host. In the first instance, pointers to the register and memory address space can be obtained using libexanic, calling:

  • exanic_get_devkit_registers() to get a pointer to unsigned 32 bit values the register space, and
  • exanic_get_devkit_memory() to get a pointer to byte values in the memory space.

The value and meaning of the registers and memory in these address spaces are dependent on the user's FPGA application.

Utilities for reading and writing to the user register space are provided in the examples/devkit directory. For example, in trigger_example.v of the FDK, the registers are defined like so:

/* Register reads. */
always @ (posedge clk_host) begin
  reg_r_ack <= reg_r_en;
  case (reg_r_addr)
    'h0:    reg_r_data <= FIRMWARE_ID;
    'h1:    reg_r_data <= VERSION;
    'h2:    reg_r_data <= armed;
    'h4:    reg_r_data <= match_length;
    ...
    ...

FIRMWARE_ID is defined to be 32'hEB000001, so reading register 0 yields:

$ ./exanic-devkit-register-read exanic0 0
0x000: 0xEB000001 (-352321535)

The user application can also communicate with the host via dummy ethernet frames. An example of this is shown in the ping example application, where a dummy frame with a custom ethertype is DMA transferred to the host. This frame is received using libexanic and contains user-defined data.

TCP Stack Integration

The SmartNIC driver package includes support for exasock extensions. These extensions allow applications to obtain the next set of TCP headers for a particular socket. When used in conjunction with the development kit, these functions allow the host to manage TCP state (through transparently bypassed kernel sockets, via exasock) and allow the card to send 'fast' responses in response to user defined events.

Within the driver source tree, the exasock-tcp-responder-example.c example application shows how to use these functions with the trigger example firmware. This example shows how normal UNIX socket calls can be used to make a TCP connection to a server, with the card sending a TCP reply in response to a received UDP packet.

Included Cores

The SmartNIC FPGA development kit ships with source code for IP cores that are useful for performing common tasks.

Field Extract (field_extract.v)

The field extract core can be used to extract an arbitrary length field from received frames. To use the core, instantiate it by specifying the following two parameters:

  • BYTES: The byte width of the field to extract.
  • OFFSET: The offset in bytes of the field in the frame, measured from the start of the frame.

The core can be wired directly to the development kit frame interfaces via its data, sof and vld inputs. The field extract core will strobe the field_vld output for one clock cycle when the field output contains the value of field in the currently received frame.

Examples of using this core are shown in the ping and flow steering example applications.

Frame Mux (frame_mux.v)

The frame mux core provides a way to share a single frame output interface (for example, rx_host or tx_usr) between two sources of frames. It provides buffering so that interfaces that cannot be 'stalled', such as the receive interface, can be arbitrated without loss of data. A typical application is shown in the ping example application, where the frame mux is used to share the host DMA datapath between received frames and FPGA generated frames.

The frame mux also allows two ports to be 'bridged' together, much like the SmartNIC bridging functionality. As an example, the frame mux can be used to connect port 0 receive to port 1 transmit, whilst also allowing the host to transmit via port 1. In this mode of operation, the frame mux has an optional FCS removal mode. This is required because received packets are provided to the user application with the FCS present, however the FCS must be removed prior to passing them to the transmit interfaces.

The frame mux core has the following parameters:

  • DEPTH: The total buffering depth of the two FIFOs contained within the frame mux. This is the maximum number of QWORDs that the frame mux can store.
  • IN0_DELAY, IN1_DELAY: The amount of 'prebuffering' to apply to a particular input of the mux, prior to providing it to the output. This is useful when connecting the receive of one port to the transmit of another, since Ethernet clock mismatch may result in transmitter starvation unless enough of the packet is available in a buffer prior to beginning the transmit process.
  • STRIP_FCS0, STRIP_FCS1: selects whether to remove the last 4 bytes from a particular input. Useful for removing the FCS from a received packet prior to transmitting it out another port.

Valid/ack Bus Mux (vabus_mux.v)

The valid/ack bus mux core provides the same functionality as the frame mux core but without any buffering or registering delays. This is useful where latency is important. A typical application is the muxing of custom transmit logic together with the normal SmartNIC transmit logic. This use case is shown in both the trigger and ping examples.

Custom Framegen (custom_framegen.v)

The custom framegen core generates a custom, broadcast, ethernet frame, that contains 4 QWORDS that are set by inputs to the module. This is useful for generating packets on the card and sending them to the host application. An example of this is shown in the ping example application, where the custom framegen core is used to send timestamps to the host.

The CUSTOM_ETHERTYPE parameter to the module allows the user to specify the ethertype of the frame. Setting the ethertype to a non-standard value will result in normal kernel processes safely ignoring the packet.

Asynchronous FIFO (async_fifo.v)

The asynchronous FIFO provides fast clock domain crossing between two domains. Data is written into the FIFO synchronous to clk_write when wren is asserted by the user, provided the FIFO is not asserting full.

Data is read from the FIFO synchronous to clk_read, on any cycle when vld and rden are both asserted.

Flag Synchronizer (flag_sync.v)

The flag sync module is used to cross a single bit flag between two asynchronous clock domains. The flag should be asserted for a single cycle in the input clock domain. Internal logic will then safely cross this flag such that it will then be asserted for a single cycle in the output clock domain. Note that this module assumes that the flag will be asserted relatively infrequently in the input clock domain.

Example Designs

The full source code is provided for all of the example applications described in this section. In all of the following examples a convention is used whereby register zero (0) in the development kit register address space reports a 'firmware ID'. This firmware ID is read by the software side of the example to verify that the correct firmware is running on the SmartNIC.

Trigger Example

The trigger example application allows users to pre-load the card with a pattern, mask and reply frame. The application performs a match on port 0 of any incoming frame against the pattern and mask, and if a match occurs the application will transmit the reply frame. This application can be used as a starting point for more advanced custom logic.

All source code for this application is included in the src/trigger_example directory of the development kit package. The files include:

  • ram_256_64.v, which implements a block RAM interface compatible with the development kit memory addressing scheme.

Two sample C applications for interfacing with this application is provided in the SmartNIC driver package, under examples/devkit/. One example, libexanic-responder-example, shows how to use the low level API to preload the card with a frame. The other, exasock-tcp-responder-example shows how to use Exasock extensions to integrate the host TCP state with the FPGA application.

In both of these applications software application primes the FPGA trigger to match on incoming IP frames, and loads in a dummy reply. The application reports any time the FPGA logic has triggered. The libexanic application can be started using

$ ./libexanic-responder-example exanic0

The exasock example will attempt to connect to the specified TCP address/port combination. Once a connection has been established, any UDP packet that is received on the UDP port will trigger a 'hello world' packet to be sent via the TCP connection. The exasock application can be started using:

$ exasock ./exasock-tcp-responder-example <udp-port> <tcp-addr> <tcp-port>

Note that the example application is only implemented on the FPGA for port 0, and all ports operate as normal network interfaces.

Ping Example

The ping example uses an ICMP echo request to perform a hardware timestamped ping. The firmware takes a source IP address and destination IP address. This triggers a state machine to start by checking an ARP table for an entry that resolves the remote IP to a MAC address. If no entry for the IP address is found, the hardware sends an ARP request for the IP out on the wire and waits for a reply. When an ARP reply is received, an entry is inserted into the ARP table and the ARP table lookup performed again. The hardware then sends an ICMP echo request, filling the body of the request with a hardware timestamp, then waits for a reply. When the reply is received, the hardware sends a custom frame to the software application that contains the transmit and receive timestamps. Both ICMP and ARP requests have timeouts of 1 second associated with them, and will result in an error message sent to the host on timeout.

459815.png

The ping example demonstrates the following functionality within the devkit:

  • Sending pre-defined packets with values of certain fields substituted with values calculated in the FPGA. This is demonstrated in the ARP framegen and ICMP framegen modules.
  • Parsing received packets and extracting information from them. The ICMP echo parse and ARP parse modules demonstrate this functionality.
  • Communicating with the software application by sending a custom frame from the FPGA via the DMA interface. The custom framegen module and frame mux modules demonstrate how to interleave custom frames with frames that are received from the wire.
  • Basic lookup table example (ARP table).
  • Use of hardware timestamping functionality.

459816.png

To run the ping example, use:

$ ./ping-example <device> <dst-ip> <src-ip>

This will send ARP and ICMP packets originating from src-ip to the host at dst-ip. The device must be a SmartNIC with the ping example firmware loaded.

Flow Steering Example

The devkit can be used to perform flow steering based on any field within an ethernet frame. The raw frame API, libexanic, can be used to allocate DMA buffers, each of which is automatically assigned a unique ID. This ID can be passed to the card and provided to the RX host interface in conjunction with the frame in order to steer the frame to that buffer. Applications include per-symbol filtering of market data or more advanced, stateful filtering.

The flow steering example provided in the devkit demonstrates how to use this functionality to steer IP packets destined for a particular IP address to a designated buffer. Applications that monitor this buffer will only see packets that are destined for this IP address. Users can adapt this application to their requirements.

The rx_buffer_host port in the devkit can be used to pick the host receive buffer that the current frame is sent to. The value applied by the user application to this port must be ready at the same time as the 15 th valid data beat is applied to the corresponding rx_host interfaces, or at the end of frame, whichever occurs first. Once set, this value must remain the same for the duration of the frame until EOF+2 cycles.

504634.jpg

Bridging Example

The bridging example demonstrates the use of the frame mux for bridging of two ports on the card. Bridging involves looping back any received data on one port to the transmit datapath on another port.

Loopback Example

The loopback example demonstrates the latency of the SmartNIC MAC layer. It does this by sending a packet out of port 0 as soon as the start of frame is seen on the RX datapath of port 0. Note that this demo logic just sends a small frame of all 0xFF's (plus CRC) - it does not loopback the incoming frame.

Testbench and Functional Model

The SmartNIC development kit is provided with a full functional model for all of the individual interfaces. This can be found in the tb/ directory of the package. The testbench consists of the following files:

  • bench.v,the top level harness that wraps the various modules contained in the functional simulation.
  • address_access.v, contains tasks that simulate access to the BAR0 and BAR2 memory spaces in the development kit (for example, register access and memory copies).
  • control.v, contains various control tasks and generates the timestamp counter.
  • dma_sim.v, simulates the SmartNIC frame DMA interface. Will log frames that have been transferred successfully, and indicate error conditions.
  • transmit_sim.v, simulates the SmartNIC ethernet transmit interface. Will log frames that have been transferred successful and indicate error conditions.
  • frame_sim.v,simulates either host frame transmission or frames received from the wire.
  • test_cases.v, container for user test cases. Users can add their own simulation directives here.
  • bench.prj, a project file for the Xilinx simulator that lists all files that make up the simulation. New files for a project should be added here to make sure they are picked up by the simulator.
  • start_sim.sh, a shell script that starts the Xilinx simulator in console mode. To start in graphical mode, use the switch -gui.

The example in test_cases.v shows how users can exercise the various elements of the functional mode, and provides a test case for the example design. Users can add their own test cases to this file as necessary.

Users can start the example testbench by running:

$ ./start_sim.sh

This will cause the testbench to be compiled and xsim to start in command line mode. From the xsim prompt, the simulation can be run for 10 microseconds by entering:

% run 10us

Debugging with Vivado

You can use Xilinx Chipscope Pro Integrated Logic Analyzer (ILA) to debug your FPGA designs using JTAG. Xilinx documentation on how to use Chipscope for debugging can be found here.

The default build of the trigger_example includes a Chipscope core to probe the following signals: rx_sof_host, rx_eof_host and rx_vld_host.

The definition of any signals to be probed must include the tag (* mark_debug="true" *).

A TCL script can be used to insert the ILA core to the netlist. In the given example, debug.tcl is sourced to insert the core. The signals will be captured with respect to a clock which is specified by the user. This clock is specified with the command connect_debug_port (line 13 of debug.tcl file). The signals that need to be captured should be listed at the TCL script using connect_debug_port (line 16 of debug.tcl).

In the given example, the compile.tcl file is modified such that debug.tcl will be sourced after synthesis. The command which is required for the probes file to be written is at to the end of compile.tcl.

Using the ILA can be done through a local or remote JTAG connection, as described below.

Remote JTAG Connection (XVC Server)

Xilinx supports a remote connection from Vivado to a server using Xilinx's Virtual Cable XVC protocol, which then connects to the FPGA. Cisco has a modified version of Xilinx's xvcServer utility that can be used with SmartNICs.

In order to support JTAG over XVC, some logic needs to be added to the design. Users should add the flag JTAG=1 to the make command when building FDK's to add this logic.

note.gif

Noteblank.gifThe addition of this logic instantiates a MASTER_JTAG primitive in the design, which then disables the external JTAG interface. If external JTAG access is required again, it is necessary to revert to an image without the JTAG redirection (either by flashing a new image with exanic-fwupdate, or by using the recovery button).


Build and run the exanic-xvcserver utility, which can be found in the examples/devkit directory:

$ sudo ./exanic-xvcserver exanic0
Waiting for connection on port 2542...

In Vivado, open the Hardware Manager as shown below. Note you can also use the Tcl command open_hw. 459811.png

Start a Hardware Server session with the following command on Tcl Console connect_hw_server or by selecting Open Hardware Manager from the Flow menu.

In the Tcl console, issue the open_hw_target command to connect to the machine running exanic-xvcserver. For example:

open_hw_target -xvc_url 172.16.0.210:2542

exanic-xvcserver should report connection accepted and the xcku035_0 device should be listed in Vivado.

If your design has an ILA core, in the Trigger Setup window for the ILA core, click on the link to specify debug probes and select the .ltx probes file which will be in the outputs/ directory.

459848.png

note.gif

Noteblank.gifCisco has seen instances where MIGs and ILA cores are not listed under the xcku035_0 device alongside SysMon. If you expect to see a core and dont, right click on the xcku035 device and select refresh.


warn.gif

Warningblank.gifUsers should not attempt to configure the FPGA using the XVC server, as this relies on the FPGA to be configured to handle the JTAG shift instructions.


note.gif

Noteblank.gifexanic-xvcserver detects whether the JTAG logic has been inserted into the design and will not attempt to connect to an exanic without it.


Local JTAG Connection (Xilinx Platform Cable)

The Xilinx Platform Cable can be used for connecting via JTAG. Ensure to connect the Platform Cable to the machine running Vivado, and that the Platform Cable is connected to the SmartNIC. There is a small edge connector on the top right corner of the SmartNIC that exposes the JTAG pins, and an adapter cable can be supplied by Cisco to connect the SmartNIC to the Platform Cable.

459807.png

In Vivado, open the Hardware Manager as shown above.

Start a Hardware Server session with the following command on Tcl Console connect_hw_server or by selecting Open Hardware Manager from the Flow menu.

Then click on Open target and then Auto Connect. 459812.png

You should now see the xcku035_0 FPGA listed. In the Trigger Setup window for the ILA core, click on the link to specify debug probes and select the .ltx probes file which will be in the outputs/ directory.

459848.png

Using JTAG to Configure the FPGA

The recommended method for configuring the FPGA is to load the image/bitfile using the exanic-fwupdate utility, however configuration via JTAG is possible. The default behaviour of the SmartNIC is to reconfigure the FPGA when the host is reset. When loading on an image via JTAG, it's important to disable this automatic reboot mechanism, otherwise the image that's in flash will be reloaded into the FPGA.

This is done by adding the NOREBOOT=1 flag when building the image:

make PLATFORM=x40 TARGET=trigger NOREBOOT=1

Connect to the FPGA via JTAG using the Xilinx Platform Cable. Users should not attempt to configure the FPGA using the XVC server.

Right click on the xilinx device and click Program Device.

459817.png

Recovery Image

All SmartNICs come with a recovery flash image for cases where a corrupt flash image has been written to the card. To start the card in recovery mode, hold down the small button marked 'recovery' located on the top edge of the card during a reboot of the host system. The red LED on the rear panel of the SmartNIC will then be lit. When in recovery mode, the corrupt flash image can be overwritten by using the exanic-fwupdate utility.

Change History

v1.3.5, 19-Mar-2018

  • Fix bug where the lack of PXE expansion ROM could issue unsupported PCIe transactions to the host
  • Fix bug where FPGA would not be reconfigured after host reboot

v1.3.4, 23-Jan-2018

  • Relax timing by adding false paths for some paths that don't need to be timed
  • Fix 2 issues where (tandem) builds might fail to load causing NIC to not work on PCIe. (Would show up on lspci but not with exanic-config)
  • Reset PCIe core on reconfiguration (improves hot reload reliability)
  • Add support for JTAG over PCIe
  • Support building with Vivado 2017.4
  • Improve reading of some SFP/QSFP modules via I2C
  • PPS termination is now disabled by default

v1.3.3, 28-Nov-2017

  • Supports firmware update/reload without reboot (except eval FDK)
  • Add Chipscope core to trigger example default build
  • Fix bug where NOREBOOT=1 was ignored for user FDK builds
  • Add optional support for iPXE with FDK
  • Fix bug in bridging example where frames could be corrupted host was sending a frame at the same time as a frame was being bridged from the other port
  • Fix bug in testbench (frame_sim.v) where ACK was not properly processed
  • Added support for synth with Vivado 2017.3

v1.3.2, 9-Oct-2017

  • Signal flash_dq_tristate to exanic_x10_devkit and exanic_x40_devkit modules has been renamed flash_dq_drive. Normally exanic_*_devkit is instantiated from the exanic_devkit top level, which has been updated accordingly, but if you are using a modified top level then it will need to be updated.
  • Change to transceiver setting in PCIe core to improve compatibility with some systems. As a result, Vivado 2016.4 or later must be used.
  • Fix bug in tandem logic placement constraints
  • Fix bug in SmartNIC K35-Q (X40) FDK where QSFP status could be read incorrectly from host

v1.3.1, 15-Aug-2017

  • Improve Flash programming settings to address occasional programming failure
  • Fix bug in tx_disable_padding where padding was disabled for the frame after the next frame, not the next frame

v1.3.0, 6-Jul-2017

  • NEW SIGNAL: tx_disable_padding: per-port option to disable padding of <64 byte frames received from host software (safe to leave unconnected)
  • Added ability to reduce size of TX buffers available for host use, in order to free up more BRAMs for user logic.
  • Added buildlog to catch how/when the FDK was built
  • Added support for synth with Vivado 2017.2
  • Restored ability to build images that do not use TANDEM boot (NOTANDEM=1)
  • Fix bug where HW_TIME register could read incorrectly from software
  • Fix bug where transmit timestamps could occasionally be incorrect

v1.2.1

  • Fixed bug in frame_sim.v where EOF was set incorrectly

v1.2.0

  • NEW SIGNAL: tx_err_net: allows the user to intentionally corrupt the FCS (safe to leave unconnected)

v1.1.9

  • Add support to FDK for synth with Vivado 2017.1

v1.1.8

  • Add false paths for SmartNIC FDK logic to assist in timing closure

v1.1.7

  • Add support to FDK for synth with Vivado 2016.3

v1.1.6

  • Fix bug in flow steering logic for SmartNIC K35-Q (X40) - DMA address decode now applies to upper 4 ports as well.

v1.1.5

  • Added synthesis support for Vivado 2016.1, 2016.2, was previously 2015.4 only. Note that Cisco has observed instances where incorrect logic is synthesized using Vivado 2016.1 and 2016.2
  • Fixed bug in loopback_example.v (tx_ack_host was unconnected)

Changes versus 7 series development kit

The SmartNIC Ultrascale series of development kit includes a number of changes relative to the previous generation (ExaNIC X2 and X4) 7 series development kits. These changes have been made to reduce the latency of the SmartNIC MAC layer and increase the bandwidth of the PCIe interface.

The following are changes that will affect user designs that were targeted to the X2 and X4 development kits:

  • Inclusion of a network clock domain (clk_net), to which all the MAC layer is synchronous. Users who wish to bring across existing applications can make use of the included asynchronous FIFO to perform clock domain crossing.
  • Memory bus is now address aligned. This reduces the amount of logic required to decode addresses on the memory bus, and ensures that a write to a byte at a specific address will always have data located in the same byte lane.
  • Wider memory bus (32 bytes). The memory bus width has expanded to 32 bytes wide, with a corresponding increase in the write enable bus width.
  • Ethernet framing interface case EOF & LEN = 0 now means no bytes valid on that beat. In the previous generation of development kit, this condition meant 8 bytes were valid on that data beat. This change more closely reflects the 10G ethernet end of frame condition and results in slightly lower latency.
  • _usr signals have been renamed to _net to clarify that they are synchronous to the network clock domain (clk_net).
  • CRC fail signal is now valid on the same cycle as EOF. In the previous generation devkits, this signal was valid on the cycle after the EOF.
  • rx_ and tx_ interfaces are now provided as wide arrays rather as separate interfaces for each port. For example, the 64 byte data path for each port in the network domain is concatenated together. This means that port 0 is rx_data_net[63:0] and port 1 is rx_data_net[127:64], etc.
  • A new build system that allows users to specify the target and hardware platform, as described above.