The Cisco Nexus SmartNIC FPGA Development Kit (FDK) Pro enables the development of applications directly within the network card firmware using FPGA technology. This kit includes a range of examples that demonstrate potential applications:
The FDK-Pro is designed for Linux environments and requires Xilinx Vivado version 2019.2 or later. It supports:
The FDK-Pro is distributed as a tar file that organizes the project directory for ease of use. To download:
The FDK v2 minimizes the exposure of unsynthesized Cisco source code to end-users, incorporating more of it within the encrypted netlist. It provides restricted access, revealing solely the source code relevant to the target example designs, alongside the RTL wrappers needed to connect user logic to the FDK's core functions or to the included AMD/Xilinx IP cores, such as the GTY transceivers and the PCIe core.
The FDK-Pro adopts a more open approach, providing increased access to the source code. This gives visibility into various FDK components and enables faster development and debugging. In this new offering, the PCS (Physical Coding Sublayer) and FastMAC are still shared as an encrypted netlist.
The compile_common.tcl script covers much of the scripting in the build process. The next section gives an overview of the script, its functionality, role within the flows, and how it integrates into the overall system.
This is a shared script sourced during the building of packages in the FDK-Pro builds. It manages common build parameters across different builds. It extends variables common across the build scripts, like verilog_files or synth_args without overriding any value set previously by other scripts in the build process.
The FDK-Pro architecture has a new directory structure. The directory structure shared is shown below:
The list of files and directories inside /src is shown below:
The ultranic_rx module processes the data stream from the MAC layer, segmenting it into 128B chunks for DMA pass-through. Each chunk carries 120B of frame data, which allows an early look at the first 120B of the frame, and 8B of metadata used in the DMA control path. These chunks are given to the DMA engine and, in parallel, the traffic is directed to the appropriate RX buffers. It supports a DMA width of 256 bits with PCIe Gen3.
The components included for this packet processing are:
rx_chunk_assembler: This module receives data from the MAC layer and is responsible for assembling chunks of data of size 128B (15 QWORDs off the wire + 1 QWORD footer) in ultranic_rx in preparation for DMA transfers. It writes these assembled chunks into an intermediate RAM, which interfaces directly with the DMA engine. Additionally, this module selectively blocks packets not intended for our MAC address, unless promiscuous mode is enabled.
rx_buffer: This module acts as a dual-port memory, facilitating efficient data storage and retrieval of chunks of the data.
flow_steer: Module for matching inbound packets to rules. For each received frame, this module indicates whether it matches an IP or MAC flow steering rule programmed by software, i.e. a BCAM steers packets to different buffer rings based on matches against a 5-tuple of the MAC addresses, IP addresses and the IP protocol (UDP or TCP). It also keeps track of the current {gen_id, chunk_id} in each DMA ring, and provides a software interface for programming rules.
rx_dma_engine: The DMA engine moves data from the RX buffer to system memory by organizing the data into 256-bit-wide DMA data packets and generating a DMA valid signal to indicate the presence of valid data.
Additionally, a few of the exanic_rx files, such as frame_info, are exposed in FDK Pro architecture, though the core functionality of the DMA is sourced from files in the ultranic_rx folder. All of the frame information for flow_steer module, including the header structure, types, and values, can be obtained from this frame_info module.
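As a rough illustration of the chunk layout described for rx_chunk_assembler, the sketch below estimates how many 128B chunks a received frame occupies, assuming each chunk carries up to 15 QWORDs (120B) of frame data plus the 1 QWORD footer. The helper name and the treatment of the frame length (for example, whether it includes the FCS) are assumptions for illustration and are not taken from the FDK sources.

```python
QWORD_BYTES = 8
DATA_QWORDS_PER_CHUNK = 15    # QWORDs taken off the wire per chunk
FOOTER_QWORDS_PER_CHUNK = 1   # metadata footer used by the DMA control path

CHUNK_DATA_BYTES = DATA_QWORDS_PER_CHUNK * QWORD_BYTES                    # 120
CHUNK_BYTES = CHUNK_DATA_BYTES + FOOTER_QWORDS_PER_CHUNK * QWORD_BYTES    # 128


def chunks_for_frame(frame_len_bytes: int) -> int:
    """Number of 128B chunks a frame of this length would occupy.

    Hypothetical helper: assumes every chunk carries at most 120B of data.
    """
    if frame_len_bytes <= 0:
        raise ValueError("frame length must be positive")
    # Ceiling division without floating point.
    return -(-frame_len_bytes // CHUNK_DATA_BYTES)


if __name__ == "__main__":
    for length in (64, 120, 121, 1514):
        n = chunks_for_frame(length)
        print(f"{length:5d} B frame -> {n} chunk(s), {n * CHUNK_BYTES} B in the DMA ring")
```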
Ultranic_tx:
This module serves as the transmission pathway in UltraNIC, facilitating the transfer of data from the host to the Tx PCS layer. It provides a memory write interface for injecting frames and a transmit trigger queue to initiate transmissions. The module comprises the transmit buffer, command queue, transmit engine, and flag synchronization modules.
Transmit Buffer: The transmit buffer interfaces with the host via an address-aligned PCIe interface, with its data depth parameterized. Its storage capacity is determined by the TX_RAM_DEPTH, which sets the buffer size. Both ports feature an aperture of 256 bits.
Command Queue: This component takes commands from the register interface and generates the control signals for the transmit engine. Internally it has a Gray-coded circular FIFO for both clock domain crossing and command buffering. It uses one queue entry per 512 bytes of transmit buffer, so the depth of the command queue depends on TX_RAM_DEPTH:
CMD_QUEUE_DEPTH = (TX_RAM_DEPTH * RAM_WIDTH_BYTES / 512)
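The formula can be checked with a few lines of Python. This is a minimal sketch: RAM_WIDTH_BYTES is assumed to be 32 (matching the 256-bit aperture of the transmit buffer), and the TX_RAM_DEPTH values used are hypothetical examples rather than defaults from the FDK.

```python
def cmd_queue_depth(tx_ram_depth: int, ram_width_bytes: int = 32) -> int:
    """CMD_QUEUE_DEPTH = TX_RAM_DEPTH * RAM_WIDTH_BYTES / 512.

    ram_width_bytes defaults to 32 on the assumption of a 256-bit wide RAM.
    """
    total_bytes = tx_ram_depth * ram_width_bytes
    return total_bytes // 512


if __name__ == "__main__":
    # Hypothetical depths, chosen only to show how the queue depth scales.
    for depth in (1024, 2048, 4096):
        print(f"TX_RAM_DEPTH={depth}: CMD_QUEUE_DEPTH={cmd_queue_depth(depth)}")
```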
Transmit Engine: Responsible for handling data from the transmit buffer, the transmit engine interprets commands specifying the chunk header address offset. It reads the header at that offset and uses it to complete the send operation.
Manages the register interface for the Ultrascale Series of ExaNICs.
This module acts as a PCIe memory interface that connects the AXI stream interface with various local interfaces. It is intended for use with a Gen3 x8 interface that is 256 bits wide at 250 MHz, achieving a DMA bandwidth of approximately 64 Gbit/s.
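The quoted bandwidth follows directly from the bus width and clock rate. The short sketch below reproduces that arithmetic and, for comparison, the raw PCIe Gen3 x8 link rate (8 GT/s per lane with 128b/130b encoding), before any TLP or protocol overhead. The function names are illustrative only.

```python
def interface_bandwidth_gbps(width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth of a parallel bus in Gbit/s."""
    return width_bits * clock_hz / 1e9


def pcie_gen3_link_gbps(lanes: int) -> float:
    """Raw PCIe Gen3 link rate in Gbit/s: 8 GT/s per lane, 128b/130b encoding."""
    return 8e9 * lanes * 128 / 130 / 1e9


if __name__ == "__main__":
    print(f"256-bit bus at 250 MHz: {interface_bandwidth_gbps(256, 250e6):.1f} Gbit/s")
    print(f"PCIe Gen3 x8 raw link : {pcie_gen3_link_gbps(8):.2f} Gbit/s")
```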
The Accelerated TCP Engine (ATE) is a hybrid SW/HW TCP transmit engine that allows user logic to generate TCP frames from hardware and adds "0" clock cycles of latency. ATE implements a part of TCP in conjunction with exasock, the kernel bypass library for Cisco Nexus SmartNIC (formerly ExaNIC) cards. The software establishes a TCP connection and continues to provide the relevant connection state information to the firmware. Header generation, checksum calculation, as well as send and ack sequence numbers are thus handled internally by ATE without the involvement of custom user firmware or software.
ATE is designed to provide the lowest latency and smallest device footprint possible. The firmware only contains logic that is strictly necessary to send frames back-to-back at the lowest latency. Connection establishment, teardown, input handling, ACK handling, and windowing are all performed in software. As a result, ATE adds no additional latency over the standard TX MAC interface and requires roughly 1300 LUTs, 9 block RAMs and one UltraRAM for a single port capable of 512 independent connections.
Following are the inputs to the tcp_engine:
The critical path is from the hw_tcp_payload_*_net signals to the tx_*_net signals.
raw_frame_padder: This block receives input from ultranic_tx and adds required "0" padding to the input data when padding is disabled in the transmit engine.
host_to_net_data_inject: The host_to_net_data_inject block aligns the data, which the DMA engine outputs a cycle after the Start of Frame (SOF), with SOF. This block also handles host to net clock conversion and converts data from 64-bit to 32-bit width.
header_formatter:
The header formatter is responsible for serializing out the MAC, IP and TCP headers. Header fields come from three sources: hardcoded constants, calculated fields, and fields stored in UltraRAM. The per-connection register space is:
Offset | Fields |
---|---|
0 | dest_mac |
1 | dest_mac, source_mac |
2 | source_mac |
3 | ethertype, beginning of IP header |
4 | length, identification |
5 | flags, frag, ttl, protocol |
6 | checksum, source_ip |
7 | source_ip, dest_ip |
8 | dest_ip, source_port |
9 | dest_port, seq_num |
10 | seq_num, ack_num |
11 | ack_num, flags |
12 | window_size, tcp_checksum |
13 | urgent pointer, payload |
The calculated fields are the IP length, TCP length, IP header checksum and TCP checksum. User firmware supplies the payload length. The IP and TCP lengths are calculated by adding different constants to this number. The IP header checksum is a function of only the fields in the IP header. The TCP checksum requires a partial checksum of the payload so this needs to be supplied by the user firmware a few cycles in.
The non-constant, non-calculated, parts of the headers are stored in ultraram. Once the header serialization is finished, it switches the output mux over to user firmware so that it can provide the payload. It also starts acking the user firmware, so it knows to progress the stream. In parallel, it adds the current packet size to the sequence number stored in ultraram so that it has the correct value ready for the next packet.
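The calculated fields can be modeled in software as follows. This is a sketch, not the ATE implementation: it assumes 20-byte IP and TCP headers with no options (consistent with the 54 bytes of header in the table above), uses the standard one's-complement Internet checksum for the IP header, and wraps the sequence number at 32 bits. All function names are hypothetical.

```python
IP_HEADER_BYTES = 20   # assumed: no IP options
TCP_HEADER_BYTES = 20  # assumed: no TCP options


def header_lengths(payload_len: int) -> tuple[int, int]:
    """Return (IP total length, TCP segment length) for a given payload length."""
    tcp_len = payload_len + TCP_HEADER_BYTES
    ip_len = tcp_len + IP_HEADER_BYTES
    return ip_len, tcp_len


def ipv4_header_checksum(header: bytes) -> int:
    """One's-complement checksum over an IPv4 header whose checksum field is zero."""
    if len(header) % 2:
        raise ValueError("header must contain a whole number of 16-bit words")
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def next_seq_num(seq_num: int, payload_len: int) -> int:
    """Sequence number for the next packet, wrapping at 32 bits."""
    return (seq_num + payload_len) & 0xFFFFFFFF


if __name__ == "__main__":
    print(header_lengths(100))
    hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
    print(hex(ipv4_header_checksum(hdr)))
    print(hex(next_seq_num(0xFFFFFFF0, 0x20)))
```

The TCP checksum is omitted here because, as noted above, it depends on a partial payload checksum supplied by user firmware.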
tx_data_tap: This block receives tx_data_net as input and does the width conversion from 32 bit to 64 bit and the clock domain crossing from clock net to clock host.
rx_data_tap: This block receives rx_data_net and performs the same width conversion and clock domain crossing as tx_data_tap.
In the FDK-Pro, there are two significant variants: VARIANT and PCS_VARIANT.
In FDK-Pro, in the compile_common script, the pcs_variant is dynamically generated by filtering out non-PCS related substrings from the VARIANT.
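The filtering idea can be modeled in a few lines of Python. The actual compile_common.tcl logic is not reproduced here; this is only an illustration. The list of host-side substrings is taken from the Makeconf-configurable variants named elsewhere in this document, while the regular expression, the function name, and the assumption that such substrings are joined to VARIANT with underscores are hypothetical.

```python
import re

# Host-side (non-PCS) variant substrings; N stands for a decimal number.
_NON_PCS = re.compile(
    r"_(?:hw_time64|txbuf\d+|rxbuf\d+|rxhostwidth256|"
    r"ip_rules\d+|mac_rules\d+|no_dma|enable_ate)(?=_|$)"
)


def pcs_variant(variant: str) -> str:
    """Strip host-side substrings from VARIANT, leaving the PCS-related part."""
    return _NON_PCS.sub("", variant)


if __name__ == "__main__":
    for v in ("full_fastmac", "full_fastmac_txbuf64", "full_multirate_no_dma_ip_rules16"):
        print(f"{v} -> {pcs_variant(v)}")
```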
The synthesis arguments are the arguments applied during the synthesis of general, non-PCS components.
Cisco provides a specific license file required by Xilinx Vivado to synthesize the FDK Pro development kit.
For local installation, place the license file in your ~/.Xilinx/ directory and ensure the ~/.flexlmrc file includes XILINXD_LICENSE_FILE=/home/username/.Xilinx to set the license search path correctly.
For installation on a license server, modify the license file by replacing Xilinx_SERVER with the server's name and SERVER_PORT with the port number, typically 2100. Start the FlexLM License Server with the command:
lmgrd -c Xilinx.lic:exablaze_fdk.lic
If Vivado fails to find the license, it will generate a synthesis error indicating a missing license for the netlist cell 'exanic_v5p_devkit', instantiated as 'exanic_x10_devkit_inst'. Refer to the Xilinx documentation and the FAQ for more details on the Vivado licensing system.
When generating a license, Cisco will request your "host ID". The host ID is the MAC address of any physical network interface on your license server (for network/floating licenses) or the host running Vivado (for node-locked licenses). The MAC address can be determined using the ifconfig or ip addr commands, e.g.:
$ ifconfig
...
enp8s0: flags=4163 mtu 1500
        inet 10.0.0.100 netmask 255.255.255.0 broadcast 10.0.0.255
        inet6 fe80::ae22:bff:fe78:184c prefixlen 64 scopeid 0x20
        ether **ac:22:0b:78:18:4c** txqueuelen 1000 (Ethernet)
...
$ ip addr
...
2: enp8s0: mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether **ac:22:0b:78:18:4c** brd ff:ff:ff:ff:ff:ff
...
The development kit includes a build system comprising a Makefile and a Vivado TCL script (compile.tcl) for various fully functional example applications. To initiate, source the Vivado environment using:
$ source /opt/Xilinx/Vivado/2019.2/settings64.sh
The Makefile requires specifying PLATFORM, TARGET, and optionally VARIANT. Run make without arguments to view available options for these parameters. PLATFORM designates the target card (e.g., x25, x100 or v5p). TARGET specifies the application to build, such as:
The source code for each application is contained in its own directory under the src/ directory. Users can create their own targets just by creating a new directory under src/.
VARIANT specifies the 'variant' of the netlist to use. Each netlist may be compiled with different options, for example full_multirate is a netlist that contains multirate (1/10G) support at the expense of an extra cycle of latency. If VARIANT is not provided, the default netlist will be used.
To build the native_loopback_example for the SmartNIC+ V5P, using the full_fastmac variant:
$ make PLATFORM=v5p TARGET=native_loopback_example VARIANT=full_fastmac
In the FDK-Pro, when constructing a design with the full_fastmac_txbuf64 variant, it suffices to specify the VARIANT as full_fastmac. This adjustment is necessary as the txbuf64 component does not affect the PCS logic but is solely relevant to host-side operations.
Moreover, within the FDK-Pro, the tx_buf size is specified in the Makeconf file instead. The non-PCS variants (variants related to the host) are configured in the Makeconf file inside the package before the build. The variants that can be configured through Makeconf are hw_time64, txbufN, rxbufN, rxhostwidth256, ip_rulesN, mac_rulesN, no_dma, enable_ate.
In addition to the target, platform and variant, there are also several optional build flags:
These parameters are configurable in the Makeconf file, which includes explanations for each. Adjustments to these settings should be considered based on specific hardware behaviors and troubleshooting needs.
The build system outputs several files, including a SmartNIC firmware image (.fw) in the outputs/ directory. This firmware can be updated on a SmartNIC using the exanic-fwupdate utility. After updating and rebooting, executing exanic-config will display the system configuration and status like the following:
`$ exanic-config exanic1
Device exanic1:
Hardware type: ExaNIC V5P
Serial number: 643F5F01956C
Temperature: 34.2 C VCCint: 0.85 V VCCaux: 1.81 V
Fan speed: 5835 RPM
Function: customer application
Firmware date: 20240502 (Thu May 2 06:55:54 2024)
Customer version: 1714632954 (663338fa)
External 12V power: detected`
The hot reload feature allows for FPGA reload/reconfiguration without rebooting the host system.
The firmware date indicates when Cisco built the FDK, while the customer version shows when the image was built by the customer. Use the date command to convert these timestamps into a readable format if needed:
$ date -d @1714632954
Thu May 2 16:55:54 AEST 2024
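The same conversion can be done in Python, which also shows how the two forms of the customer version in the exanic-config output relate: the decimal value is a Unix timestamp and the value in parentheses is the same number in hexadecimal. The function name is illustrative.

```python
from datetime import datetime, timezone


def decode_customer_version(version: int) -> tuple[str, datetime]:
    """Return the hex form and the UTC build time of a customer version value."""
    built_utc = datetime.fromtimestamp(version, tz=timezone.utc)
    return format(version, "08x"), built_utc


if __name__ == "__main__":
    hex_form, built = decode_customer_version(1714632954)
    print(hex_form)            # hex form shown in parentheses by exanic-config
    print(built.isoformat())   # build time in UTC
```

The UTC result is ten hours behind the AEST time printed by the date command above.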
To create a new target application in the development kit, customers can start by duplicating and modifying an example design or developing a new one from scratch. This involves creating a new directory under src/, e.g., src/my_app, and including the necessary files.
The following is an example of a config.tcl file that demonstrates use of the available configuration options:
set app_ports {regif memif net_rx net_tx host_rx host_tx user_led}
set app_physical_ports {0 1}
set net_data_width 32
set clocking_model native
set debug_clk clk_tx
The configuration files are essential for defining how the application interacts with Cisco IP and system hardware. The following are the configuration options currently available:
Value | Description |
---|---|
host | All buses are synchronous to clk_host. |
dual | The dual clocking model synchronizes network interface buses to a single network clock (clk_net) for all port transactions and uses a separate host clock (clk_host) for host-side interfaces. This mode exists for compatibility with old FDKs and should not be used for new projects. Currently only supported with net_data_width = 64. |
native | Network receive data is synchronized to clk_rx_net[*], with a unique clock for each RX port. Network transmit data uses clk_tx_net, shared across all TX ports, while other interfaces use clk_host. This model requires a net_data_width of 32. |
Avoid using the "dual" clocking mode in new projects due to its focus on compatibility with older FDK versions. Choose between "host" or "native" modes based on your latency and usability needs. Host mode is suitable if your workflow involves frequent host interactions, providing latency similar to native mode but with easier clock domain management. It adds one clock cycle of latency when using a net_data_width of 32 due to a stream pipeline in the TX path. Native mode is optimal for high-performance requirements where the host is not part of the critical data path, but it requires manual management of all clock domain crossings.
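The width constraints stated for each clocking model can be captured in a small pre-build check. This is a hypothetical helper, not part of the FDK build system; it encodes only the two constraints given in the table above (dual requires a width of 64, native requires a width of 32) and makes no claim about which widths host mode accepts.

```python
def check_clocking(clocking_model: str, net_data_width: int) -> list[str]:
    """Return a list of problems with a clocking_model / net_data_width pair."""
    problems = []
    if clocking_model not in ("host", "dual", "native"):
        problems.append(f"unknown clocking_model '{clocking_model}'")
    elif clocking_model == "dual" and net_data_width != 64:
        problems.append("dual is currently only supported with net_data_width = 64")
    elif clocking_model == "native" and net_data_width != 32:
        problems.append("native requires a net_data_width of 32")
    return problems


if __name__ == "__main__":
    for model, width in (("native", 32), ("native", 64), ("dual", 32), ("host", 32)):
        issues = check_clocking(model, width)
        print(model, width, "OK" if not issues else issues)
```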
The SmartNIC development kit provides extensive access to transmit and receive datapaths, and a user-accessible register and memory space. The top-level file, such as exanic_v5p_devkit.v for v5p, serves as a wrapper that integrates the SmartNIC IP core netlist with custom user applications. This setup ensures essential connections between the netlist and the user's application for seamless integration. The example designs included in the kit, which come with pre-established connections, offer a solid foundation for users to enhance and adapt for additional functionalities. Similar structures are used across different platforms, with variations in filenames.
The user interface for the SmartNIC development kit includes key clocking and reset signals:
The SmartNIC development kit includes several optional ports, which are available based on the app_ports settings:
The user register interface supports up to 2048 readable and/or writable 32-bit registers, where all read and write operations use full 32-bit words without byte-specific enables. All signals are synchronized to clk_host. The interface utilizes the following signals:
The user memory interface in the SmartNIC development kit is designed for write-only operations from the host, suitable for storing transmit buffers without support for host read-back. It operates synchronized to clk_host with key signals including:
This interface maintains address alignment, ensuring mem_w_addr[2:0] remains zero and byte enables dictate write locations. Memory mapping features non-cached and write-combining attributes, causing potential reordering and combining of writes in the CPU’s buffer. Memory state synchronization with the FPGA can be achieved by flushing the write buffer through a register write, guaranteeing memory state accuracy as seen by the firmware.
Users can enable additional BARs beyond the standard allocations if their application requires more memory spaces. BAR1 is configured similarly to the register space, and BAR4 mirrors the behavior of the memory space. These extra BARs are reserved exclusively for user applications and can be resized or reconfigured as needed by editing the .xci files located in the ip/ directory of the FDK archive. For guidance on utilizing these additional BARs, refer to the provided software and firmware examples described under "extra BARs example." The behavior of signals in these additional spaces mirrors those in the standard register and memory interfaces. They are listed below for convenience.
The user application interfaces with the Cisco low-latency MAC to send and receive packets on the network. However, certain interface signals like rx_early_sof_net, tx_eof_no_crc_net, and tx_abort_frame_net are only available with the 10G PCS/MAC, not the 100M & 1G versions. Key signals for received data include:
Signals are synchronized to clk_rx_net in native mode, clk_net in dual mode, and clk_host in host mode. Signal width scales with port numbers, and bit slicing is used for signal selection per port. The application must process packets at line rate, as there is no mechanism to apply back pressure.
The SmartNIC development kit provides a transmit interface that allows the user application to both monitor and manipulate outgoing Ethernet frames, as well as transmit its own frames. All Ethernet frames from the user must start with the destination MAC address's first byte and end at the payload's last byte. The SmartNIC handles CRC calculations and appending automatically. The FPGA application has the following signals which connect through to the Ethernet transmission logic:
These signals are synchronized to clk_tx_net in native mode, clk_net in dual mode, and clk_host in host mode. It is important to note that there is no mechanism to stall packet transmission once started, and the tx_ack_net may drop unexpectedly. For the 100M and 1G PCS/MAC, unsupported signals when asserted do not cause harm but have no effect, except tx_eof_no_crc_net which acts like tx_eof_net.
The provided timing diagrams clarify edge cases and ambiguities related to packet reception in the SmartNIC development kit.
The diagram illustrates a typical packet reception starting with the sof signal activating and ending with eof. Notably, the vld signal may deactivate at any point, as demonstrated when it drops mid-packet rendering the data invalid during that cycle. Additionally, the len signal is only relevant when eof is asserted. The timing of the crc_fail signal's validity varies based on whether the FDK is configured with the "Extra CRC Reg" option by Cisco; in the demonstrated scenario with this option enabled, crc_fail activates two cycles post-eof.
The timing diagrams provided illustrate packet transmission dynamics, showing how a typical packet is transmitted, akin to the reception process but with the valid signal replaced by ack. At the start of transmission, both sof and the first data word must remain constant until the ack signal is confirmed high on a rising edge. Note that ack may drop unexpectedly during packet transmission, requiring all signals to maintain their current values in the subsequent cycle. Additionally, ack can be high even when a packet is not being actively transmitted, as shown in the first cycle of the waveform. In this instance, ack being high indicates that the MAC is prepared to initiate packet transmission immediately. However, this readiness does not guarantee readiness in subsequent cycles unless the user firmware asserts sof on the current cycle, triggering the start of packet transmission.
The diagrams depict common pitfalls and corner cases in packet transmission, specifically highlighting a scenario where the ack signal is deasserted at the end of a frame. In such cases, all related signals, including eof, len, and ack, must remain constant while ack is deasserted to ensure proper handling and stability in the transmission process.
The diagram shows packet transmission where the last chunk size is zero. When eof is asserted with no valid bytes on the last cycle, it indicates that the packet actually ended in the previous cycle, and no valid data bytes are necessary in the current cycle.
The diagram illustrates the transmission of back-to-back packets, showing that once the end of a frame is acknowledged, the ack signal is deasserted and remains so for several cycles. This de-assertion by the MAC prevents the user from violating Ethernet’s interpacket gap requirements.
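The hold-until-ack rule described for transmission can be modeled cycle by cycle. The sketch below is a behavioral model built only from the description above: the sender presents a word with its sof/eof flags and advances to the next word only on a cycle where ack is high, otherwise it holds every signal unchanged. The function name and trace format are assumptions, and the interpacket-gap behavior of ack is not modeled.

```python
def drive_frame(words, ack_per_cycle):
    """Simulate a sender obeying the hold-until-ack rule.

    Returns (trace, done), where trace holds one (sof, eof, word, ack) tuple
    per simulated cycle and done says whether the whole frame was accepted.
    """
    trace = []
    idx = 0
    for ack in ack_per_cycle:
        if idx >= len(words):
            break
        sof = idx == 0
        eof = idx == len(words) - 1
        trace.append((sof, eof, words[idx], bool(ack)))
        if ack:
            idx += 1  # word accepted, move on to the next one
        # otherwise hold sof, eof and data unchanged for the next cycle
    return trace, idx == len(words)


if __name__ == "__main__":
    # ack is low on the first cycle and drops again on the final word.
    trace, done = drive_frame([0xA, 0xB, 0xC], [0, 1, 1, 0, 1])
    for cycle, (sof, eof, word, ack) in enumerate(trace):
        print(f"cycle {cycle}: sof={int(sof)} eof={int(eof)} data={word:#x} ack={int(ack)}")
    print("frame accepted:", done)
```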
The MAC interface in our development kit, while similar to the familiar AXI stream used by many FPGA engineers, incorporates specific differences for the sake of efficiency.
RX Side:
TX Side:
The host-side interface of the SmartNIC development kit facilitates bidirectional packet forwarding between the host software and the network. This interface mirrors network-side semantics, allowing straightforward connections between host and network data paths, such as connecting rx_*_net to rx_*_host for incoming packets and tx_*_host to tx_*_net for outgoing packets. This setup can function as a basic network interface or be enhanced with more complex interactions.
Host Receive Signals:
Host Transmit Signals:
Additional Features:
These signals are vital for managing data flow and ensuring accurate packet handling between the host and the network, providing a robust platform for advanced networking applications within the SmartNIC framework.
The DDR interface provides read and write access to DRAM installed on the card. Note that not all SmartNICs have DRAM installed:
The DDR interface on the SmartNIC includes three categories of signals: reading, writing, and shared signals between both operations. Here is a concise overview:
Shared Signals:
Read-Specific Signals:
Write-Specific Signals:
These signals facilitate robust DDR memory operations, allowing for efficient data writing and reading processes in the SmartNIC's architecture.
The user application in the SmartNIC development kit interacts with host software by accessing register and memory spaces, and by modifying or tagging packets before they are transferred to the host. This is facilitated by libexanic functions:
These address spaces are customizable according to the specifics of the user's FPGA application. An example from the trigger_example.v in the FDK shows how register reads are handled:
`/* Register reads. */
always @ (posedge clk_host) begin
    /* Acknowledge each read on the cycle after the read enable. */
    reg_r_ack <= reg_r_en;
    case (reg_r_addr)
        'h0: reg_r_data <= FIRMWARE_ID;
        'h1: reg_r_data <= VERSION;
        'h2: reg_r_data <= armed;
        'h3: reg_r_data <= match_length;
. . .`
Reading from register 0 outputs the value 0xEB000001, demonstrated by the command:
$ ./exanic-devkit-register-read exanic0 0
0x000: 0xEB000001 (-352321535)
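The value in parentheses is the same 32-bit register word interpreted as a signed two's-complement integer. A minimal sketch of that conversion (the function name is illustrative):

```python
def to_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit register value as two's complement."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    firmware_id = 0xEB000001
    print(f"{firmware_id:#010X} ({to_signed32(firmware_id)})")
```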
Additionally, the application can send dummy Ethernet frames with custom ethertypes to the host for advanced data communication. These frames are DMA transferred and processed by the host using libexanic, including any user-defined data. This setup allows for robust and flexible interaction between the FPGA and host system software.
The SmartNIC driver package features exasock extensions, enhancing applications with the ability to access the next set of TCP headers for specific sockets. In conjunction with the development kit, these functions empower the host to handle TCP state management through kernel sockets bypassed transparently by exasock, enabling the SmartNIC to deliver rapid responses to predefined user events.
An example, exasock-tcp-responder-example.c, demonstrates using these capabilities with the trigger example firmware. This example illustrates how standard UNIX socket calls can establish a TCP connection to a server, and how the SmartNIC can send a TCP reply following the reception of a UDP packet.
The SmartNIC FPGA development kit ships with source code for IP cores that are useful for performing common tasks.
The field extract core can be used to extract an arbitrary length field from received frames. To use the core, instantiate it by specifying the parameters BYTES (the byte width of the field to extract) and OFFSET (the offset in bytes of the field in the frame, measured from the start of the frame). Examples of using this core are shown in the ping and flow steering example applications.
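A software model can help when choosing BYTES and OFFSET values. The sketch below mirrors the two parameters on a frame held as a byte string; it is a reference model only and says nothing about the core's ports or timing. The example frame contents are made up, and the EtherType location (offset 12, 2 bytes) is standard for an untagged Ethernet frame.

```python
def extract_field(frame: bytes, offset: int, nbytes: int) -> bytes:
    """Return nbytes of the frame starting at offset (from the start of frame)."""
    if offset < 0 or nbytes <= 0:
        raise ValueError("offset must be >= 0 and nbytes must be > 0")
    if offset + nbytes > len(frame):
        raise ValueError("field extends past the end of the frame")
    return frame[offset:offset + nbytes]


if __name__ == "__main__":
    # Made-up frame: broadcast destination, arbitrary source, IPv4 EtherType.
    frame = bytes.fromhex("ffffffffffff" "020000000001" "0800") + bytes(46)
    print(extract_field(frame, offset=12, nbytes=2).hex())  # EtherType
    print(extract_field(frame, offset=0, nbytes=6).hex())   # destination MAC
```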
The frame mux core provides a way to share a single frame output interface (for example, rx_host or tx_usr) between two sources of frames. It provides buffering so that interfaces that cannot be 'stalled', such as the receive interface, can be arbitrated without loss of data. The frame mux also allows two ports to be 'bridged' together, much like the SmartNIC bridging functionality. As an example, the frame mux can be used to connect port 0 receive to port 1 transmit, whilst also allowing the host to transmit via port 1. In this mode of operation, the frame mux has an optional FCS removal mode.
The frame mux core has the following parameters:
The valid/ack bus mux core provides the same functionality as the frame mux core but without any buffering or registering delays. This is useful where latency is important. This use case is shown in both the trigger and ping examples.
The custom framegen core generates a custom broadcast Ethernet frame that contains 4 QWORDs set by inputs to the module. An example of this is shown in the ping example application, where the custom framegen core is used to send timestamps to the host. The CUSTOM_ETHERTYPE parameter to the module allows the user to specify the ethertype of the frame.
The asynchronous FIFO provides fast clock domain crossing between two domains. Data is written into the FIFO synchronous to clk_write and data is read from the FIFO synchronous to clk_read.
The flag sync module is used to cross a single bit flag between two asynchronous clock domains. The flag should be asserted for a single cycle in the input clock domain. Note that this module assumes that the flag will be asserted relatively infrequently in the input clock domain.
The asymmetric memories provide block ram backed 256-bit write and 32-bit or 64-bit read capability. They are intended for designs where packet data is received from the 256-bit PCI memory write interface and sent out one of the network interfaces.
The stream pipeline module is used to break up long timing paths that stream data with valid and ack signals. It is particularly useful when transferring data between Ethernet and PCI which, on the SmartNIC+ V5P, are at opposite ends of the chip.
These modules can be used to convert between the streaming interfaces of the MAC and DMA engines, which have different data widths.
Additional cores are shared with customers who purchase the FDK-Pro. These cores are described in the section “Additional Codes Exposed in FDK-Pro”.
The full source code is provided for all the example applications described in this section. In all the following examples a convention is used whereby register zero (0) in the development kit register address space reports a 'firmware ID'. This firmware ID is read by the software side of the example to verify that the correct firmware is running on the SmartNIC.
The trigger example application in the development kit allows for the preloading of a card with a pattern, mask, and reply frame. It matches incoming frames on port 0 against the pattern and mask, and if a match is detected, it sends a predefined reply frame. This setup is useful for building advanced custom logic applications.
All source code for this application is included in the src/trigger_example directory of the development kit package. The files include:
The software application libexanic-responder-example demonstrates preloading frames using a low-level API, available under examples/devkit in the SmartNIC driver package. The libexanic application can be started using:
$ ./libexanic-responder-example exanic0
The software application exasock-tcp-responder-example integrates TCP state with FPGA logic, responding to UDP packets with a TCP 'hello world' packet. The exasock application can be started using:
$ exasock ./exasock-tcp-responder-example <udp-port> <tcp-addr> <tcp-port>
Note that the example application is only implemented on the FPGA for port 0, and all other ports operate as normal network interfaces.
The devkit facilitates flow steering based on Ethernet frame fields using the libexanic API, which allocates DMA buffers with unique IDs. These IDs enable frame steering to specific buffers for applications like market data filtering. Included in the devkit is a flow steering example demonstrating IP packet steering to a designated buffer, ensuring only packets for a specified IP address are received. This setup can be customized for various applications.
The rx_buffer_host port selects the host receive buffer for incoming frames. The buffer ID must be set by the 15th data beat or at frame end, whichever is first, and remain constant through EOF+2 cycles. Additionally, setting all rx_buffer_host bits to 1 will drop the frame before it reaches any buffers. This setting should also occur by the 15th data beat or frame end. This example can be tested using filter.c or custom_filter_example. Usage for custom filter example:
./exanic-software/src/examples/devkit/custom-filter-example <device>:<port_number> <dst_ip> [expected_matches]
The usage for filter.c is as follows:
./exanic-software/testing/atf/src/filter <device>:<port> <dst_ip> <expected_num_matches>
The native loopback example also demonstrates the latency of the SmartNIC MAC layer but loops back the frames received from the RX datapath on port 0 back out of port 0. This includes a CDC to transfer data from the receive domain to the transmit domain and 3 cycles of buffering to prevent TX underrun issues.
The multi preload tx example in the devkit enables users to preload frames into FPGA memory and broadcast them simultaneously across multiple ports via a single register write.
Each port contains memory for 32 packets, 2048 bytes, and a separate metadata RAM that stores each packet's size as a 16-bit value. Packet buffers per port are independent, with each byte individually writable, allowing for updates to specific packet fields as needed.
To dispatch a packet, the software writes a 32-bit value to register address 0x0. This value has the format <24-bit port mask> <3 unused bits> <5-bit index>, where the index selects one of the 32 preformatted packets. Different packets can be sent from different ports simultaneously using one register write.
The port mask specifies the ports for packet dispatch, e.g., a mask of 0x3 sends packets through ports 0 and 1. This setup also functions as a standard NIC, supporting usual packet send/receive operations via DMA.
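The dispatch word described above can be assembled as in the following sketch. It assumes the fields are listed from most significant to least significant bit, so that the port mask occupies bits 31:8 and the index bits 4:0; that ordering is an assumption and should be checked against the example RTL. The function name is hypothetical.

```python
PORT_MASK_BITS = 24
INDEX_BITS = 5
PORT_MASK_SHIFT = 8   # assumed layout: mask in bits 31:8, 3 unused bits, index in bits 4:0


def pack_dispatch_word(port_mask: int, packet_index: int) -> int:
    """Build the 32-bit value written to register address 0x0 (assumed bit layout)."""
    if not 0 <= port_mask < (1 << PORT_MASK_BITS):
        raise ValueError("port mask must fit in 24 bits")
    if not 0 <= packet_index < (1 << INDEX_BITS):
        raise ValueError("packet index must be in the range 0..31")
    return (port_mask << PORT_MASK_SHIFT) | packet_index


# Send preloaded packet 5 from ports 0 and 1 (mask 0x3).
word = pack_dispatch_word(0x3, 5)
print(hex(word))  # 0x305 under the assumed layout
```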
Software for this design is available in the exanic-software repository under examples/devkit/exanic-devkit-multi-preload-tx-example.c.
While the regular libexanic TX API also supports packet precaching, this design's unique feature is its ability to simultaneously send packets through multiple ports. For single-port operations, using the libexanic API is recommended.
The bridging example demonstrates the use of the frame mux for bridging of two ports on the card. Bridging involves looping back any received data on one port to the transmit datapath on another port. Note that this example will not work when a different line rate is used on each side of the bridge as there is no buffering added. The packet sent through port 0 will be received back through port 1 of the device.
The soft responder example demonstrates the latency of the SmartNIC MAC layer. It does this by sending a packet out of port 0 as soon as the start of frame is seen on the RX datapath of port 0. Note that this demo logic just sends a small frame of all 0xFF's (plus CRC).
This is a minimal example of how to use the PCI register interface at BAR0. The memory interface at BAR2 is similar.
It is tested using the applications for read and write operations in exanic software.
For read operation: ./exanic-software/src/examples/devkit/exanic-devkit-register-read <device> <start_address> <end_address>
For write operation: ./exanic-software/src/examples/devkit/exanic-devkit-register-write <device> <start_address> <data>. This write will generate a packet out from port 0.
This is a simple packet generator for transmitting closely spaced frames of varying sizes. It can be configured by host software using the PCI register interface.
Software to drive the packet generator is available in the exanic-software repository at examples/devkit/spam-example.c. The usage for using the application is as follows:
$ ./exanic-software/src/examples/devkit/spam-example <device> [-c num-frames] [-s min-size] [-S max-size] [-d dst-mac] [-g inter-frame-gap] [-b num-bursts] [-G inter-burst-gap]
Example usage:
$ ./spam-example exanic1 -c 100 -s 60 -S 80 -g 0
This will send 100 frames with sizes 60 to 80 back-to-back.
Note that if the -c argument is not provided, it will send frames forever.
The ping example in the devkit executes a hardware-timestamped ICMP echo request using source and destination IP addresses. It begins by checking the ARP table for a corresponding MAC address. If absent, it sends an ARP request and waits for a reply to update the table. Subsequently, it dispatches an ICMP echo request with a hardware timestamp and awaits a reply. If responses are delayed beyond 1 second for ICMP or ARP requests, an error message is sent to the host.
Key functionalities demonstrated include:
To run the ping example, use: $ ./ping-example <device> <dst-ip> <src-ip>
This will send ARP and ICMP packets originating from src-ip to the host at dst-ip.
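For reference, the ICMP echo request that the example dispatches has a simple layout: type 8, code 0, a 16-bit checksum, an identifier, a sequence number, and an optional payload. The sketch below builds such a message on the host using the standard Internet checksum. It illustrates the packet format only and is not the devkit's FPGA implementation; the function names are hypothetical.

```python
import struct

ICMP_ECHO_REQUEST = 8


def internet_checksum(data: bytes) -> int:
    """One's complement sum of 16-bit words, as used by ICMP and IPv4."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                   # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_icmp_echo_request(ident: int, seq: int, payload: bytes = b"") -> bytes:
    """Return an ICMP echo request (ICMP header plus payload, no IP header)."""
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, ident, seq)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, ident, seq) + payload
```

A receiver can validate the message by running internet_checksum over the whole ICMP message; a correct message yields 0.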
This is an example of using early_sof to trigger a packet. It does this by sending a packet out of port 0 as soon as the start of frame is seen on the RX datapath of port 0.
This is a minimal example of using DDR4 via the MIG interface. The read, write, and reset operations on the DDR4 are carried out using the ddr4_example application provided in exanic-software.
Usage for testing this example is:
$ ./exanic-software/src/examples/devkit/ddr4_example <device> <wr|rd|rst>
This example mimics the behavior of a NIC. The MAC rx ports are connected to the host. The packets received through rx ports of MAC are sent to the host. For example, let's assume port 0 of testing device ‘device0’ is connected to port 0 of another device ‘device1’. If a packet is sent from port 0 of device1, the same packet will be received in capture at port 0 of device0.
This example is like the native NIC example, with the primary distinction being the RX host data width, which is 256 bits. The main objective of this example is to evaluate the FDK-Pro using the variant rxhostdatawidth256.
This is a simple example that rapidly sends packets of varying contents and lengths. The host spam example can be tested using the spam_example.c application provided in the exanic-software directory. The usage of spam_example.c is explained in the native_spam_example section.
The steer_256b_example is like the steer example; the primary distinction is the RX host data width, which is 256 bits. The main objective of this example is to evaluate the FDK-Pro using the variant rxhostdatawidth256. This example can be tested using filter.c or custom_filter_example as explained in the steer_example section.
This is a minimal example of how to use the additional register/memory spaces at BAR1 and BAR4. Software to drive this example is available in the exanic-software repository at examples/devkit/exanic-devkit-extra-bars-read.c and examples/devkit/exanic-devkit-extra-bars-write.c. For example, to read from offset 0 in BAR1 from a device at PCI address 01:00.0:
$ ./exanic-devkit-extra-bars-read /sys/bus/pci/devices/0000\:01\:00.0/resource1 0
The native trigger example triggers a packet out of port 1 when a packet is received at port 0. The first 32 bits of the packet received at port 0 must be 0xffffffff.
tcp_trigger_example is like native_trigger_example. It sends a TCP packet from port 1 on receiving an Ethernet frame on port 0 if the first four bytes of the destination MAC address are 0xffffffff. ATE must be instantiated on port 1; that is, bit 1 in the TCP_ENABLES bitmask must be set.
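The trigger condition used by both trigger examples can be expressed as a check on the first four bytes of the frame. The sketch below models that check and builds a minimal broadcast test frame that satisfies it. The helper names are hypothetical, and the EtherType 0x88B5 (an IEEE local experimental value) and the zero padding are choices made for illustration only.

```python
import struct

TRIGGER_PATTERN = b"\xff\xff\xff\xff"    # first 32 bits of the frame
BROADCAST_MAC = b"\xff" * 6
MIN_FRAME_LEN = 60                       # minimum Ethernet frame length without CRC


def is_trigger_frame(frame: bytes) -> bool:
    """True if the first four bytes (start of the destination MAC) are all 0xff."""
    return frame[:4] == TRIGGER_PATTERN


def build_trigger_frame(src_mac: bytes, ethertype: int = 0x88B5,
                        payload: bytes = b"") -> bytes:
    """Build a broadcast frame (no CRC) that meets the trigger condition."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    frame = BROADCAST_MAC + src_mac + struct.pack("!H", ethertype) + payload
    return frame.ljust(MIN_FRAME_LEN, b"\x00")   # pad short frames with zeros
```

Any broadcast frame meets the condition, so ordinary broadcast traffic arriving on port 0 (ARP requests, for example) would also fire the trigger.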
The SmartNIC development kit includes a complete model of all interfaces, located in the tb/ directory. The testbench files include:
To run the testbench:
$ ./start_sim.sh
This compiles the testbench and starts xsim in command-line mode. To run the simulation for 10 microseconds, enter at the xsim prompt:
% run 10us
Use the Xilinx Chipscope Pro Integrated Logic Analyzer (ILA) to debug FPGA designs through JTAG. Detailed instructions and further documentation are available on Xilinx's website. The standard setup in the chipscope_example includes a Chipscope core that allows you to monitor specific signals, with additional configuration options available such as debug_clk. Signals for probing must be marked with (* mark_debug = "true" *). Insertion of the ILA core into the design is handled by the debug.tcl script, which also specifies the capture clock.
Debugging can be performed via both local and remote JTAG connections.
Xilinx supports remote debugging through its Virtual Cable (XVC) protocol, allowing Vivado to connect to an FPGA via a server. Cisco offers a modified xvcServer utility for use with SmartNICs. To enable JTAG over XVC, include the JTAG=1 flag in the make command to incorporate necessary logic into the FDK. NOTE that the addition of this logic involves the MASTER_JTAG primitive, which disables the external JTAG interface. To regain external JTAG access, replace the FPGA image using exanic-fwupdate or the recovery button.
To operate the Cisco-modified xvcServer, which can be found in the examples/devkit directory of the exanic-software repository:
$ sudo ./exanic-xvcserver exanic0
Waiting for connection on port 2542 …
In Vivado, use the Hardware Manager or the open_hw Tcl command to manage hardware connections.
To start a hardware session in Vivado, either use the Tcl Console command connect_hw_server or select Open Hardware Manager from the Flow menu.
In the Tcl console, use the command:
open_hw_target -xvc_url 172.16.0.210:2542
This command connects to the machine running exanic-xvcserver. Once connected, exanic-xvcserver will acknowledge with "connection accepted," and the device (e.g., xcku035_0) should appear listed in Vivado.
If your design includes an Integrated Logic Analyzer (ILA) core, go to the Trigger Setup window, click on the link to configure debug probes, and select the .ltx probe file located in the outputs/ directory.
Note
Cisco has seen instances where MIGs and ILA cores are not listed under the xcku035_0 device alongside SysMon. If you expect to see a core and don't, right click on the xcku035 device and select refresh.
Warning
Users should not attempt to configure the FPGA using the XVC server, as this relies on the FPGA to be configured to handle the JTAG shift instructions.
NOTE that exanic-xvcserver detects whether the JTAG logic has been inserted into the design and will not attempt to connect to an exanic without it.
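Before connecting Vivado, it can be useful to confirm that the XVC server is reachable. The XVC 1.0 protocol defines a getinfo: command, to which a server replies with a line of the form xvcServer_v1.0:<max vector length>. The sketch below sends that command and parses the reply. It assumes the Cisco-modified server answers getinfo: as the protocol specifies, and the function names are hypothetical. Avoid running it while Vivado holds the connection.

```python
import socket
from typing import Tuple


def parse_xvc_info(reply: bytes) -> Tuple[str, int]:
    """Split an XVC getinfo reply into (version string, max vector length)."""
    text = reply.decode("ascii").strip()
    version, _, max_len = text.partition(":")
    if not version.startswith("xvcServer_v") or not max_len.isdigit():
        raise ValueError(f"unexpected XVC reply: {text!r}")
    return version, int(max_len)


def query_xvc_info(host: str, port: int = 2542, timeout: float = 2.0) -> Tuple[str, int]:
    """Connect to an XVC server, send getinfo:, and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"getinfo:")
        reply = b""
        while not reply.endswith(b"\n"):
            chunk = sock.recv(64)
            if not chunk:                # server closed the connection early
                break
            reply += chunk
    return parse_xvc_info(reply)
```

For example, query_xvc_info("172.16.0.210") targets the same address and port as the open_hw_target command above.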
The Xilinx Platform Cable can be used for connecting via JTAG to the SmartNIC K3P-S (X25) and SmartNIC K3P-Q (X100). Connect one end of the Platform Cable to the machine running Vivado, and the other end to the SmartNIC.
The SmartNIC+ V5P has 2 methods for connecting to the device with a local JTAG connection.
Users can plug the standard 14-pin JTAG cable from a Xilinx Platform Cable or equivalent JTAG pod into the SmartNIC+ V5P. Note that when this loom is inserted, connectivity from the USB JTAG circuitry to the FPGA is disabled.
Users can plug a USB cable into a connector on the PCIe bracket of the SmartNIC+ V5P to gain JTAG access to the FPGA. Note that the USB JTAG interface and the "14 way" interface cannot be used simultaneously. If a loom is connected between the 14-pin header and a Xilinx pod, the USB JTAG circuitry will be disconnected from the FPGA.
In Vivado, open the Hardware Manager. Start a Hardware Server session by entering connect_hw_server in the Tcl Console or selecting Open Hardware Manager from the Flow menu. Click Open Target and then AutoConnect; the FPGA (e.g., xcku035_0) should now be visible. In the ILA core's Trigger Setup window, specify debug probes by selecting the .ltx probes file from the outputs/ directory.
To configure the FPGA, the preferred method is using the exanic-fwupdate utility, though JTAG configuration is also feasible. By default, the SmartNIC reconfigures the FPGA upon host reset. To prevent this and retain a JTAG-loaded image, disable the automatic reboot by building the image with the NOREBOOT=1 flag:
make PLATFORM=v5p TARGET=native_loopback_example VARIANT=full NOREBOOT=1
For JTAG connections, use the Xilinx Platform Cable. Do not configure the FPGA via the XVC server. To program the device, right-click on the Xilinx device in the software and select Program Device.
Vivado may encounter issues booting from configuration flash when a JTAG cable is connected and the HW Manager is open, potentially leaving the FPGA unprogrammed. According to the Vivado Programming and Debugging User Guide (UG908) v2019.2, configuration failures can occur on power up if the Hardware Manager's polling and recover feature interrupts the Master mode configuration. To prevent this, disable updates to the configuration status registers by setting the following parameter in the Vivado Hardware Manager Tcl console:
set_param xicom.allow_cfgin_commands false
This setting will stop automatic updates for all devices in the JTAG chain, resulting in outdated values for registers like REGISTER.CONFIG_STATUS and REGISTER.BOOT_STATUS. To view correct and updated register values post-bootup, re-enable the parameter and refresh the device:
set_param xicom.allow_cfgin_commands true
refresh_hw_device [get_hw_devices <device_name>]
Alternatively, right-click the device in Vivado HW Manager and select Refresh Device.
Cisco sets specific build options for the FDK-Pro at build time using "netlist variants," which are .edn netlists located in the src/ directory for each variant. Users select a variant by setting the variant flag during the make process. For example:
$ make PLATFORM=v5p TARGET=native_loopback_example VARIANT=full_fastmac
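When builds are scripted, the make invocation can be assembled from its parts. The sketch below uses only the variables that appear in this document (PLATFORM, TARGET, VARIANT, JTAG, NOREBOOT). The helper name is hypothetical, and the sketch does not check that a given platform, target, and variant combination exists.

```python
import shlex
from typing import List


def build_make_command(platform: str, target: str, variant: str,
                       jtag: bool = False, noreboot: bool = False) -> List[str]:
    """Return the argument list for an FDK-Pro make invocation."""
    args = ["make", f"PLATFORM={platform}", f"TARGET={target}", f"VARIANT={variant}"]
    if jtag:
        args.append("JTAG=1")        # adds the XVC JTAG logic to the design
    if noreboot:
        args.append("NOREBOOT=1")    # keeps a JTAG-loaded image across host reset
    return args


cmd = build_make_command("v5p", "native_loopback_example", "full_fastmac")
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run from the FDK-Pro project directory.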
Key netlist variants include:
Note: "full" refers to license validation; the netlist requires a full license. All other substrings of the netlist name (excluding the substring referring to the platform itself) refer to features of the PCS logic.
Note
Currently 100M/1G is only enabled for the first lane of each physical connector on a SmartNIC, when using a multirate variant. This means that 100M/1G support will only be enabled on port 0 and 4 of the SmartNIC+ V5P and SmartNIC K3P-Q (X100). The SmartNIC K3P-S (X25) can support 100M/1G on both physical ports if using a multirate variant.
Ports which do not have 100M/1G enabled can only operate at 10G.
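The port rule above can be captured in a small lookup, which is convenient when host software needs to validate a requested line rate. This is a sketch assuming a multirate variant is in use. The platform keys "x25" and "x100" are hypothetical labels ("v5p" matches the PLATFORM value used in the make examples), and the two physical ports of the K3P-S are assumed to be numbered 0 and 1.

```python
from typing import List

# Ports with 100M/1G enabled when a multirate variant is loaded.
LOW_SPEED_PORTS = {
    "v5p": {0, 4},    # SmartNIC+ V5P
    "x100": {0, 4},   # SmartNIC K3P-Q (X100)
    "x25": {0, 1},    # SmartNIC K3P-S (X25), assumed port numbering
}


def supported_rates(platform: str, port: int) -> List[str]:
    """Return the line rates a port can run at with a multirate variant."""
    if platform not in LOW_SPEED_PORTS:
        raise ValueError(f"unknown platform: {platform}")
    if port in LOW_SPEED_PORTS[platform]:
        return ["100M", "1G", "10G"]
    return ["10G"]


print(supported_rates("v5p", 4))  # ['100M', '1G', '10G']
print(supported_rates("v5p", 1))  # ['10G']
```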
All SmartNICs include a recovery flash image for rectifying corrupt flash situations. To enter recovery mode, hold the 'recovery' button on the card during host system reboot. Recovery mode is indicated by alternating amber link lights on the SmartNIC+ V5P. Use the exanic-fwupdate utility to overwrite the corrupt image in this mode.
If physical access to a SmartNIC is unavailable for standard recovery, remote recovery via Vivado and JTAG is possible under these conditions:
Remote Recovery Steps:
Following these steps will enable remote recovery of the SmartNIC through Vivado.
NIC | Flash Part Number | Config |
---|---|---|
Cisco Nexus SmartNIC K3P-S (formerly X25) | S29GL256P11FFIV20 | NOR BPI x16 |
Cisco Nexus SmartNIC K3P-Q (formerly X100) | MT25QU128ABA1EW7-0SIT | QSPI |
Cisco Nexus SmartNIC V5P | MT28EW01GABA1LPC-0SIT | NOR BPI x16 |
Platforms which are currently supported: