Introduction
This document describes the DDF reload event that occurs when a Cyclic Redundancy Check (CRC) error is detected. The event is reported with the Simple Network Management Protocol (SNMP) trap DDFreload. The trap was introduced with the CRC Error Restart Notification for Operation and Maintenance feature.
Problem
The DDF Field-Programmable Gate Array (FPGA) is a DMA engine on the DPC and DPC2. DDF FPGAs are susceptible to CRC_ERROR events. The DDF FPGA driver decides whether the error can be recovered, based on how many times these errors occur and the rate at which they occur. When the driver decides that an error can be recovered, it notifies the application program that the error has occurred. The SNMP trap and the related log entries look like this:
Thu Apr 01 02:54:09 2021 Internal trap notification 1332 (DDFreload) card 3 ddf-dev DDF1
2021-Apr-01+02:54:09.277 card 3-cpu1: Bad dheader magic number. previous=0xf1234567 (p[12345678.123456] mcdma: MDF/DDF FPGA 3 ch6 acket addr: 0xf2
2021-Apr-01+02:54:09.327 card 3-cpu0: [12345678.123789] DF2 Complex-0 Program DDF2 CAF_DF1_PROG_ERR error detected on SAD1234567
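The count-and-rate decision described above can be illustrated with a small sliding-window check. This is only a sketch: the driver's real thresholds and algorithm are not given in this document, so the limits and the `CrcErrorTracker` class below are hypothetical.

```python
from collections import deque

# Hypothetical thresholds: the real driver limits are not stated in this document.
MAX_ERRORS_TOTAL = 5        # assumed cap on the total number of CRC errors
MAX_ERRORS_PER_WINDOW = 2   # assumed cap on errors inside one time window
WINDOW_SECONDS = 60.0       # assumed window length used to measure the rate


class CrcErrorTracker:
    """Tracks CRC error timestamps and classifies each new error."""

    def __init__(self):
        self.total = 0
        self.recent = deque()  # timestamps that fall inside the current window

    def record(self, now):
        """Record an error at time `now` (seconds); return True if recoverable."""
        self.total += 1
        self.recent.append(now)
        # Drop timestamps that have aged out of the window.
        while self.recent and now - self.recent[0] > WINDOW_SECONDS:
            self.recent.popleft()
        # Recoverable only while both the count and the rate stay within limits.
        return (self.total <= MAX_ERRORS_TOTAL
                and len(self.recent) <= MAX_ERRORS_PER_WINDOW)
```

With these assumed limits, a third error within the same 60 seconds is treated as not recoverable, as is a sixth error overall, however widely spaced.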
A sessmgr crash with an assertion failure on dhdr.bdh_magic can be observed, as shown here:
Assertion failure at midplane/libsn_midplane.c:1845
Function: sn_midplane_dma_buffer_packet_get()
Expression: packet->dhdr.bdh_magic == 0x1974
Proclet: sessmgr (f=87000,i=40)
Process: card=3 cpu=1 arch=X pid=40961 cpu=~18% argv0=sessmgr
Crash time: 2021-Apr-28+14:54:10 UTC
Recent errno: 11 Resource temporarily unavailable
Build_number: 76955
Stack (2680@0x0xffd28000):
[ffffe430/X] __kernel_vsyscall() sp=0xffd28378
[0d0d4c67/X] sn_assert() sp=0xffd283d8
[0d1cef88/X] sn_midplane_dma_buffer_packet_get() sp=0xffd28478
[06b85352/X] sessmgr_med_data_receive() sp=0xffd284f8
[0d15cca4/X] sn_epoll_run_events() sp=0xffd28548
[0d16979a/X] sn_loop_run() sp=0xffd289f8
[0ce5bc25/X] main() sp=0xffd28a68
Solution
The DDF reload usually fixes the CRC_ERROR issue, and no further action is required. In rare cases, subscriber impact is reported after the DDF reload; in such a case, a manual card migration fixes the traffic issue.
# card migrate from <affected card> to <standby card>
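As an aid for operators who collect traps in a log file, the affected card number can be taken from the DDFreload trap line and placed into the command above. The following is a minimal sketch. The regular expression is modeled only on the single trap line shown in this document, and other releases may format the line differently. The helper names and the standby card number are hypothetical. The code only builds the command string and does not run it.

```python
import re

# Pattern modeled on the one DDFreload trap line shown in this document;
# it is an assumption that other trap lines follow the same layout.
TRAP_RE = re.compile(
    r"Internal trap notification \d+ \(DDFreload\) "
    r"card (?P<card>\d+) ddf-dev (?P<dev>\S+)"
)


def parse_ddf_reload(line):
    """Return (card_number, ddf_device) for a DDFreload trap line, else None."""
    match = TRAP_RE.search(line)
    if match is None:
        return None
    return int(match.group("card")), match.group("dev")


def migrate_command(affected_card, standby_card):
    """Format the manual card migration command; nothing is executed."""
    return f"card migrate from {affected_card} to {standby_card}"


trap = ("Thu Apr 01 02:54:09 2021 Internal trap notification 1332 "
        "(DDFreload) card 3 ddf-dev DDF1")
card, dev = parse_ddf_reload(trap)
# Standby card 10 is a placeholder; use the standby card of the actual chassis.
print(migrate_command(card, 10))
```

Keeping the parsing and the command formatting in separate functions lets the operator review the generated command before entering it.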
In release 21.19 and later, an additional feature is introduced that monitors the internal pipeline of the FPGA and triggers recovery if any issues are detected after the DDF reload.