Troubleshooting APIC Crash Scenarios
Cluster Troubleshooting Scenarios
Problem | Solution |
---|---|
An APIC node fails within the cluster. For example, node 2 of a cluster of 5 APICs fails. | There are two available solutions: |
A new APIC connects to the fabric and loses connection to a leaf switch. | Use the following commands to check for an infra (infrastructure) VLAN mismatch: If the output of these commands shows different VLANs, the new APIC is not configured with the correct infra VLAN. To correct this issue, follow these steps: |
Two APICs cannot communicate after a reboot. | The issue can occur after the following sequence of events: In this scenario, APIC1a discovers APIC2, but APIC2 is unavailable because it is in a cluster with APIC1, which appears to be offline. As a result, APIC1a does not accept messages from APIC2. To resolve the issue, decommission APIC1 on APIC2, and then commission APIC1 again. |
A decommissioned APIC joins a cluster. | The issue can occur after the following sequence of events: To resolve the issue, decommission the APIC after the cluster recovers. |
Mismatched ChassisID following a reboot. | The issue occurs when an APIC boots with a ChassisID different from the ChassisID registered in the cluster. As a result, messages from this APIC are discarded. To resolve the issue, ensure that you decommission the APIC before rebooting it. |
The APIC displays faults during changes to cluster size. | A variety of conditions can prevent a cluster from extending the OperationalClusterSize to meet the AdministrativeClusterSize. For more information, inspect the fault and review the "Cluster Faults" section in the Cisco APIC Basic Configuration Guide. |
An APIC is unable to join a cluster. | The issue occurs when two APICs are configured with the same ClusterID while a cluster expands. As a result, one of the two APICs cannot join the cluster and displays an expansion-contender-chassis-id-mismatch fault. To resolve the issue, configure the APIC outside the cluster with a new cluster ID. |
An APIC is unreachable in the cluster. | Check the following settings to diagnose the issue: |
The cluster does not expand. | The issue occurs under the following circumstances: |
An APIC is down. | |
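For many of these scenarios, a useful first step is to inspect the cluster state from the APIC CLI. The following is a minimal sketch, assuming admin CLI access to any APIC in the cluster; the exact output format varies by release:
apic1# acidiag avread
apic1# acidiag fnvread
The acidiag avread command reports the appliance vector (each APIC's ID, chassis UUID, and health as seen by the cluster), while acidiag fnvread lists the registered fabric nodes and their state, which helps confirm which controllers and switches the cluster can currently see.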
Cluster Faults
The APIC supports a variety of faults to help diagnose cluster problems. The following sections describe the two major cluster fault types.
Discard Faults
The APIC discards cluster messages that are not from a current cluster peer or cluster expansion candidate. If the APIC discards a message, it raises a fault that contains the originating APIC's serial number, cluster ID, and a timestamp. The following table summarizes the faults for discarded messages:
Fault | Meaning |
---|---|
expansion-contender-chassis-id-mismatch | The ChassisID of the transmitting APIC does not match the ChassisID learned by the cluster for expansion. |
expansion-contender-fabric-domain-mismatch | The FabricID of the transmitting APIC does not match the FabricID learned by the cluster for expansion. |
expansion-contender-id-is-not-next-to-oper-cluster-size | The transmitting APIC has an inappropriate cluster ID for expansion. The value should be one greater than the current OperationalClusterSize. |
expansion-contender-message-is-not-heartbeat | The transmitting APIC does not transmit continuous heartbeat messages. |
fabric-domain-mismatch | The FabricID of the transmitting APIC does not match the FabricID of the cluster. |
operational-cluster-size-distance-cannot-be-bridged | The transmitting APIC has an OperationalClusterSize that is different from that of the receiving APIC by more than 1. The receiving APIC rejects the request. |
source-chassis-id-mismatch | The ChassisID of the transmitting APIC does not match the ChassisID registered with the cluster. |
source-cluster-id-illegal | The transmitting APIC has a clusterID value that is not permitted. |
source-has-mismatched-target-chassis-id | The target ChassisID of the transmitting APIC does not match the ChassisID of the receiving APIC. |
source-id-is-outside-operational-cluster-size | The transmitting APIC has a cluster ID that is outside of the OperationalClusterSize for the cluster. |
source-is-not-commissioned | The transmitting APIC has a cluster ID that is currently decommissioned in the cluster. |
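When one of these discard faults is suspected, the raised fault instances can be searched directly from the APIC CLI. This is a rough sketch, assuming admin CLI access; the grep pattern and window are illustrative only:
apic1# moquery -c faultInst | grep -B 5 -A 10 'chassis-id-mismatch'
The matching fault instances carry the originating APIC's serial number, cluster ID, and timestamp described above.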
Cluster Change Faults
The following faults apply when there is an error during a change to the APIC cluster size.
Fault | Meaning |
---|---|
cluster-is-stuck-at-size-2 | This fault is issued if the OperationalClusterSize remains at 2 for an extended period. To resolve the issue, restore the cluster target size. |
most-right-appliance-remains-commissioned | The last APIC within a cluster is still in service, which prevents the cluster from shrinking. |
no-expansion-contender | The cluster cannot detect an APIC with a higher cluster ID, preventing the cluster from expanding. |
service-down-on-appliance-carrying-replica-related-to-relocation | The data subset to be relocated has a copy on a service that is experiencing a failure. Indicates that there are multiple such failures on the APIC. |
unavailable-appliance-carrying-replica-related-to-relocation | The data subset to be relocated has a copy on an unavailable APIC. To resolve the fault, restore the unavailable APIC. |
unhealthy-replica-related-to-relocation | The data subset to be relocated has a copy on an APIC that is not healthy. To resolve the fault, determine the root cause of the failure. |
APIC Unavailable
The following cluster faults can apply when an APIC is unavailable:
Fault | Meaning |
---|---|
fltInfraReplicaReplicaState | The cluster is unable to bring up a data subset. |
fltInfraReplicaDatabaseState | Indicates a corruption in the data store service. |
fltInfraServiceHealth | Indicates that a data subset is not fully functional. |
fltInfraWiNodeHealth | Indicates that an APIC is not fully functional. |
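For the replica-related faults in the preceding tables, the state of the data subsets can be checked from the APIC CLI. A minimal sketch, assuming admin CLI access; the output format varies by release:
apic1# acidiag rvread
The output summarizes the replica state per service and data subset, which helps identify the unavailable or unhealthy APIC referenced by the fault.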
Troubleshooting Fabric Node and Process Crash
The ACI switch node has numerous processes that control various functional aspects of the system. If the system has a software failure in a particular process, a core file will be generated and the process will be reloaded.
If the process is a Data Management Engine (DME) process, the DME process will restart automatically. If the process is a non-DME process, it will not restart automatically and the switch will reboot to recover.
This section presents an overview of the various processes, how to detect that a process has cored, and what actions should be taken when this occurs.
DME Processes
The essential processes running on the switch can be found through the CLI. Unlike the APIC, the process listing that can be seen via the GUI shows all processes running on the leaf. Through the CLI, the DME processes can be listed with ps -ef | grep svc_ifc:
rtp_leaf1# ps -ef |grep svc_ifc
root 3990 3087 1 Oct13 ? 00:43:36 /isan/bin/svc_ifc_policyelem --x
root 4039 3087 1 Oct13 ? 00:42:00 /isan/bin/svc_ifc_eventmgr --x
root 4261 3087 1 Oct13 ? 00:40:05 /isan/bin/svc_ifc_opflexelem --x -v
dptcp:8000
root 4271 3087 1 Oct13 ? 00:44:21 /isan/bin/svc_ifc_observerelem --x
root 4277 3087 1 Oct13 ? 00:40:42 /isan/bin/svc_ifc_dbgrelem --x
root 4279 3087 1 Oct13 ? 00:41:02 /isan/bin/svc_ifc_confelem --x
rtp_leaf1#
Each of the processes running on the switch writes activity to a log file on the system. These log files are bundled as part of the techsupport file but can also be found via CLI access in the /tmp/logs/ directory. For example, the Policy Element process log output is written into /tmp/logs/svc_ifc_policyelem.log.
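For example, to follow the Policy Element log in real time on the switch (a simple sketch; substitute the log file of the process being investigated):
rtp_leaf1# tail -f /tmp/logs/svc_ifc_policyelem.log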
The following is a brief description of the DME processes running on the system. This can help in understanding which log files to reference when troubleshooting a particular process, or in understanding the impact to the system if a process crashes:
Process | Function |
---|---|
policyelem | Policy Element: processes the logical MOs from the APIC and pushes the concrete model to the switch |
eventmgr | Event Manager: processes local faults, events, and health score |
opflexelem | Opflex Element: OpFlex server on the switch |
observerelem | Observer Element: processes local stats sent to the APIC |
dbgrelem | Debugger Element: core handler |
nginx | Web server handling traffic between the switch and the APIC |
Identify When a Process Crashes
When a process crashes and a core file is generated, a fault as well as an event is generated. The fault for the particular process appears as a "process-crash", as shown in this syslog output from the APIC:
Oct 16 03:54:35 apic3 %LOG_LOCAL7-3-SYSTEM_MSG [E4208395][process-crash][major]
[subj-[dbgs/cores/node-102-card-1-svc-policyelem-ts-2014-10-16T03:54:55.000+00:00]/
rec-12884905092]Process policyelem cored
When the process on the switch crashes, the core file is compressed and copied to the APIC. The syslog message notification comes from the APIC.
The fault that is generated when the process crashes is cleared when the process is restarted. The fault can be viewed via the GUI in the fabric history tab.
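Beyond syslog, the same condition can be located by querying fault instances from the APIC CLI. A rough sketch, assuming admin CLI access; the grep window is illustrative only:
apic1# moquery -c faultInst | grep -B 5 -A 10 'process-crash'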
Collecting the Core Files
The APIC GUI provides a central location to collect the core files for the fabric nodes.
An export policy can be created in the GUI. However, there is a default core policy where files can be downloaded directly.
The core files can be accessed via SSH/SCP through the APIC at /data/techsupport on the APIC where the core file is located. Note that the core file will be available at /data/techsupport on only one APIC in the cluster; the exact APIC on which the core file resides can be found from the Export Location path shown in the GUI. For example, if the Export Location begins with "files/3/", the file is located on node 3 (APIC3).
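For example, once the owning APIC has been identified from the Export Location, the core file can be copied off with standard scp. This is a sketch with placeholder names; substitute the real APIC address and core file name:
client$ scp admin@<apic3-address>:/data/techsupport/<core-file-name> .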
Fabric Process Crash Verification and Restart
Symptom 1
A process on the fabric switch crashes. Either the process restarts automatically or the switch reloads to recover.
- Verification:
As indicated in the overview section, if a DME process crashes, it should restart automatically without the switch restarting. If a non-DME process crashes, the process will not automatically restart and the switch will reboot to recover.
Depending on which process crashes, the impact of the process core will vary.
When a non-DME process crashes, this will typically lead to a HAP reset, as seen on the console:
[ 1130.593388] nvram_klm wrote rr=16 rr_str=ntp hap reset to nvram
[ 1130.599990] obfl_klm writing reset reason 16, ntp hap reset
[ 1130.612558] Collected 8 ext4 filesystems
- Check Process Log:
The process that crashed should have some level of log output prior to the crash. The output of the logs on the switch is written into the /tmp/logs directory. The process name is part of the file name. For example, for the Policy Element process, the file is svc_ifc_policyelem.log:
rtp_leaf2# ls -l |grep policyelem
-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log
-rw-r--r-- 1 root root 1413246 Oct 14 22:10 svc_ifc_policyelem.log.1.gz
-rw-r--r-- 1 root root 1276434 Oct 14 22:15 svc_ifc_policyelem.log.2.gz
-rw-r--r-- 1 root root 1588816 Oct 14 23:12 svc_ifc_policyelem.log.3.gz
-rw-r--r-- 1 root root 2124876 Oct 15 14:34 svc_ifc_policyelem.log.4.gz
-rw-r--r-- 1 root root 1354160 Oct 15 22:30 svc_ifc_policyelem.log.5.gz
-rw-r--r-- 2 root root 13767569 Oct 16 00:37 svc_ifc_policyelem.log.6
-rw-rw-rw- 1 root root 2 Oct 14 22:06 svc_ifc_policyelem.log.PRESERVED
-rw-rw-rw- 1 root root 209 Oct 14 22:06 svc_ifc_policyelem.log.stderr
rtp_leaf2#
There will be several files for each process located in /tmp/logs. As the log file increases in size, it is compressed and older log files are rotated out. Check the core file creation time (shown in the GUI and in the core file name) to understand where to look in the file. Also, when the process first attempts to come back up, there will be an entry in the log file that indicates "Process is restarting after a crash", which can be used to search backwards for what happened prior to the crash.
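A quick way to locate that restart marker across the current and rotated logs is shown below; this is a sketch, and zgrep also reads the compressed .log.N.gz files:
rtp_leaf2# zgrep -n 'Process is restarting after a crash' /tmp/logs/svc_ifc_policyelem.log*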
- Check Activity:
A process that has been running crashes because of some change. In many cases the change may have been configuration activity on the system. The activity that occurred on the system can be found in the audit log history.
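One way to review recent configuration activity is to query the audit log records from the APIC CLI. A sketch, assuming admin CLI access; aaaModLR is the audit log record class, and the output can be long, so it is piped through a pager here:
apic1# moquery -c aaaModLR | less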
- Contact TAC:
A process crash should not normally occur. To understand why it happened beyond the steps above, the core file needs to be decoded. At this point, the file should be collected and provided to the TAC for further processing.
Collect the core file (as described above) and open a case with the TAC.
Symptom 2
Fabric switch continuously reloads or is stuck at the BIOS loader prompt.
- Verification:
If a DME process crashes, it should restart automatically without the switch restarting. If a non-DME process crashes, the process will not automatically restart and the switch will reboot to recover. However, in either case, if the process continuously crashes, the switch may get into a continuous reload loop or end up at the BIOS loader prompt.
[ 1130.593388] nvram_klm wrote rr=16 rr_str=policyelem hap reset to nvram
[ 1130.599990] obfl_klm writing reset reason 16, policyelem hap reset
[ 1130.612558] Collected 8 ext4 filesystems
- Break the HAP Reset Loop:
The first step is to attempt to get the switch back into a state where further information can be collected.
If the switch is continuously rebooting, break into the BIOS loader prompt through the console by typing CTRL+C during the first part of the boot cycle.
Once the switch is at the loader prompt, enter the following commands:
- cmdline no_hap_reset
- boot
The cmdline command will prevent the switch from reloading when a HAP reset is called. The second command will boot the system. Note that the boot command is needed instead of a reload at the loader prompt, as a reload will remove the cmdline option that was entered.
Though the system should now remain up to allow better access to collect data, whatever process is crashing will impact the functionality of the switch.
- As with Symptom 1, follow the Check Process Log, Check Activity, and Contact TAC steps.
Troubleshooting an APIC Process Crash
The APIC has a series of Data Management Engine (DME) processes that control various functional aspects of the system. When the system has a software failure in a particular process, a core file will be generated and the process will be reloaded.
The following sections cover potential issues involving system process crashes or software failures, beginning with an overview of the various system processes, how to detect that a process has cored, and what actions should be taken when this occurs. The displays taken on a working, healthy system can then be used to identify processes that may have terminated abruptly.
DME Processes
The essential processes running on an APIC can be found either through the GUI or the CLI. Using the GUI, the running processes and their process IDs can be viewed for each controller. Using the CLI, the processes and their process IDs are found in the summary file at /aci/system/controllers/1/processes (for APIC1):
admin@RTP_Apic1:processes> cat summary
processes:
process-id process-name max-memory-allocated state
---------- ----------------- -------------------- -------------------
0 KERNEL 0 interruptible-sleep
331 dhcpd 108920832 interruptible-sleep
336 vmmmgr 334442496 interruptible-sleep
554 neo 398274560 interruptible-sleep
1034 ae 153690112 interruptible-sleep
1214 eventmgr 514793472 interruptible-sleep
2541 bootmgr 292020224 interruptible-sleep
4390 snoopy 28499968 interruptible-sleep
5832 scripthandler 254308352 interruptible-sleep
19204 dbgr 648941568 interruptible-sleep
21863 nginx 4312199168 interruptible-sleep
32192 appliancedirector 136732672 interruptible-sleep
32197 sshd 1228800 interruptible-sleep
32202 perfwatch 19345408 interruptible-sleep
32203 observer 724484096 interruptible-sleep
32205 lldpad 1200128 interruptible-sleep
32209 topomgr 280576000 interruptible-sleep
32210 xinetd 99258368 interruptible-sleep
32213 policymgr 673251328 interruptible-sleep
32215 reader 258940928 interruptible-sleep
32216 logwatch 266596352 interruptible-sleep
32218 idmgr 246824960 interruptible-sleep
32416 keyhole 15233024 interruptible-sleep
admin@apic1:processes>
Each of the processes running on the APIC writes to a log file on the system. These log files can be bundled as part of the APIC techsupport file but can also be observed through SSH shell access in /var/log/dme/log. For example, the Policy Manager process log output is written into /var/log/dme/log/svc_ifc_policymgr.bin.log.
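For instance, from the APIC shell the Policy Manager log can be located and followed as shown below (a sketch; exact log file names vary by process and software version):
apic1# ls /var/log/dme/log/ | grep policymgr
apic1# tail -f /var/log/dme/log/svc_ifc_policymgr.bin.log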
The following is a brief description of the processes running on the system. This can help in understanding which log files to reference when troubleshooting a particular process, or in understanding the impact to the system if a process crashes:
Process | Function |
---|---|
KERNEL | Linux kernel |
dhcpd | DHCP process running for the APIC to assign infra addresses |
vmmmgr | Handles processes between the APIC and hypervisors |
neo | Shell CLI interpreter |
ae | Handles the state and inventory of the local APIC appliance |
eventmgr | Handles all events and faults on the system |
bootmgr | Controls boot and firmware updates on fabric nodes |
snoopy | Shell CLI help, tab command completion |
scripthandler | Handles the L4-L7 device scripts and communication |
dbgr | Generates core files when a process crashes |
nginx | Web service handling GUI and REST API access |
appliancedirector | Handles formation and control of the APIC cluster |
sshd | Enables SSH access into the APIC |
perfwatch | Monitors Linux cgroup resource usage |
observer | Monitors the fabric system and data handling of state, stats, and health |
lldpad | LLDP agent |
topomgr | Maintains fabric topology and inventory |