Introduction
This document describes ACI faults F3545 and F3544 and the steps you can take to mitigate them.
F3545: Policy CAM Programming for Contracts
Fault F3545 occurs when the switch fails to activate a contract rule (zoning rule) due to either a hardware or software programming failure. If the cause is that the policy Content Addressable Memory (CAM) is full and no more contracts can be deployed on the switch, a different set of contracts can be deployed after a reboot or upgrade.
As a result, services that worked before an upgrade or a clean reload of the switch can start to fail afterward.
Note that the same fault can occur for other reasons, such as an unsupported type of filter in the contract(s) instead of policy CAM usage. For instance, first-generation ACI switches support EtherType IP but not IPv4 or IPv6 in contract filters.
When this fault is present, check Operations > Capacity Dashboard > Leaf Capacity in the APIC GUI for policy CAM usage. You can also execute this command on the leaf in order to get the current policy count:
vsh_lc -c "show plat internal hal health-stats" | grep -A 7 "Policy stats"
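To turn the counters from this command into a utilization figure, the output can be post-processed off the switch. The following is a minimal Python sketch, not a Cisco tool. It assumes the output contains `name : integer` lines, and the counter names `policy_count` and `max_policy_count` as well as the sample values are assumptions used for illustration; check them against the output of your own leaf.

```python
import re


def parse_stats(text):
    """Collect 'name : integer' lines into a dict, ignoring everything else."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(\w+)\s*:\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def utilization(stats, used_key, max_key):
    """Return used/max as a percentage, or None if the maximum is missing or zero."""
    maximum = stats.get(max_key, 0)
    if maximum == 0:
        return None
    return 100.0 * stats.get(used_key, 0) / maximum


# Hypothetical sample shaped like the assumed output; the numbers are made up.
sample = """\
Policy stats:
=============
policy_count                  : 61440
max_policy_count              : 65536
"""

stats = parse_stats(sample)
pct = utilization(stats, "policy_count", "max_policy_count")
print(f"policy CAM utilization: {pct:.2f}%")
```

Returning `None` for a zero or missing maximum avoids a division error on platforms where a counter is not reported.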
You can also run moquery (moquery -c faultInst -f 'fault.Inst.code=="F3545"') on the CLI of any APIC in order to check whether these faults exist on the system. The faults are visible in the GUI as well.
Fault Example (F3545: Zoning Rule Programming Failure)
The next output shows an example of node 101 with programming failure for 266 contract rules (zoneRuleFailed). Although it also shows the programming failure of L3Out subnets (pfxRuleFailed) in the changeSet, a separate fault F3544 is raised for that.
apic1# moquery -c faultInst -f 'fault.Inst.code=="F3545"'
Total Objects shown: 1
# fault.Inst
code : F3545
ack : no
annotation :
cause : actrl-resource-unavailable
changeSet : pfxRuleFailed (New: 80), zoneRuleFailed (New: 266)
childAction :
created : 2020-02-26T01:01:49.256-05:00
delegated : no
descr : 266 number of Rules failed on leaf1
dn : topology/pod-1/node-101/sys/actrl/dbgStatsReport/fault-F3545
domain : infra
extMngdBy : undefined
highestSeverity : major
lastTransition : 2020-02-26T01:03:59.849-05:00
lc : raised
modTs : never
occur : 1
origSeverity : major
prevSeverity : major
rn : fault-F3545
rule : actrl-stats-report-zone-rule-prog-failed
severity : major
status :
subject : hwprog-failed
type : operational
uid :
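Output such as the example above can be converted into structured records when many faults need to be reviewed. The following Python sketch is one possible approach, assuming the text output has been saved from the APIC CLI; the function name `parse_moquery` is hypothetical and the sample is an abridged copy of the example output.

```python
import re


def parse_moquery(text):
    """Split moquery text output into one dict per managed object."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# "):
            # A line such as '# fault.Inst' starts a new object.
            current = {"_class": line[2:]}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so timestamps keep their colons.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return records


# Abridged copy of the F3545 example output.
sample = """\
Total Objects shown: 1
# fault.Inst
code : F3545
cause : actrl-resource-unavailable
created : 2020-02-26T01:01:49.256-05:00
descr : 266 number of Rules failed on leaf1
dn : topology/pod-1/node-101/sys/actrl/dbgStatsReport/fault-F3545
lc : raised
severity : major
"""

records = parse_moquery(sample)
for record in records:
    # The node ID is embedded in the dn as 'node-<id>'.
    node = re.search(r"/node-(\d+)/", record["dn"])
    node_id = node.group(1) if node else "unknown"
    print(record["code"], "on node", node_id, "-", record["descr"])
```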
F3544: L3Out Subnets Programming for Contracts
The fault F3544 occurs when the switch fails to activate an entry in order to map a prefix to pcTag due to either a hardware or software programming failure.
These entries are configured for L3Out subnets with the 'External Subnets for the External EPG' scope under an external EPG in an L3Out, and are used to map L3Out subnets to L3Out EPGs.
If the cause is exhausted LPM or host route capacity on the switch, the switch can activate a different set of entries after a reboot or upgrade.
As a result, services that worked before an upgrade or a clean reload of the switch can start to fail afterward.
When this fault is present, check Operations > Capacity Dashboard > Leaf Capacity in the APIC GUI for LPM and /32 or /128 route usage.
You can also execute this command on the leaf in order to get the current L3 statistics:
vsh_lc -c "show plat internal hal health-stats" | grep -A 55 "L3 stats"
You can also run moquery (moquery -c faultInst -f 'fault.Inst.code=="F3544"') on the CLI of any APIC in order to check whether these faults exist on the system. The faults are visible in the GUI as well.
Fault Example (F3544: L3Out Subnet Programming Failure)
The next output shows an example of node 101 with programming failure for 80 L3Out subnets with 'External Subnets for the External EPG' (pfxRuleFailed). Although it also shows the programming failure of contracts themselves (zoneRuleFailed) in the changeSet, a separate fault F3545 is raised for that.
apic1# moquery -c faultInst -f 'fault.Inst.code=="F3544"'
Total Objects shown: 1
# fault.Inst
code : F3544
ack : no
annotation :
cause : actrl-resource-unavailable
changeSet : pfxRuleFailed (New: 80), zoneRuleFailed (New: 266)
childAction :
created : 2020-02-26T01:01:49.246-05:00
delegated : no
descr : 80 number of Prefix failed on leaf1
dn : topology/pod-1/node-101/sys/actrl/dbgStatsReport/fault-F3544
domain : infra
extMngdBy : undefined
highestSeverity : major
lastTransition : 2020-02-26T01:03:59.849-05:00
lc : raised
modTs : never
occur : 1
origSeverity : major
prevSeverity : major
rn : fault-F3544
rule : actrl-stats-report-pre-fix-prog-failed
severity : major
status :
subject : hwprog-failed
type : operational
uid :
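The changeSet field in both fault examples carries the same two counters, and each counter belongs to a different fault. The Python sketch below separates them; the mapping of counter to fault code follows the descriptions in this document, while the function name and the optional handling of an `Old:` value are assumptions for illustration.

```python
import re

# Mapping taken from the text: each counter is reported by its own fault.
COUNTER_TO_FAULT = {"zoneRuleFailed": "F3545", "pfxRuleFailed": "F3544"}


def parse_change_set(change_set):
    """Return {counter name: new value} from a changeSet string."""
    # Accepts 'name (New: 80)' and, as an assumption, 'name (Old: 10, New: 80)'.
    pattern = r"(\w+) \((?:Old: [^,]*, )?New: (\d+)\)"
    return {name: int(value) for name, value in re.findall(pattern, change_set)}


# changeSet value copied from the fault examples.
counts = parse_change_set("pfxRuleFailed (New: 80), zoneRuleFailed (New: 266)")
for name, value in counts.items():
    fault = COUNTER_TO_FAULT.get(name, "unknown")
    print(f"{name}: {value} failed entries, reported by {fault}")
```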
How to address the faults?
Note: Do not reload the switch in this state.
1. Verify policy CAM and LPM usage on the Operations > Capacity Dashboard or with the leaf CLI command vsh_lc -c "show plat internal hal health-stats".
2. Check the Verified Scalability Guide for Cisco ACI (search for "Verified Scalability Guide ACI") for the supported limits of your version and platform.
3. Remove the unused contracts and filters applied to EPGs.
4. Collect an on-demand techsupport which includes the leaf switches for further analysis by TAC.
Pre-Upgrade Check
Both F3545 and F3544 are flagged by the pre-upgrade validator script, which warns about the impact if the faults are not addressed before the upgrade.
Details of the pre-upgrade script are documented here:
https://www.cisco.com/c/en/us/td/docs/dcn/aci/apic/all/apic-installation-aci-upgrade-downgrade/Cisco-APIC-Installation-ACI-Upgrade-Downgrade-Guide/m-pre-upgrade-checklists.html?bookSearch=true#Cisco_Concept.dita_1f674dd5-9ea2-4062-826b-f3c1550552dc.
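The intent of this check can be approximated locally before a reload or upgrade. The Python sketch below is not the Cisco pre-upgrade validator script; it is an illustration that assumes fault records are already available as dictionaries with `code`, `lc`, and `dn` keys, and that faults in the `retaining` lifecycle state have already cleared. The function name is hypothetical.

```python
import re

# Fault codes covered in this document.
BLOCKING_CODES = {"F3544", "F3545"}


def nodes_with_blocking_faults(faults):
    """Map node ID -> sorted fault codes for F3544/F3545 faults still active."""
    blocked = {}
    for fault in faults:
        if fault.get("code") not in BLOCKING_CODES:
            continue
        # Assumption: 'retaining' means the fault cleared and is kept for history.
        if fault.get("lc") == "retaining":
            continue
        match = re.search(r"/node-(\d+)/", fault.get("dn", ""))
        node = match.group(1) if match else "unknown"
        blocked.setdefault(node, set()).add(fault["code"])
    return {node: sorted(codes) for node, codes in blocked.items()}


# Records built from the two fault examples in this document.
faults = [
    {"code": "F3545", "lc": "raised",
     "dn": "topology/pod-1/node-101/sys/actrl/dbgStatsReport/fault-F3545"},
    {"code": "F3544", "lc": "raised",
     "dn": "topology/pod-1/node-101/sys/actrl/dbgStatsReport/fault-F3544"},
]

blocked = nodes_with_blocking_faults(faults)
for node, codes in blocked.items():
    print(f"node {node}: do not reload or upgrade, active faults {', '.join(codes)}")
```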