Introduction
This document describes how to fix split-brain problems in Cisco Adaptive Security Appliance (ASA) failover or Firepower Threat Defense (FTD) High Availability pairs.
Prerequisites
Requirements
Cisco recommends that you know how an ASA/FTD High Availability pair (failover) works. For background, refer to About Failover.
Components Used
This document is not restricted to specific software or hardware versions and applies to all supported ASA/FTD deployments in failover.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Conventions
For more information on document conventions, refer to the Cisco Technical Tips Conventions.
What is Split-Brain?
Split-brain is a scenario in which the units of an ASA/FTD HA pair cannot detect each other on the network, so both take the active role. Both units then use the same interface IP address and MAC address, which can cause severe inconsistencies in your network and result in loss of services.
To identify whether your HA pair is in split-brain, run the command show failover state on both units and check whether both units report Active.
An Example of a Split-Brain
Primary Unit:
ciscoasa1/act/pri# show failover state
               State          Last Failure Reason      Date/Time
This host  -   Primary
               Active         None
Other host -   Secondary
               Failed         Comm Failure             02:39:43 UTC Jan 10 2022
====Configuration State===
Sync Done - STANDBY
====Communication State==
Secondary unit:
ciscoasa2/act/sec# show failover state
               State          Last Failure Reason      Date/Time
This host  -   Secondary
               Active         None
Other host -   Primary
               Failed         Comm Failure             02:39:40 UTC Jan 10 2022
====Configuration State===
Sync Done
Sync Done - STANDBY
====Communication State==
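The check described above (both units reporting Active) can be sketched in Python. This is a minimal illustration, not a Cisco tool: the function names local_state and is_split_brain are hypothetical, and the parser assumes that the line after "This host" begins with the unit state, as in the outputs shown in this section.

```python
def local_state(show_failover_state: str) -> str:
    """Return the state reported for 'This host' (for example 'Active')."""
    lines = [ln.strip() for ln in show_failover_state.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        if ln.startswith("This host") and i + 1 < len(lines):
            # Assumption: the line after 'This host - <role>' starts with the state.
            return lines[i + 1].split()[0]
    raise ValueError("no 'This host' entry found")


def is_split_brain(primary_output: str, secondary_output: str) -> bool:
    """True when both units report themselves as Active."""
    return (local_state(primary_output) == "Active"
            and local_state(secondary_output) == "Active")


# Outputs copied from the example in this section (trimmed to the state table).
PRIMARY = """\
State Last Failure Reason Date/Time
This host - Primary
Active None
Other host - Secondary
Failed Comm Failure 02:39:43 UTC Jan 10 2022
"""
SECONDARY = """\
State Last Failure Reason Date/Time
This host - Secondary
Active None
Other host - Primary
Failed Comm Failure 02:39:40 UTC Jan 10 2022
"""

print(is_split_brain(PRIMARY, SECONDARY))
```

How the two outputs are collected (SSH, console, or a management system) is left out on purpose; the sketch only covers the comparison.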
Split-brain can cause an outage if the MAC addresses that the connected devices learn for the active IP addresses do not all belong to the same unit. For example, consider this network topology:
Lab Topology
Virtual MAC addresses (VMACs) have been assigned to the interfaces as shown. This makes the MAC address table easier to understand:
Inside (G0/2) : Active MAC - 00c1.1000.aaaa
Standby MAC - 00c1.1000.bbbb
Outside (G0/4) : Active MAC - 00c1.2000.aaaa
Standby MAC - 00c1.2000.bbbb
Note: If VMACs are not configured, the active unit always takes the MAC addresses of the primary unit interfaces, and the standby unit takes the MAC addresses of the secondary unit interfaces.
MAC Address Table on Switch when HA is healthy:
Switch#show mac address-table
          Mac Address Table
-------------------------------------------
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
 100    00c1.1000.aaaa    DYNAMIC     Gi1/0/5
 100    00c1.1000.bbbb    DYNAMIC     Gi1/0/1
 300    00c1.64bc.c508    DYNAMIC     Gi1/0/4
 300    00d7.8f38.8424    DYNAMIC     Gi1/0/8
 200    00c1.2000.aaaa    DYNAMIC     Gi1/0/7
 200    00c1.2000.bbbb    DYNAMIC     Gi1/0/3
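One way to spot an active MAC address that has moved to the other unit's switch port is to compare two snapshots of the MAC address table. The Python sketch below assumes the four-column layout shown above; the function names are hypothetical, and the second snapshot (LATER) is invented for illustration and is not taken from the lab.

```python
def parse_mac_table(output: str) -> dict:
    """Map (vlan, mac) -> port for each DYNAMIC entry."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly four columns and start with a VLAN number.
        if len(fields) == 4 and fields[0].isdigit() and fields[2] == "DYNAMIC":
            vlan, mac, _entry_type, port = fields
            table[(int(vlan), mac.lower())] = port
    return table


def moved_macs(before: str, after: str, watched: set) -> dict:
    """Return watched MACs whose port differs between the two snapshots."""
    old, new = parse_mac_table(before), parse_mac_table(after)
    moves = {}
    for key, old_port in old.items():
        new_port = new.get(key)
        if key[1] in watched and new_port is not None and new_port != old_port:
            moves[key[1]] = (old_port, new_port)
    return moves


# Healthy snapshot: entries from the table in this section.
HEALTHY = """\
Vlan Mac Address Type Ports
---- ----------- -------- -----
100 00c1.1000.aaaa DYNAMIC Gi1/0/5
100 00c1.1000.bbbb DYNAMIC Gi1/0/1
200 00c1.2000.aaaa DYNAMIC Gi1/0/7
200 00c1.2000.bbbb DYNAMIC Gi1/0/3
"""
# Hypothetical later snapshot: the inside active MAC now appears on Gi1/0/1.
LATER = """\
Vlan Mac Address Type Ports
---- ----------- -------- -----
100 00c1.1000.aaaa DYNAMIC Gi1/0/1
200 00c1.2000.aaaa DYNAMIC Gi1/0/7
"""

ACTIVE_MACS = {"00c1.1000.aaaa", "00c1.2000.aaaa"}
print(moved_macs(HEALTHY, LATER, ACTIVE_MACS))
```

A move reported here is only a hint. A normal failover also moves the active MAC addresses, so the result needs to be read together with the failover state of both units.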
If the failover link fails, the active unit is expected to stay active and the standby unit to remain standby. When a unit does not receive three consecutive HELLO messages on the failover link, it sends LANTEST messages on each data interface and on the failover link to check whether the peer is responsive. The action that the ASA takes depends on the response from the other unit.
Possible actions are:
- If the ASA receives a response on the failover link, then it does not fail over.
- If the ASA does not receive a response on the failover link, but it does receive a response on a data interface, then the unit does not fail over. The failover link is marked as failed. Restore the failover link as soon as possible, because the unit cannot fail over to the standby while the failover link is down.
- If the ASA does not receive a response on any interface, then the standby unit switches to active mode and classifies the other unit as failed. This leads to a split-brain scenario.
At this stage, all data interfaces on both firewalls act as if they belong to the active unit, so the interfaces on both firewalls use the same IP and MAC addresses. This leads to an inconsistent MAC address table on connected devices due to poisoned ARP entries, and hence can cause an outage.
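The three possible actions listed above reduce to a small decision function. The Python sketch below only restates that logic for illustration; the function name and the returned strings are hypothetical, and it does not reproduce the actual ASA implementation.

```python
def lantest_outcome(reply_on_failover_link: bool, reply_on_data_interface: bool) -> str:
    """Restate the three documented outcomes of the LANTEST check."""
    if reply_on_failover_link:
        # Peer answered on the failover link: nothing changes.
        return "no failover"
    if reply_on_data_interface:
        # Peer is alive, but the failover link itself is down.
        return "no failover; failover link marked as failed"
    # No reply on any interface: the standby unit takes the active role.
    return "standby becomes active; peer marked as failed (split-brain risk)"


for link_reply, data_reply in [(True, True), (False, True), (False, False)]:
    print(link_reply, data_reply, "->", lantest_outcome(link_reply, data_reply))
```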
Note: The failover link carries this data between the failover pair: unit state (active/standby), hello messages, network link status, MAC address exchange, configuration replication, and synchronization.
How to Proactively Prepare Against Failover Issues
To proactively prepare against a split-brain condition:
- Be on the Cisco Recommended Golden Release - Under certain conditions, split-brain can also be caused by issues such as a memory leak. Cisco Recommended releases significantly reduce your exposure to such situations.
- Network Topology - It is recommended that the data interfaces and the failover links have different paths to decrease the chance of all interfaces failing at the same time.
- Use a port-channel interface for the failover interface - If you have unused interfaces on your firewall, pair them to form a port-channel and use it as the failover link. This increases link reliability and removes a Single Point of Failure (SPOF).
- Ensure failover interface does not have too much latency - As per the ASA Config Guide "For optimum performance when using long distance failover, the latency for the state link should be less than 10 milliseconds and no more than 250 milliseconds. If latency is more than 10 milliseconds, some performance degradation occurs due to retransmission of failover messages."
- Adjust Poll Timer/Hold Timer values as per your deployment - There is no one-size-fits-all approach to failover timers. In general, a timer that is too low can cause unnecessary failovers (especially if there is some latency), and a timer that is too high increases the time a failover takes to occur, which makes failovers noticeable. The Hold Timer value must be 5x the Poll Timer value.
- Configure a Virtual MAC Address for interfaces - As per the ASA Config Guide, if "the secondary unit boots without detecting the primary unit, then the secondary unit becomes the active unit and uses its own MAC addresses because it does not know the primary unit MAC addresses. When the primary unit becomes available, the secondary (active) unit changes the MAC addresses to those of the primary unit, which can cause an interruption in your network traffic. Similarly, if you swap out the primary unit with new hardware, a new MAC address is used.
Virtual MAC addresses guard against this disruption, because the active MAC addresses are known to the secondary unit at startup, and remain the same in the case of new primary unit hardware. If you do not configure virtual MAC addresses, you need to clear the ARP tables on connected routers to restore traffic flow". For more details, refer to MAC Addresses and IP Addresses in Failover.
- Send ASA/FTD logs for both units to an external syslog server - This step mainly helps with serviceability when issues occur.
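The latency and timer guidance in the list above can be turned into a simple pre-check. The Python sketch below is an illustration only: check_failover_settings is a hypothetical helper, the thresholds (10 ms, 250 ms, and the 5x ratio) are taken from the text above, and measuring the actual latency is outside its scope.

```python
def check_failover_settings(latency_ms: float, poll_ms: int, hold_ms: int) -> list:
    """Return a list of findings; an empty list means no issue was detected."""
    findings = []
    if latency_ms > 250:
        findings.append("latency exceeds 250 ms")
    elif latency_ms > 10:
        findings.append("latency above 10 ms: expect some performance degradation")
    # Rule as stated in this document: hold timer is 5x the poll timer.
    if hold_ms != 5 * poll_ms:
        findings.append("hold timer is not 5x the poll timer")
    return findings


print(check_failover_settings(latency_ms=5, poll_ms=1000, hold_ms=5000))
print(check_failover_settings(latency_ms=50, poll_ms=1000, hold_ms=3000))
```

The values passed in the two calls are made-up inputs that show a clean result and a result with two findings.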
Possible Reasons for Split-Brain
As already mentioned, split-brain occurs when the communication between the failover link interfaces is down (unidirectionally or bidirectionally). The most common reasons are:
Procedure to Troubleshoot - Flowchart
To troubleshoot and resolve a split-brain scenario, use this flowchart and start at the box marked Main. Some problems cannot be resolved here. In these cases, links are provided to Cisco Technical Support. To open a service request, you must have a valid service contract.
Note: In FTD deployments, follow the steps in this chart from “system support diagnostic-cli”.
Troubleshooting Flow Chart
Emergency Recovery from Split-Brain
To recover your network from a split-brain, you need to ensure that traffic hits only one of the two firewalls; that is, the MAC addresses learned for the active IP addresses all point to a single unit. To do this, you can disable failover on the other unit or cut it off from the network entirely.
- Disable Failover on the unit not passing traffic:
- On the ASA platform, enter configuration terminal mode in the CLI and enter the no failover command.
- On the FTD platform, in CLISH mode, enter the configure high-availability suspend command.
- For ASA, shut down the data interfaces. For FTD, shut down the interfaces on the connected device. Alternatively, you can physically disconnect the interfaces. You can also power off the device, but this prevents you from managing it. Refer to your device configuration guide for the steps to do this.
Note: If you notice connectivity issues even after you perform the mentioned step(s), it is likely that the connected device(s) have stale ARP entries. Check the ARP entries on upstream and downstream devices. To fix the issue, you can either flush these entries or force the working ASA/FTD to send a gratuitous ARP (GARP) packet for the interface IP address that has the issue. To do this, run this command in enable mode (for FTD, in system support diagnostic-cli): debug menu ipaddrutl 6 <interface ip address>.
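The ARP check in the note above amounts to a comparison between what a neighbor has learned and what the working unit actually uses. The Python sketch below assumes you have already collected both mappings by hand; the function name, the IP addresses, and the second MAC address are hypothetical examples, not values from the lab.

```python
def stale_arp_entries(neighbor_arp: dict, expected: dict) -> dict:
    """Return {ip: (learned_mac, expected_mac)} for entries that do not match."""
    stale = {}
    for ip, want in expected.items():
        have = neighbor_arp.get(ip)
        if have is not None and have.lower() != want.lower():
            stale[ip] = (have, want)
    return stale


# Hypothetical data: IP -> MAC as used by the working unit.
EXPECTED = {"192.0.2.1": "00c1.1000.aaaa", "198.51.100.1": "00c1.2000.aaaa"}
# Hypothetical data: IP -> MAC as learned by a neighboring device.
NEIGHBOR = {"192.0.2.1": "00c1.1000.aaaa", "198.51.100.1": "0011.2233.4455"}

print(stale_arp_entries(NEIGHBOR, EXPECTED))
```

Entries reported as stale are candidates for a flush on the neighbor or for a GARP from the working unit, as described in the note.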
Caution: If you open a support ticket with TAC for split-brain related issues, share the information listed in the Data to be Shared with TAC section of this document.
Data to be Shared with TAC
Share this data if you need to open a TAC Service Request:
- Topology diagram that shows ASA/FTD-HA and its physical connections with neighboring devices (Including Failover Interfaces).
- Output for show tech-support on ASA or Troubleshooting File on Platforms running FTD.
- Syslogs with timestamps that cover 5 minutes before and after the time the issue occurred.
- FXOS Troubleshooting files, if the hardware is an FPR appliance.
To generate Troubleshooting Files for FTD or FXOS, refer to Firepower Troubleshoot File Generation Procedures. Then open a TAC SR.