The Disk Usage health module compares disk usage on a managed device's hard drive and malware storage pack against the limits configured for the module, and alerts when usage exceeds those percentages. The module also alerts when the system drains files in monitored disk usage categories excessively often, or when disk usage excluding those categories reaches excessive levels, based on module thresholds.
This topic describes the symptoms and troubleshooting guidelines for two health alerts generated by the Disk Usage health module:
-
Frequent drain of <SILO NAME>
-
Drain of unprocessed events from <SILO NAME>
The disk manager process manages the disk usage of a device. Each type of file monitored by the disk manager is assigned a
silo. Based on the amount of disk space available on the system, the disk manager computes a High Water Mark (HWM) and a Low Water Mark (LWM) for each silo.
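The exact sizing formula is internal to the disk manager, but the idea can be sketched as follows. Note that the partition size and the percentage quotas below are invented for illustration; they are not the values the product actually uses.

```python
# Illustrative sketch only: the disk manager's real sizing formula is internal
# to the product. Assume each silo is assigned hypothetical low/high quotas
# expressed as fractions of the available disk space.
def watermarks(partition_bytes: int, low_frac: float, high_frac: float):
    """Return (LWM, HWM) in bytes for a silo, given made-up fractional quotas."""
    return partition_bytes * low_frac, partition_bytes * high_frac

# A hypothetical 195 GB partition with invented 2.5% / 5% quotas:
lwm, hwm = watermarks(195_000_000_000, 0.025, 0.05)
```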
To display detailed disk usage information for each part of the system, including silos, LWMs, and HWMs, use the show disk-manager command.
Examples
Following is an example of the disk manager information.
> show disk-manager
Silo Used Minimum Maximum
Temporary Files 0 KB 499.197 MB 1.950 GB
Action Queue Results 0 KB 499.197 MB 1.950 GB
User Identity Events 0 KB 499.197 MB 1.950 GB
UI Caches 4 KB 1.462 GB 2.925 GB
Backups 0 KB 3.900 GB 9.750 GB
Updates 0 KB 5.850 GB 14.625 GB
Other Detection Engine 0 KB 2.925 GB 5.850 GB
Performance Statistics 33 KB 998.395 MB 11.700 GB
Other Events 0 KB 1.950 GB 3.900 GB
IP Reputation & URL Filtering 0 KB 2.437 GB 4.875 GB
Archives & Cores & File Logs 0 KB 3.900 GB 19.500 GB
Unified Low Priority Events 1.329 MB 4.875 GB 24.375 GB
RNA Events 0 KB 3.900 GB 15.600 GB
File Capture 0 KB 9.750 GB 19.500 GB
Unified High Priority Events 0 KB 14.625 GB 34.125 GB
IPS Events 0 KB 11.700 GB 29.250 GB
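For scripted checks, output like the above can be parsed programmatically. A minimal Python sketch, assuming each data row ends with three "<value> <unit>" pairs (the silo name itself may contain spaces):

```python
def parse_disk_manager(output: str) -> dict:
    """Parse `show disk-manager` rows into {silo: {"used", "min", "max"}}.
    Assumes each data row ends with three "<value> <unit>" pairs; rows that
    do not match (for example, the header) are skipped."""
    silos = {}
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 7 or tokens[-1] not in ("KB", "MB", "GB"):
            continue  # header, blank, or unexpected line
        silos[" ".join(tokens[:-6])] = {
            "used": " ".join(tokens[-6:-4]),
            "min": " ".join(tokens[-4:-2]),
            "max": " ".join(tokens[-2:]),
        }
    return silos

sample = """Silo Used Minimum Maximum
File Capture 0 KB 9.750 GB 19.500 GB
IPS Events 0 KB 11.700 GB 29.250 GB"""
```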
Health Alert Format
When the Health Monitor process on the management center runs (once every 5 minutes, or when a manual run is triggered), the Disk Usage module reads the diskmanager.log file and, if the relevant conditions are met, raises the respective health alert.
The structures of these health alerts are as follows:
-
Frequent drain of <SILO NAME>
-
Drain of unprocessed events from <SILO NAME>
For example, a frequent drain of the Unified Low Priority Events silo produces a Frequent drain of Unified Low Priority Events health alert.
It's possible for any silo to generate a Frequent drain of <SILO NAME> health alert. However, the most commonly seen are the alerts related to events. Among the event silos, Frequent drain of Low Priority Events alerts are often seen because the device generates these types of events more frequently.
A Frequent drain of <SILO NAME> event has a Warning severity level when seen in relation to an event silo, because the events are queued to be sent to the management center and no data is lost. For a non-event silo, such as the Backups silo, the alert has a Critical severity level, because the drained information is lost.
Important
Only event silos generate a Drain of unprocessed events from <SILO NAME> health alert. This alert always has a Critical severity level.
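The severity rules above can be summarized in a small sketch (the alert and silo labels here are illustrative, not actual API values):

```python
def alert_severity(alert: str, event_silo: bool) -> str:
    """Severity rules described above: drain-of-unprocessed-events alerts are
    always Critical; frequent-drain alerts are Warning for event silos
    (events are queued, not lost) and Critical for non-event silos."""
    if alert == "drain_of_unprocessed_events":
        return "Critical"
    return "Warning" if event_silo else "Critical"
```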
Additional symptoms besides the alerts can include:
Common Troubleshooting Scenarios
A Frequent drain of <SILO NAME> event is caused by too much input into the silo for its size. In this case, the disk manager drains (purges) that silo at least twice in the last 5-minute interval. In an event-type silo, this is typically caused by excessive logging of that event type.
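The "at least twice in 5 minutes" condition can be checked mechanically from drain timestamps. This sketch assumes you have already extracted per-silo drain times from diskmanager.log (the log format itself varies by version):

```python
from datetime import datetime, timedelta

def is_frequent_drain(drain_times, window=timedelta(minutes=5)):
    """True if any two consecutive drains of one silo fall within the window.
    `drain_times` must be a sorted list of datetimes for a single silo."""
    return any(b - a <= window for a, b in zip(drain_times, drain_times[1:]))
```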
In the case of a Drain of unprocessed events from <SILO NAME> health alert, the cause can also be a bottleneck in the event processing path.
There are three potential bottlenecks with respect to these Disk Usage alerts:
-
Excessive logging ― The EventHandler process on threat defense is oversubscribed (it reads more slowly than Snort writes).
-
Sftunnel bottleneck ― The Eventing interface is unstable or oversubscribed.
-
SFDataCorrelator bottleneck ― The data transmission channel between the management center and the managed device is oversubscribed.
Excessive Logging
One of the most common causes of this type of health alert is excessive input. The difference between the Low Water Mark (LWM) and the High Water Mark (HWM) gathered from the show disk-manager command shows how much space is available for that silo to grow from the LWM (freshly drained) to the HWM value. If there are frequent drains of events (with or without unprocessed events), the first thing to review is the logging configuration.
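As a worked example of that LWM-to-HWM headroom, using the Unified Low Priority Events row from the sample show disk-manager output earlier:

```python
def silo_headroom_gb(lwm_gb: float, hwm_gb: float) -> float:
    """Space a freshly drained silo can absorb before hitting its HWM."""
    return hwm_gb - lwm_gb

# Unified Low Priority Events silo from the sample `show disk-manager` output:
headroom = silo_headroom_gb(4.875, 24.375)  # 19.5 GB between LWM and HWM
```

If the device generates more than this volume of low-priority events between drains, frequent drains are the expected outcome.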
-
Check for double logging ― Double logging scenarios can be identified if you look at the correlator perfstats on the management center:
admin@FMC:~$ sudo perfstats -Cq < /var/sf/rna/correlator-stats/now
-
Check logging settings for the ACP ― Review the logging settings of the Access Control Policy (ACP). If you log both the beginning and the end of connections, log only the end: an end-of-connection event includes all the information from the corresponding beginning-of-connection event, so logging only the end reduces the number of events without losing data.
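As rough arithmetic on why this helps (the connection count below is invented for illustration):

```python
# Hypothetical 1,000,000 connections in a logging interval:
connections = 1_000_000
events_begin_and_end = 2 * connections  # one event at start, one at end
events_end_only = connections           # the end event carries the full record
saved = events_begin_and_end - events_end_only  # half the connection-event volume
```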
Communications Bottleneck ― Sftunnel
Sftunnel is responsible for encrypted communications between the management center and the managed device. Events are sent over the tunnel to the management center. Connectivity issues and/or instability in the communication channel (sftunnel) between the managed device and the management center can be due to:
-
Sftunnel is down or is unstable (flaps).
Ensure that the management center and the managed device have reachability between their management interfaces on TCP port 8305.
The sftunnel process should be stable and should not restart unexpectedly. Verify this by checking the /var/log/messages file and searching for messages that contain the sftunneld string.
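Filtering those messages can be scripted; a minimal sketch that does the same as running `grep sftunneld` against the log on the device (the sample log lines are invented):

```python
def sftunneld_lines(log_text: str):
    """Return log lines that mention sftunneld (e.g. from /var/log/messages)."""
    return [line for line in log_text.splitlines() if "sftunneld" in line]

# Invented sample log content for illustration:
sample_log = """Jan 1 10:00:00 device sftunneld[1234]: Connected to peer
Jan 1 10:00:05 device kernel: unrelated message
Jan 1 10:01:00 device sftunneld[1234]: Channel established"""
```

Repeated start-up messages in a short time span suggest the tunnel is flapping.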
-
Sftunnel is oversubscribed.
Review trend data from the Health Monitor and look for signs of oversubscription of the management center's management interface, which can appear as a spike in management traffic or as constant oversubscription.
Consider using a secondary management interface for Firepower-eventing. To use this interface, you must configure its IP address and other parameters at the threat defense CLI using the configure network management-interface command.
Communications Bottleneck ― SFDataCorrelator
The SFDataCorrelator manages data transmission between the management center and the managed device; on the management center, it analyzes binary files created by the system to generate events, connection data, and network maps. The first step is
to review the diskmanager.log file for important information to be gathered, such as:
-
The frequency of the drain.
-
The number of files with Unprocessed Events drained.
-
The occurrence of the drain with Unprocessed Events.
Each time the disk manager process runs, it generates an entry for each of the different silos in its own log file, which is located at [/ngfw]/var/log/diskmanager.log. Information gathered from diskmanager.log (in CSV format) can be used to help narrow the search for a cause.
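Since the log is CSV-formatted, drain entries can be tallied per silo with a short script. The column layout assumed here (silo name in the first field) is an assumption for illustration; verify it against your version's actual diskmanager.log before relying on the result:

```python
import csv
import io
from collections import Counter

def drain_counts(log_csv_text: str, silo_col: int = 0) -> Counter:
    """Count diskmanager.log entries per silo from CSV text.
    Assumes the silo name is in the first CSV field (an assumption;
    the real column layout may differ by version)."""
    counts = Counter()
    for row in csv.reader(io.StringIO(log_csv_text)):
        if row:
            counts[row[silo_col].strip()] += 1
    return counts

# Invented sample rows for illustration:
log_sample = """Unified Low Priority Events,drain,12 files
Unified Low Priority Events,drain,8 files
Backups,drain,1 file"""
```

A silo that dominates the counts over a short time span is the first place to look.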
Additional troubleshooting steps:
-
The stats_unified.pl command can help you determine whether the managed device has data that needs to be sent to the management center. This condition can happen when the managed device and the management center experience a connectivity issue; the managed device stores the log data on its hard drive until it can be sent.
admin@FMC:~$ sudo stats_unified.pl
-
The manage_procs.pl command can reconfigure the correlator on the management center side.
root@FMC:~# manage_procs.pl
Before You Contact Cisco Technical Assistance Center (TAC)
It is highly recommended to collect these items before you contact Cisco TAC:
-
Screenshots of the health alerts seen.
-
Troubleshoot file generated from the management center.
-
Troubleshoot file generated from the affected managed device.
-
Date and time when the problem was first seen.
-
Information about any recent changes done to the policies (if applicable).
-
The output of the stats_unified.pl command, as described in the Communications Bottleneck ― SFDataCorrelator section.