Monitoring System Processes and Logs
This chapter provides details on monitoring the health of the switch and includes the following sections:
- Information About System Processes and Logs
- Default Settings
- Core and Log Files
- Configuring System Health
- Configuring On-Board Failure Logging
- Verifying System Processes and Logs Configuration
- Additional References
- Feature History for System Processes and Logs
Information About System Processes and Logs
This section includes the following topics:
- Saving Cores
- Saving the Last Core to Bootflash
- First and Last Core
- Online System Health Management
- Loopback Test Configuration Frequency
- Loopback Test Configuration Frame Length
- Hardware Failure Action
- Performing Test Run Requirements
- Tests for a Specified Module
- Clearing Previous Error Reports
- Interpreting the Current Status
- On-Board Failure Logging
Saving Cores
You can save cores (from the active supervisor module, the standby supervisor module, or any switching module) to an external CompactFlash (slot 0) or to a TFTP server in one of two ways:
- On demand—Copies a single file based on the provided process ID.
- Periodically—Copies core files periodically as configured by the user.
A new scheme overwrites any previously issued scheme. For example, if you perform another core log copy task, the cores are periodically saved to the new location or file.
Saving the Last Core to Bootflash
The last core dump is automatically saved to bootflash in the /mnt/pss/ partition before a switchover or reboot occurs. Three minutes after the supervisor module reboots, the saved last core is restored from the flash partition (/mnt/pss) back to its original RAM location. This restoration is a background process and is not visible to the user.
Tip The timestamp on the restored last core file displays the time when the supervisor module booted up, not when the last core was actually dumped. To obtain the exact time of the last core dump, check the corresponding log file with the same PID.
To view the last core information, enter the show cores command in EXEC mode.
To view the time of the actual last core dump, enter the show process log command in EXEC mode.
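For example, a quick check from EXEC mode might look like the following sketch (the PID is illustrative; verify keyword support on your NX-OS release):
switch# show cores
switch# show processes log
switch# show processes log pid 1473    ! 1473 is an example PID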
First and Last Core
The first and last core feature uses limited system resources to retain the most important core files. Generally, the first core and the most recently generated core contain the information needed for debugging, so this feature tries to retain the first and the last core.
If the core files are generated from an active supervisor module, the number of core files retained for each service is defined in the service.conf file. There is no upper limit on the total number of core files on the active supervisor module.
To display the core files saved in the system, use the show cores command.
Online System Health Management
The Online Health Management System (OHMS) (system health) is a hardware fault detection and recovery feature. It ensures the general health of switching, services, and supervisor modules in any switch in the Cisco MDS 9000 Family.
The OHMS monitors system hardware in the following ways:
- The OHMS component running on the active supervisor maintains control over all other OHMS components running on the other modules in the switch.
- The system health application running in the standby supervisor module only monitors the standby supervisor module, if that module is available in the HA standby mode.
The OHMS application launches a daemon process in all modules and runs multiple tests on each module to test individual module components. The tests run at preconfigured intervals, cover all major fault points, and isolate any failing component in the MDS switch. The OHMS running on the active supervisor maintains control over all other OHMS components running on all other modules in the switch.
On detecting a fault, the system health application attempts the following recovery actions:
- Performs additional testing to isolate the faulty component.
- Attempts to reconfigure the component by retrieving its configuration information from persistent storage.
- If unable to recover, sends Call Home notifications, system messages and exception logs; and shuts down and discontinues testing the failed module or component (such as an interface).
- Sends Call Home and system messages and exception logs as soon as it detects a failure.
- Shuts down the failing module or component (such as an interface).
- Isolates failed ports from further testing.
- Reports the failure to the appropriate software component.
- Switches to the standby supervisor module, if an error is detected on the active supervisor module and a standby supervisor module exists in the Cisco MDS switch. After the switchover, the new active supervisor module restarts the active supervisor tests.
- Reloads the switch if a standby supervisor module does not exist in the switch.
- Provides CLI support to view, test, and obtain test run statistics or change the system health test configuration on the switch.
- Performs tests to focus on the problem area.
Each module is configured to run the test relevant to that module. You can change the default parameters of the test in each module as required.
Loopback Test Configuration Frequency
Loopback tests are designed to identify hardware errors in the data path in the module(s) and the control path in the supervisors. One loopback frame is sent to each module at a preconfigured frequency—it passes through each configured interface and returns to the supervisor module.
The loopback tests can be run at frequencies ranging from 5 seconds (default) to 255 seconds. If you do not configure the loopback frequency value, the default frequency of 5 seconds is used for all modules in the switch. Loopback test frequencies can be altered for each module.
Loopback Test Configuration Frame Length
Loopback tests are designed to identify hardware errors in the data path in the module(s) and the control path in the supervisors. One loopback frame is sent to each module at a preconfigured size—it passes through each configured interface and returns to the supervisor module.
The loopback tests can be run with frame sizes ranging from 0 bytes to 128 bytes. If you do not configure the loopback frame length value, the switch generates random frame lengths for all modules in the switch (auto mode). Loopback test frame lengths can be altered for each module.
Hardware Failure Action
The failure-action command determines whether the Cisco NX-OS software takes any action if a hardware failure is detected while running the tests.
By default, this feature is enabled in all switches in the Cisco MDS 9000 Family—action is taken if a failure is determined and the failed component is isolated from further testing.
Failure action is controlled at individual test levels (per module), at the module level (for all tests), or for the entire switch.
Performing Test Run Requirements
Enabling a test does not guarantee that the test will run.
Tests on a specific interface or module run only if you enable system health at all of the following levels:
- The entire switch
- The required module
- The required interface
Tip If system health is disabled at any of these levels, the test does not run and the test status is displayed as disabled.
Tip If the specific module or interface is enabled to run tests, but the tests are not running because system health is disabled at another level, the tests show up as enabled (not running).
Tests for a Specified Module
The system health feature in the NX-OS software performs tests in the following areas:
- Active supervisor’s in-band connectivity to the fabric.
- Standby supervisor’s arbiter availability.
- Bootflash connectivity and accessibility on all modules.
- EOBC connectivity and accessibility on all modules.
- Data path integrity for each interface on all modules.
- Management port’s connectivity.
- User-driven test for external connectivity verification; the port is shut down during the test (Fibre Channel ports only).
- User-driven test for internal connectivity verification (Fibre Channel and iSCSI ports).
Clearing Previous Error Reports
You can clear the error history for Fibre Channel interfaces, iSCSI interfaces, an entire module, or one particular test for an entire module. By clearing the history, you are directing the software to retest all failed components that were previously excluded from tests.
If you previously configured the failure-action option for a period of time (for example, one week) to prevent OHMS from taking any action when a failure is encountered, and you are now ready to start receiving these errors again, you must clear the system health error status for each test.
Tip The management port test cannot be run on a standby supervisor module.
Interpreting the Current Status
The status of each module or test depends on the current configured state of the OHMS test in that particular module (see Table 56-1).
The status of each test in each module is visible when you display any of the show system health commands. See the “Displaying System Health” section.
On-Board Failure Logging
The Generation 2 Fibre Channel switching modules provide the facility to log failure data to persistent storage, which can be retrieved and displayed for analysis. This on-board failure logging (OBFL) feature stores failure and environmental information in nonvolatile memory on the module. The information will help in post-mortem analysis of failed cards.
OBFL data is stored in the existing CompactFlash on the module. OBFL uses the persistent logging (PLOG) facility available in the module firmware to store data in the CompactFlash. It also provides the mechanism to retrieve the stored data.
The data stored by the OBFL facility includes the following:
- Time of initial power-on
- Slot number of the card in the chassis
- Initial temperature of the card
- Firmware, BIOS, FPGA, and ASIC versions
- Serial number of the card
- Stack trace for crashes
- CPU hog information
- Memory leak information
- Software error messages
- Hardware exception logs
- Environmental history
- OBFL specific history information
- ASIC interrupt and error statistics history
- ASIC register dumps
Default Settings
Table 56-2 lists the default system health and log settings.
Core and Log Files
This section includes the following topics:
Prerequisites
- Be sure to create any required directory before performing this task. If the directory specified by this task does not exist, the switch software logs a system message each time a copy of the cores is attempted.
To copy the core and log files on demand, follow this step:
- Copies the core file with the process ID 7407 as coreSample in slot 0.
- Copies the cores (if any) of a process with PID 1524 generated on slot 5 or slot 7 to the TFTP server at IPv4 address 1.1.1.1. Note You can also use an IPv6 address to identify the TFTP server.
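A minimal sketch of the on-demand copy, assuming the copy core://module/pid source syntax; the module number, PIDs, and destinations below are illustrative:
! illustrative values; confirm the exact copy core: syntax on your release
switch# copy core://5/7407 slot0:coreSample
switch# copy core://5/1524 tftp://1.1.1.1/1524_core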
To copy the core and log files periodically, follow these steps:
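A sketch of configuring a periodic scheme, assuming the system cores configuration command; the TFTP destination is illustrative, and the no form removes the scheme:
switch# config terminal
switch(config)# system cores tftp://1.1.1.1/corepath    ! example destination
switch(config)# no system cores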
Clearing the Core Directory
Use the clear cores command to clean out the core directory. The software keeps the last few cores per service and per slot and clears all other core files present on the active supervisor module.
Prerequisites
To clear the cores on a switch, follow these steps:
Step 1 Click Clear to clear the cores.
The software keeps the last few cores per service and per slot and clears all other core files present on the active supervisor module.
Step 2 Click Close to close the dialog box.
Configuring System Health
The Online Health Management System (OHMS) (system health) is a hardware fault detection and recovery feature. It ensures the general health of switching, services, and supervisor modules in any switch in the Cisco MDS 9000 Family.
This section includes the following topics:
- Task Flow for Configuring System Health
- Enabling System Health Initiation
- Configuring Loopback Test Configuration Frequency
- Configuring Loopback Test Configuration Frame Length
- Configuring Hardware Failure Action
- Performing Test Run Requirements
- Clearing Previous Error Reports
- Performing Internal Loopback Tests
- Performing External Loopback Tests
- Performing Serdes Loopbacks
Task Flow for Configuring System Health
Follow these steps to configure system health:
Step 1 Enable System Health Initiation.
Step 2 Configure Loopback Test Configuration Frequency.
Step 3 Configure Loopback Test Configuration Frame Length.
Step 4 Configure Hardware Failure Action.
Step 5 Perform Test Run Requirements.
Step 6 Clear Previous Error Reports.
Step 7 Perform Internal Loopback Tests.
Step 8 Perform External Loopback Tests.
Step 9 Perform Serdes Loopbacks.
Enabling System Health Initiation
By default, the system health feature is enabled in each switch in the Cisco MDS 9000 Family.
To disable or enable this feature in any switch in the Cisco MDS 9000 Family, follow these steps:
- Enables (default) system health to run tests in this switch.
- Disables system health from testing the specified interface.
- Enables (default) system health to test for the specified interface.
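As an illustration, assuming these actions map to the system health and system health interface command forms, the configuration might look like the following sketch (fc 8/1 is an arbitrary example interface):
switch# config terminal
switch(config)# system health
switch(config)# no system health interface fc 8/1    ! example interface
switch(config)# system health interface fc 8/1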
Configuring Loopback Test Configuration Frequency
To configure the frequency of loopback tests for all modules on a switch, follow these steps:
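A minimal sketch, assuming the system health loopback frequency command; the value 50 is illustrative and must fall within the 5 to 255 second range described earlier:
switch# config terminal
switch(config)# system health loopback frequency 50    ! example value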
Configuring Loopback Test Configuration Frame Length
To configure the frame length for loopback tests for all modules on a switch, follow these steps:
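A minimal sketch, assuming the system health loopback frame-length command; 128 bytes is illustrative within the 0 to 128 byte range described earlier:
switch# config terminal
switch(config)# system health loopback frame-length 128    ! example value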
Configuring Hardware Failure Action
To configure failure action in a switch, follow these steps:
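A minimal sketch, assuming the system health failure-action command; the module-level form shown last is an assumption based on the per-module control described earlier:
switch# config terminal
switch(config)# system health failure-action
switch(config)# no system health failure-action
switch(config)# system health module 1 failure-action    ! assumed per-module form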
Performing Test Run Requirements
To perform the required test on a specific module, follow these steps:
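A sketch of configuring a test on a specific module, assuming per-module test keywords of the form system health module slot test; the slot, test name, and frequency values are illustrative:
switch# config terminal
switch(config)# system health module 8 loopback frequency 10    ! example slot, test, and value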
Clearing Previous Error Reports
Use the EXEC-level system health clear-errors command at the interface or module level to erase any previous error conditions logged by the system health application. The bootflash, eobc, inband, loopback, and mgmt test options can be individually specified for a given module.
The following example clears the error history for the specified Fibre Channel interface:
The following example clears the error history for the specified module:
The following example clears the management test error history for the specified module:
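A sketch of the corresponding commands, assuming the clear-errors syntax described above; the interface and module numbers are illustrative:
switch# system health clear-errors interface fc 3/1    ! example interface
switch# system health clear-errors module 3
switch# system health clear-errors module 3 mgmt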
Performing Internal Loopback Tests
You can run manual loopback tests to identify hardware errors in the data path in the switching or services modules, and the control path in the supervisor modules. Internal loopback tests send and receive FC2 frames to and from the same ports and provide the round-trip time taken in microseconds. These tests are available for Fibre Channel, IPS, and iSCSI interfaces.
Use the EXEC-level system health internal-loopback command to explicitly run this test on demand (when requested by the user) within ports for the entire module.
Use the EXEC-level system health internal-loopback command to explicitly run this test on demand (when requested by the user) within ports for the entire module and override the frame count configured on the switch.
Use the EXEC-level system health internal-loopback command to explicitly run this test on demand (when requested by the user) within ports for the entire module and override the frame length configured on the switch.
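A sketch of these on-demand invocations, assuming the internal-loopback syntax with optional frame-count and frame-length overrides; the interface and values are illustrative:
switch# system health internal-loopback interface fc 3/1    ! example interface
switch# system health internal-loopback interface fc 3/1 frame-count 10
switch# system health internal-loopback interface fc 3/1 frame-length 64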
Note If the test fails to complete successfully, the software analyzes the failure and prints the following error: External loopback test on interface fc 7/2 failed. Failure reason: Failed to loopback, analysis complete Failed device ID 3 on module 1
Choose Interface > Diagnostics > Internal to perform an internal loopback test from Device Manager.
Performing External Loopback Tests
You can run manual loopback tests to identify hardware errors in the data path in the switching or services modules, and the control path in the supervisor modules. External loopback tests send and receive FC2 frames to and from the same port or between two ports.
You need to connect a cable (or a plug) to loop the Rx port to the Tx port before running the test. If you are testing to and from the same port, you need a special loop cable. If you are testing to and from different ports, you can use a regular cable. This test is only available for Fibre Channel interfaces.
Use the EXEC-level system health external-loopback interface interface command to run this test on demand for external devices connected to a switch that is part of a long-haul network.
Use the EXEC-level system health external-loopback source interface destination interface command to run this test on demand between two ports on the switch.
Use the EXEC-level system health external-loopback interface frame-count command to run this test on demand for external devices connected to a switch that is part of a long-haul network and override the frame count configured on the switch.
Use the EXEC-level system health external-loopback interface frame-length command to run this test on demand for external devices connected to a switch that is part of a long-haul network and override the frame length configured on the switch.
Use the system health external-loopback interface force command to shut down the required interface directly without a back-out confirmation.
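A sketch of these on-demand invocations, assuming the external-loopback syntax described above; the interfaces and override values are illustrative:
switch# system health external-loopback interface fc 3/1    ! example interface
switch# system health external-loopback source interface fc 3/1 destination interface fc 3/2
switch# system health external-loopback interface fc 3/1 frame-count 10
switch# system health external-loopback interface fc 3/1 frame-length 64
switch# system health external-loopback interface fc 3/1 force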
Note If the test fails to complete successfully, the software analyzes the failure and prints the following error: External loopback test on interface fc 7/2 failed. Failure reason: Failed to loopback, analysis complete Failed device ID 3 on module 1
Choose Interface > Diagnostics > External to perform an external loopback test from Device Manager.
Performing Serdes Loopbacks
Serializer/Deserializer (serdes) loopback tests the hardware for a port. These tests are available for Fibre Channel interfaces.
Use the EXEC-level system health serdes-loopback command to explicitly run this test on demand (when requested by the user) within ports for the entire module.
Use the EXEC-level system health serdes-loopback command to explicitly run this test on demand (when requested by the user) within ports for the entire module and override the frame count configured on the switch.
Use the EXEC-level system health serdes-loopback command to explicitly run this test on demand (when requested by the user) within ports for the entire module and override the frame length configured on the switch.
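A sketch of these invocations, assuming the serdes-loopback syntax with the same optional overrides; the interface and values are illustrative:
switch# system health serdes-loopback interface fc 3/1    ! example interface
switch# system health serdes-loopback interface fc 3/1 frame-count 10
switch# system health serdes-loopback interface fc 3/1 frame-length 64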
Note If the test fails to complete successfully, the software analyzes the failure and prints the following error: External loopback test on interface fc 3/1 failed. Failure reason: Failed to loopback, analysis complete Failed device ID 3 on module 3.
Configuring On-Board Failure Logging
The Generation 2 Fibre Channel switching modules provide the facility to log failure data to persistent storage, which can be retrieved and displayed for analysis. This on-board failure logging (OBFL) feature stores failure and environmental information in nonvolatile memory on the module. The information will help in post-mortem analysis of failed cards.
This section includes the following topics:
Configuring OBFL for the Switch
To configure OBFL for all the modules on the switch, follow these steps:
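A minimal sketch, assuming OBFL is controlled with the hw-module logging onboard configuration command (keyword availability varies by release; use the CLI help to confirm):
switch# config terminal
switch(config)# hw-module logging onboard       ! assumed switch-wide form
switch(config)# no hw-module logging onboard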
Configuring OBFL for a Module
To configure OBFL for specific modules on the switch, follow these steps:
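A minimal sketch of the per-module form, again assuming the hw-module logging onboard command; module 1 is illustrative:
switch# config terminal
switch(config)# hw-module logging onboard module 1    ! example module
switch(config)# no hw-module logging onboard module 1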
Verifying System Processes and Logs Configuration
To display the system processes and logs configuration information, perform one of the following tasks:
For detailed information about the fields in the output from these commands, refer to the Cisco MDS 9000 Family Command Reference .
This section includes the following topics:
- Displaying System Processes
- Displaying System Status
- Displaying Core Status
- Verifying First and Last Core Status
- Displaying System Health
- Verifying Loopback Test Configuration Frame Length
- Displaying OBFL for the Switch
- Displaying the OBFL for a Module
- Displaying OBFL Logs
- Displaying the Module Counters Information
Displaying System Processes
To obtain general information about all processes, follow these steps:
Step 1 Choose Admin > Running Processes .
You see the Running Processes dialog box.
Step 2 Click Close to close the dialog box.
Use the show processes command to obtain general information about all processes (see Example 56-1 to Example 56-6).
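These examples correspond roughly to the show processes family of commands; likely forms include the following (the PID is illustrative, and keyword support may vary by release):
switch# show processes
switch# show processes cpu
switch# show processes log
switch# show processes log pid 1473    ! example PID
switch# show processes log details
switch# show processes memory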
Example 56-1 Displays System Processes
- D = uninterruptible sleep (usually I/O).
- R = runnable (on run queue).
- Z = defunct (“zombie”) process.
- NR = not running.
- ER = should be running but currently not running.
- PC = current program counter in hex format.
- Start_cnt = number of times a process has been started (or restarted).
- TTY = terminal that controls the process. A hyphen usually means a daemon not running on any particular TTY.
- Process Name = name of the process.
Example 56-2 Displays CPU Utilization Information
- MemAllocated = sum of all the dynamically allocated memory that this process has received from the system, including memory that may have been returned to the system.
- Runtime CPU Time (ms) = CPU time the process has used, expressed in milliseconds.
- Invoked = number of times the process has been invoked.
- uSecs = microseconds of CPU time on average for each process invocation.
- 1Sec = CPU utilization in percentage for the last one second.
Example 56-3 Displays Process Log Information
- Normal-exit = whether or not the process exited normally.
- Stack-trace = whether or not there is a stack trace in the log.
- Core = whether or not there exists a core file.
- Log-create-time = when the log file got generated.
Example 56-4 Displays Detail Log Information About a Process
Example 56-5 Displays All Process Log Details
Displaying System Status
Use the show system command to display system-related status information (see Example 56-7 to Example 56-10).
Example 56-7 Displays Default Switch Port States
Example 56-8 Displays Error Information for a Specified ID
Example 56-9 Displays the System Reset Information
The show system reset-reason command displays the following information:
- In a Cisco MDS 9513 Director, the last four reset-reason codes for the supervisor module in slot 7 and slot 8 are displayed. If either supervisor module is absent, the reset-reason codes for that supervisor module are not displayed.
- In a Cisco MDS 9506 or Cisco MDS 9509 switch, the last four reset-reason codes for the supervisor module in slot 5 and slot 6 are displayed. If either supervisor module is absent, the reset-reason codes for that supervisor module are not displayed.
- In a Cisco MDS 9200 Series switch, the last four reset-reason codes for the supervisor module in slot 1 are displayed.
- The show system reset-reason module number command displays the last four reset-reason codes for a specific module in a given slot. If a module is absent, then the reset-reason codes for that module are not displayed.
Use the clear system reset-reason command to clear the reset-reason information stored in NVRAM and volatile persistent storage.
- In a Cisco MDS 9500 Series switch, this command clears the reset-reason information stored in NVRAM in the active and standby supervisor modules.
- In a Cisco MDS 9200 Series switch, this command clears the reset-reason information stored in NVRAM in the active supervisor module.
Example 56-10 Displays System Uptime
Use the show system resources command to display system-related CPU and memory statistics (see Example 56-11).
Example 56-11 Displays System-Related CPU and Memory Information
- Load average—Displays the number of running processes. The average reflects the system load over the past 1, 5, and 15 minutes.
- Processes—Displays the number of processes in the system, and how many are actually running when the command is issued.
- CPU states—Displays the CPU usage percentage in user mode, kernel mode, and idle time in the last one second.
- Memory usage—Displays the total memory, used memory, free memory, memory used for buffers, and memory used for cache in KB. Buffers and cache are also included in the used memory statistics.
To display system status from Device Manager, follow these steps:
Step 1 Choose Physical > System .
You see the System dialog box.
Step 2 Click Close to close the dialog box.
Displaying Core Status
Use the show system cores command to display the currently configured scheme for copying cores. See Examples 56-12 to 56-15.
Example 56-12 Displays the Message when Cores are Transferred to TFTP
Example 56-13 Displays the Message when Cores are Transferred to the External CF
Example 56-14 Displays All Cores Available for Upload from the Active Supervisor Module
Example 56-15 Displays Logs on the Local System
To display cores on a switch, follow these steps:
Note Ensure that SSH2 is enabled on this switch.
Step 1 Choose Admin > Show Cores .
You see the Show Cores dialog box.
Module-num shows the slot number on which the core was generated.
Step 2 Click Close to close the dialog box.
Verifying First and Last Core Status
You can view specific information about the saved core files. Example 56-16 provides further details on saved core files.
Example 56-16 Regular Service on vdc 2 on Active Supervisor Module
There are five radius core files from vdc2 on the active supervisor module. The second and third oldest files are deleted to comply with the number of core files defined in the service.conf file.
Displaying System Health
Use the show system health command to display system-related status information (see Example 56-17 to Example 56-22).
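These examples correspond roughly to the show system health family of commands; likely forms include the following (module and interface numbers are illustrative; verify keywords on your release):
switch# show system health
switch# show system health module 8              ! example module
switch# show system health statistics
switch# show system health statistics module 8
switch# show system health statistics loopback
switch# show system health statistics loopback interface fc 8/1
switch# show system health statistics loopback timelog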
Example 56-17 Displays the Current Health of All Modules in the Switch
Example 56-18 Displays the Current Health of a Specified Module
Example 56-19 Displays Health Statistics for All Modules
Example 56-20 Displays Statistics for a Specified Module
Example 56-21 Displays Loopback Test Statistics for the Entire Switch
Example 56-22 Displays Loopback Test Statistics for a Specified Interface
Note Interface-specific counters will remain at zero unless the module-specific loopback test reports errors or failures.
Example 56-23 Displays the Loopback Test Time Log for All Modules
Example 56-24 Displays the Loopback Test Time Log for a Specified Module
Verifying Loopback Test Configuration Frame Length
To verify the loopback test frame length configuration, use the show system health loopback frame-length command.
Displaying OBFL for the Switch
Use the show logging onboard status command to display the configuration status of OBFL.
Displaying the OBFL for a Module
Use the show logging onboard status command to display the configuration status of OBFL.
Displaying OBFL Logs
To display OBFL information stored in CompactFlash on a module, use the following commands:
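As an illustration, commonly available forms of the show logging onboard command include the following (keywords vary by module type and release; use show logging onboard ? to confirm):
switch# show logging onboard status
switch# show logging onboard environmental-history
switch# show logging onboard exception-log
switch# show logging onboard stack-trace
switch# show logging onboard module 2 environmental-history    ! example module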
Additional References
For additional information related to implementing system processes and logs, see the following section:
Feature History for System Processes and Logs
Table 56-3 lists the release history for this feature. Only features that were introduced or modified in Release 3.x or a later release appear in the table.