Troubleshoot Incomplete Diagnostics.sh Script Execution

Updated: July 7, 2023

Document ID: 220562

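Introduction

This document describes the procedure to troubleshoot incomplete diagnostics.sh script execution in Cisco Policy Suite (CPS).

Contributed by Ullas Kumar E, Cisco TAC Engineer.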

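Prerequisites

Requirements

Cisco recommends that you have knowledge of these topics:

Linux
CPS

Note: Cisco recommends that you have privileged root access to the CPS CLI.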

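Components Used

The information in this document is based on these software and hardware versions:

CPS 21.1
CentOS 8.0
Unified Computing System (UCS)-B

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.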

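Background Information

Diagnostics.sh is a basic troubleshooting command that can be run on the pcrfclient or installer node of CPS to verify the current status of the system.

It reports a detailed list of parameters as part of the CPS health check.

The script runs against the various access, monitoring, and configuration points of a running CPS system.

In High Availability (HA) or Geo-Redundant (GR) environments, the script always performs a ping check for all Virtual Machines (VMs) before any other check and adds every VM that fails the ping test to the IGNORED_HOSTS variable. This reduces the possibility of errors in the script functions.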
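
The ping pre-check described above can be pictured with a minimal sketch. This is not the actual diagnostics.sh code: the function name build_ignored_hosts and the overridable PING_CMD variable are hypothetical, and the single probe with a two-second timeout is an assumption.

```shell
# Hypothetical illustration of the ping pre-check; not taken from diagnostics.sh.
# PING_CMD is an assumed knob so the probe command can be swapped out.
PING_CMD=${PING_CMD:-"ping -c 1 -W 2"}

# Print a comma-separated list of the hosts (given as arguments) that do not answer.
build_ignored_hosts() {
    local ignored="" host
    for host in "$@"; do
        if ! $PING_CMD "$host" >/dev/null 2>&1; then
            # Append the host, inserting a comma only when the list is non-empty.
            ignored="${ignored:+$ignored,}$host"
        fi
    done
    printf '%s\n' "$ignored"
}
```

A caller could then store the result, for example IGNORED_HOSTS=$(build_ignored_hosts lb01 lb02 pcrfclient01 pcrfclient02), and skip those hosts in later checks.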

Examples:
 /var/qps/bin/diag/diagnostics.sh -q
 /var/qps/bin/diag/diagnostics.sh --basic_ports --clock_skew

These are the prominent checks that this script performs.

--basic_ports : Run basic port checks
 For AIO: 80, 11211, 27017, 27749, 7070, 8080, 8090, 8182, 9091, 9092
 For HA/GR: 80, 11211, 7070, 8080, 8081, 8090, 8182, 9091, 9092, and Mongo DB ports based on /etc/broadhop/mongoConfig.cfg
 --clock_skew : Check clock skew between lb01 and all vms (Multi-Node Environment only)
 --diskspace : Check diskspace
 --get_active_alarms : Get the active alarms in the CPS
 --get_frag_status : Get fragmentation status for Primary members of DBs viz. session_cache, sk_cache, diameter, spr, and balance_mgmt.
 --get_replica_status : Get the status of the replica-sets present in environment. (Multi-Node Environment only)
 --get_shard_health : Get the status of the sharded database information present in environment. (Multi-Node Environment only)
 --get_sharding_status : Get the status of the sharding information present in environment. (Multi-Node Environment only).
 --get_session_shard_health : Get the session shard health status information present in environment. (Multi-Node Environment only).
 --get_peer_status : Get the diameter peer information present in environment. (Multi-Node Environment only).
 --get_sharded_replica_status : Get the status of the shards present in environment. (Multi-Node Environment only)
 --ha_proxy : Connect to HAProxy to check operation and performance statistics, and ports (Multi-Node Environment only)
      http://lbvip01:5540/haproxy?stats
      http://lbvip01:5540//haproxy-diam?stats
 --help -h : Help - displays this help
 --hostnames : Check hostnames are valid (no underscores, resolvable, in /etc/broadhop/servers) (AIO only)
 --ignored_hosts : Ignore the comma separated list of hosts. For example --ignored_hosts='portal01,portal02'
      Default is 'portal01,portal02,portallb01,portallb02' (Multi-Node Environment only)
 --ping_check : Check ping status for all VM
 --policy_revision_status : Check the policy revision status on all QNS,LB,UDC VMs.
 --lwr_diagnostics : Retrieve diagnostics from CPS LWR kafka processes
 --qns_diagnostics : Retrieve diagnostics from CPS java processes
 --qns_login : Check qns user passwordless login
 --quiet -q : Quiet output - display only failed diagnostics
 --radius : Run radius specific checks
 --redis : Run redis specific checks
 --whisper : Run whisper specific checks
 --aido : Run Aido specific checks
 --svn : Check svn sync status between pcrfclient01 & pcrfclient02 (Multi-Node Environment only)
 --tacacs : Check Tacacs server reachability
 --swapspace : Check swap space
 --verbose -v : Verbose output - display *all* diagnostics (by default, some are grouped for readability)
 --virtual_ips : Ensure Virtual IP Addresses are operational (Multi-Node Environment only)
 --vm_allocation : Ensure VM Memory and CPUs have been allocated according to recommendations

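Problem

In some situations, the diagnostics.sh script hangs at one point during execution and neither moves further nor completes.

When you run the script, you can observe that it stalls after "Checking AIDO status on all VMs" (AIDO: Auto Intelligent DB Operations) and does not proceed to the Subversion (SVN) check or any later check.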

[root@installer ~]# diagnostics.sh 
CPS Diagnostics HA Multi-Node Environment
---------------------------
Ping check for all VMs...
Hosts that are not 'pingable' are added to the IGNORED_HOSTS variable...[PASS]
Checking basic ports for all VMs...[PASS]
Checking qns passwordless logins for all VMs...[PASS]
Validating hostnames...[PASS]
Checking disk space for all VMs...[PASS]
Checking swap space for all VMs...[PASS]
Checking for clock skew for all VMs...[PASS]
Retrieving diagnostics from pcrfclient01:9045...[PASS]
Retrieving diagnostics from pcrfclient02:9045...[PASS]
Checking redis server instances status on lb01...[PASS]
Checking redis server instances status on lb02...[PASS]
Checking whisper status on all VMs...[PASS]
Checking AIDO status on all VMs...[PASS]
.
.
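
One way to detect such a hang without watching the terminal is to run the script under a deadline. The sketch below assumes the GNU coreutils timeout utility is available, which exits with status 124 when the deadline expires. The function name run_with_deadline is hypothetical and not part of CPS.

```shell
# Hypothetical wrapper: run a command with a deadline and flag a hang.
# Relies on GNU coreutils `timeout`, which returns 124 when the time limit is hit.
run_with_deadline() {
    local seconds=$1 rc
    shift
    timeout "$seconds" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HUNG: '$*' did not finish within ${seconds}s" >&2
    fi
    return "$rc"
}
```

For example, run_with_deadline 600 /var/qps/bin/diag/diagnostics.sh -q gives the script ten minutes. The 600-second limit is an arbitrary value chosen for illustration, not a documented run time.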

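When you check the verbose output of diagnostics.sh, the trace reaches the step that checks the SVN status and does not move any further. This indicates that the diagnostics.sh script is stuck at the facter check.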

[PASS] AIDO Pass
[[ -f /var/tmp/aido_extra_info ]]
cat /var/tmp/aido_extra_info
There is no provision to check AIDO service status of installer from this host
/bin/rm -fr /var/tmp/aido_extra_info
check_all_svn
++ is_enabled true
++ [[ '' == \t\r\u\e ]]
++ [[ true != \f\a\l\s\e ]]
++ echo true
[[ true == \t\r\u\e ]]
++ awk '{$1=""; $2=""; print}'
++ /usr/bin/ssh root@pcrfclient01 -o ConnectTimeout=2 /usr/bin/facter.  
++ grep svn_slave_list

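The script logs in to pcrfclient01 and looks for svn_slave_list in the output of the facter command, but that command never finishes.

You can also log in to pcrfclient01 and check whether the facter command runs properly and returns the expected output.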
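
To make the trace easier to read, here is a small sketch of what the grep and awk stages produce once facter does return. It assumes facter prints facts as "name => value" lines. The function name extract_svn_slave_list, the trailing sed that trims leading spaces, and the sample value in the comment are illustrative additions and are not from the script.

```shell
# Hypothetical re-creation of the pipeline seen in the trace.
# Input on stdin: facter-style lines such as "svn_slave_list => pcrfclient02"
# (the value shown here is a made-up sample).
extract_svn_slave_list() {
    grep svn_slave_list \
        | awk '{$1=""; $2=""; print}' \
        | sed 's/^ *//'   # awk leaves leading blanks after clearing two fields
}
```

Because this pipeline reads from facter over ssh, it can only complete when facter on pcrfclient01 exits, which is why a stuck facter stalls the whole script.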

[root@pcrfclient01 ]# facter | grep eth
[root@installer ~]# ^C

When you check the load average of pcrfclient01, it is observed to be very high.

[root@pcrfclient01 pacemaker]# top
top - 15:34:18 up 289 days, 14:55, 1 user, load average: 2094.68, 2091.77, 2086.36
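
As a rough way to judge whether a load average like this is abnormal, the value can be compared with the number of CPU cores. The sketch below uses that common rule of thumb, which is an assumption and not a CPS-documented threshold. The function name load_is_high is hypothetical.

```shell
# Hypothetical check: succeed (exit 0) when the load average exceeds the core count.
# $1 = load average (for example the 1-minute value), $2 = number of CPU cores.
load_is_high() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}
```

On a Linux host it could be called as load_is_high "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)" && echo "load exceeds core count".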

Verify whether facter-related processes are stuck and are the cause of the high load average.

[root@pcrfclient01 ~]# ps -ef | grep facter | wc -l
2096
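
Note that a count taken with ps -ef | grep facter | wc -l also includes the grep process itself. A slightly more careful count can be sketched as follows, assuming input lines in the order PID, state, elapsed time, command line. The function name count_stuck_facter is hypothetical.

```shell
# Hypothetical counter: reads "PID STAT ETIME ARGS..." lines on stdin and prints
# how many mention facter, ignoring the grep/awk helpers used to search for it.
count_stuck_facter() {
    awk '/facter/ && $4 !~ /(^|\/)(grep|awk)$/ { n++ } END { print n + 0 }'
}
```

It could be fed with ps -e -o pid= -o stat= -o etime= -o args= | count_stuck_facter, where the elapsed-time column also shows how long each process has been running.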

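Solution

The definitive way to clear these stuck processes and bring the load average down is to reboot the pcrfclient01 VM. Use this procedure to clear the stuck facter processes and resolve the diagnostics.sh execution hang:

Step 1. Log in to the pcrfclient node and run the reboot command.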

[root@pcrfclient01 ~]# init 6

Step 2. Verify that the pcrfclient01 VM is up and stable.

[root@pcrfclient01 ~]# uptime
10:07:15 up 1 min, 4:09, 1 user, load average: 0.33, 0.33, 0.36
[root@pcrfclient01 ~]#
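
If you prefer to wait for the VM from another node instead of retrying the login by hand, a generic retry loop can be sketched as below. The function name wait_for and the attempt and delay values in the usage note are hypothetical choices, not CPS recommendations.

```shell
# Hypothetical retry helper: run a command up to N times, pausing between tries.
# Usage: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for() {
    local attempts=$1 delay=$2 i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0            # command succeeded, stop waiting
        fi
        i=$((i + 1))
        # Sleep only when another attempt is still to come.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1                    # every attempt failed
}
```

For example, wait_for 30 10 ssh -o ConnectTimeout=2 root@pcrfclient01 true retries for about five minutes until an SSH login to pcrfclient01 succeeds.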

Step 3. Verify that the load average of pcrfclient01 is normal.

[root@pcrfclient01 ~]# top
top - 10:07:55 up 1 min, 4:10, 1 user, load average: 0.24, 0.31, 0.35

Step 4. Run diagnostics.sh and verify that the script execution completes.

[root@pcrfclient01 ~]# diagnostics.sh

Revision History

Revision	Publish Date	Comments
1.0	07-Jul-2023	Initial Release

Contributed by Cisco Engineers

Ullas Kumar E
Cisco TAC Engineer