简介
本文档介绍如何从UAME内存泄漏问题 — CSCvu73187中恢复Ultra Automation and Monitoring Engine(UAME)。
问题
Ultra M运行状况监视器上的弹性服务控制器(ESC)警报:
[root@pod1-ospd ~]# cat /var/log/cisco/ultram-health/*.report | grep -i xxx
10.10.10.10/vnf-esc | esc | XXX | vnf-esc:(error)
解决方案
状态检查
步骤1.登录到OpenStack Platform Director(OSP-D)并验证vnf-esc错误。
[stack@pod1-ospd ~]$ cat /var/log/cisco/ultram-health/*.report | grep -i xxx
[stack@pod1-ospd ~]$ cat /var/log/cisco/ultram-health/*.report | grep -iv ':-)'
步骤2.确认无法通过管理IP 10.241.179.116登录UAME,但IP是ping的:
(pod1) [stack@pod1-ospd ~]$ ssh ubuntu@10.10.10.10
ssh_exchange_identification: read: Connection reset by peer
(pod1) [stack@pod1-ospd ~]$ ping -c 5 10.10.10.10
PING 10.10.10.10 (10.10.10.10) 56(84) bytes of data.
64 bytes from 10.10.10.10: icmp_seq=1 ttl=57 time=0.242 ms
64 bytes from 10.10.10.10: icmp_seq=2 ttl=57 time=0.214 ms
64 bytes from 10.10.10.10: icmp_seq=3 ttl=57 time=0.240 ms
64 bytes from 10.10.10.10: icmp_seq=4 ttl=57 time=0.255 ms
64 bytes from 10.10.10.10: icmp_seq=5 ttl=57 time=0.240 ms
--- 10.10.10.10 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4000ms
rtt min/avg/max/mdev = 0.214/0.238/0.255/0.016 ms
步骤3.确认与ESC和UAME相关的VM处于活动状态并在OSP-D上运行。
[stack@pod1-ospd ~]$ source *core
(pod1) [stack@pod1-ospd ~]$
(pod1) [stack@pod1-ospd ~]$ nova list --field name,status,host,instance_name,power_state | grep esc
| 31416ffd-0719-4ce5-9e99-a1234567890e | pod1-uame-1 | ACTIVE | - | Running | pod1-AUTOMATION-ORCH=172.16.180.15; pod1-AUTOMATION-MGMT=172.16.181.33 |
| d6830e97-bd82-4d8e-9467-a1234567890e | pod1-uame-2 | ACTIVE | - | Running | pod1-AUTOMATION-ORCH=172.16.180.8; pod1-AUTOMATION-MGMT=172.16.181.12
(pod1) [stack@pod1-ospd ~]$ nova list --field name,status,host,instance_name,power_state | grep uame
| 0c1596bc-e50f-4374-9098-a1234567890e | pod1-esc-vnf-esc-core-esc-1 | ACTIVE | - | Running | pod1-AUTOMATION-ORCH=172.16.180.10; pod1-AUTOMATION-MGMT=172.16.181.10 |
| 3875618d-dcbe-4748-b196-a1234567890e | pod1-esc-vnf-esc-core-esc-2 | ACTIVE | - | Running | pod1-AUTOMATION-ORCH=172.16.180.18; pod1-AUTOMATION-MGMT=172.16.181.5
步骤4.确认您能够连接到主ESC和备用ESC。检验ESC运行状况是否也已通过。
[admin@pod1-esc-vnf-esc-core-esc-2 ~]$ cat /opt/cisco/esc/keepalived_state
[admin@pod1-esc-vnf-esc-core-esc-2 ~]$ health.sh
============== ESC HA with DRBD =================
vimmanager (pgid 14654) is running
monitor (pgid 14719) is running
mona (pgid 14830) is running
snmp is disabled at startup
etsi is disabled at startup
pgsql (pgid 15130) is running
keepalived (pgid 13083) is running
portal is disabled at startup
confd (pgid 15027) is running
filesystem (pgid 0) is running
escmanager (pgid 15316) is running
=======================================
ESC HEALTH PASSED
[admin@pod1-esc-vnf-esc-core-esc-2 ~]$ ssh admin@172.16.180.12
####################################################################
# ESC on pod1-esc-vnf-esc-core-esc-2 is in BACKUP state.
####################################################################
[admin@pod1-esc-vnf-esc-core-esc-1 ~]$ cat /opt/cisco/esc/keepalived_state
BACKUP
恢复步骤
步骤1.登录Pod1-uame-2实例的Horizon Dashboard控制台。
步骤2.从Horizon Dashboard软重新启动Pod1-uame-2 VM实例。观察实例的控制台日志消息。
步骤3.在Horizon Dashboard的Pod1-uame-2 VM实例的控制台中显示登录提示后,通过UAME的管理IP 10.10.10.10启动SSH
(pod1) [stack@pod1-ospd ~]$ ssh ubuntu@10.10.10.10
注意:仅当此步骤成功时,才继续下一步。
步骤4.检查主UAME上磁盘空间,尤其是/dev/vda3文件系统的磁盘空间。
ubuntu@pod1-uame-1:~$ df -kh
步骤5.在主UAME上截断syslog或syslog.1文件(两个文件中较大的文件大小,通常为MB或GB)(T)。
ubuntu@pod1-uame-1:~$ sudo su -
root@pod1-uame-1:~#
root@pod1-uame-1:~# cd /var/log
root@pod1-uame-1:/var/log# ls -lrth *syslog*
root@pod1-uame-1:/var/log# > syslog.1 or > syslog
步骤6.确保系统日志或syslog.1文件大小现在在主UAME上为0个字节。
root@pod1-uame-1:/var/log# ls -lrth *syslog*
步骤7.确保df -kh应具有足够的可用空间用于主UAME上的文件系统分区。
ubuntu@pod1-uame-1:~$ df -kh
SSH到辅助UAME。
ubuntu@pod1-uame-1:~$ ssh ubuntu@172.16.180.8
password:
...
ubuntu@pod1-uame-2:~$
步骤8.在辅助UAME上截断syslog或syslog.1文件(两个文件中较大的文件大小,通常为MB或GB)。
ubuntu@pod1-uame-2:~$ sudo su -
root@pod1-uame-2:~#
root@pod1-uame-2:~# cd /var/log
root@pod1-uame-2:/var/log# ls -lrth *syslog*
root@pod1-uame-2:/var/log# > syslog.1 or > syslog
步骤9.确保系统日志或syslog.1文件大小现在在辅助UAME上为0字节。
root@pod1-uame-2:/var/log# ls -lrth *syslog*
步骤10.确保df -kh应有足够的可用空间用于辅助UAME上的文件系统分区。
ubuntu@pod1-uame-2:~$ df -kh
恢复状态检查后
步骤1.等待Ultra M运行状况监视器至少一次迭代,以确认运行状况报告上未发现vnf-esc错误。
[stack@pod1-ospd ~]$ cat /var/log/cisco/ultram-health/*.report | grep -i xxx
[stack@pod1-ospd ~]$ cat /var/log/cisco/ultram-health/*.report | grep -iv ':-)'
步骤2.确认ESC和UAME VM处于活动状态并在OSPD上运行。
[stack@pod1-ospd ~]$ source *core
(pod1) [stack@pod1-ospd ~]$ nova list --field name,status,host,instance_name,power_state | grep esc
(pod1) [stack@pod1-ospd ~]$ nova list --field name,status,host,instance_name,power_state | grep uame
步骤3. SSH到主ESC和备份ESC,并确认ESC运行状况也已通过。
[admin@pod1-esc-vnf-esc-core-esc-2 ~]$ cat /opt/cisco/esc/keepalived_state
[admin@pod1-esc-vnf-esc-core-esc-2 ~]$ health.sh
============== ESC HA with DRBD =================
vimmanager (pgid 14638) is running
monitor (pgid 14703) is running
mona (pgid 14759) is running
snmp is disabled at startup
etsi is disabled at startup
pgsql (pgid 15114) is running
keepalived (pgid 13205) is running
portal is disabled at startup
confd (pgid 15011) is running
filesystem (pgid 0) is running
escmanager (pgid 15300) is running
=======================================
ESC HEALTH PASSED
[admin@pod1-esc-vnf-esc-core-esc-2 ~]$ ssh admin@
admin@172.16.181.26's password:
Last login: Fri May 1 10:28:12 2020 from 172.16.180.13
####################################################################
# ESC on scucs501-esc-vnf-esc-core-esc-2 is in BACKUP state.
####################################################################
[admin@pod1-esc-vnf-esc-core-esc-2 ~]$ cat /opt/cisco/esc/keepalived_state
BACKUP
步骤4.在UAME中确认ESC vnfd处于ALIVE状态。
ubuntu@pod1-uame-1:~$ sudo su
ubuntu@pod1-uame-1:~$ confd_cli -u admin -C
pod1-uame-1# show vnfr state