This document describes the steps required to replace faulty components in a Unified Computing System (UCS) server in an Ultra-M setup.
This procedure applies to an OpenStack environment that runs the Newton version, where the Elastic Services Controller (ESC) does not manage CPAR and CPAR is installed directly on the VM deployed on OpenStack.
Ultra-M is a pre-packaged and validated virtualized mobile packet core solution that is designed to simplify the deployment of Virtual Network Functions (VNFs). OpenStack is the Virtualized Infrastructure Manager (VIM) for Ultra-M and consists of these node types: Compute, Object Storage Disk (OSD) Compute, Controller, and OpenStack Platform Director (OSPD).
The high-level architecture of Ultra-M and the components involved are depicted in this image:
This document is intended for Cisco personnel who are familiar with the Cisco Ultra-M platform, and it details the steps that need to be carried out at the OpenStack and Red Hat OS level.
Note: The Ultra M 5.1.x release is considered in order to define the procedures in this document.
MoP  | Method of Procedure
OSD  | Object Storage Disks
OSPD | OpenStack Platform Director
HDD  | Hard Disk Drive
SSD  | Solid State Drive
VIM  | Virtualized Infrastructure Manager
VM   | Virtual Machine
EM   | Element Manager
UAS  | Ultra Automation Services
UUID | Universally Unique Identifier
Before you replace a faulty component, it is important to check the current state of your Red Hat OpenStack Platform environment in order to avoid complications while the replacement process is in progress. The replacement flow in this document achieves that.
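For example, a minimal pre-check sketch from the OSPD node and one of the controllers (the hostnames are illustrative and taken from the examples later in this document):

[stack@director ~]$ source stackrc
[stack@director ~]$ nova list                                # all overcloud nodes must be ACTIVE/Running
[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs status      # the pacemaker cluster must be healthy
[heat-admin@pod2-stack-controller-0 ~]$ sudo ceph -s         # Ceph must report HEALTH_OK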
For the sake of recovery, Cisco recommends that you take a backup of the OSPD database with these steps:
[root@director ~]# mysqldump --opt --all-databases > /root/undercloud-all-databases.sql
[root@director ~]# tar --xattrs -czf undercloud-backup-`date +%F`.tar.gz /root/undercloud-all-databases.sql
/etc/my.cnf.d/server.cnf /var/lib/glance/images /srv/node /home/stack
tar: Removing leading `/' from member names
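To confirm that the archive was created correctly, you can list its contents (substitute the actual date stamp in the file name):

[root@director ~]# tar -tzf undercloud-backup-<date>.tar.gz | head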
This process ensures that a node can be replaced without affecting the availability of any instances. Also, it is recommended to back up the StarOS configuration, especially if the compute/OSD-compute node to be replaced hosts the Control Function (CF) Virtual Machine (VM).
Note: If the server is the Controller node, proceed to the Controller node section; otherwise, continue with the next section. Ensure that you have the snapshot of the instance so that you can restore the VM when needed. Follow this procedure on how to take a snapshot of the VM.
Identify the VMs that are hosted on the server.
[stack@al03-pod2-ospd ~]$ nova list --field name,host
+--------------------------------------+---------------------------+----------------------------------+
| ID                                   | Name                      | Host                             |
+--------------------------------------+---------------------------+----------------------------------+
| 46b4b9eb-a1a6-425d-b886-a0ba760e6114 | AAA-CPAR-testing-instance | pod2-stack-compute-4.localdomain |
| 3bc14173-876b-4d56-88e7-b890d67a4122 | aaa2-21                   | pod2-stack-compute-3.localdomain |
| f404f6ad-34c8-4a5f-a757-14c8ed7fa30e | aaa21june                 | pod2-stack-compute-3.localdomain |
+--------------------------------------+---------------------------+----------------------------------+
Note: In the output shown here, the first column corresponds to the UUID, the second column is the VM name and the third column is the hostname where the VM is present. The parameters from this output will be used in subsequent sections.
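If the server to be replaced is already known, the same output can be filtered by hostname, as is done later in this document; for example:

[stack@al03-pod2-ospd ~]$ nova list --field name,host | grep compute-4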
Backup: SNAPSHOT PROCESS
Step 1. Open any SSH client connected to the TMO Production network and connect to the CPAR instance.
It is important not to shut down all four AAA instances within one site at the same time; shut them down one by one.
Step 2. In order to shut down the CPAR application, run the command:
/opt/CSCOar/bin/arserver stop
The message "Cisco Prime Access Registrar Server Agent shutdown complete." must appear.
Note: If a user left a CLI session open, the arserver stop command won’t work and this message is displayed:
ERROR: You cannot shut down Cisco Prime Access Registrar while the CLI is being used. Current list of running CLI with process id is: 2903 /opt/CSCOar/bin/aregcmd –s
In this example, the highlighted process ID 2903 needs to be terminated before CPAR can be stopped. If this is the case, terminate this process with the command:
kill -9 *process_id*
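For example, a minimal sketch that locates the open aregcmd session and terminates the PID reported in the error message (2903 in this example):

ps -ef | grep aregcmd
kill -9 2903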
Then, repeat Step 2.
Step 3. In order to verify that the CPAR application was indeed shut down, run the command:
/opt/CSCOar/bin/arstatus
These messages must appear:
Cisco Prime Access Registrar Server Agent not running
Cisco Prime Access Registrar GUI not running
Step 1. Enter the Horizon GUI website that corresponds to the Site (City) currently being worked on.
When you access Horizon, this screen is displayed.
Step 2. Navigate to Project > Instances as shown in this image.
If the user logged in was cpar, then only the four AAA instances appear in this menu.
Step 3. Shut down only one instance at a time and repeat the whole process in this document for each one. In order to shut down the VM, navigate to Actions > Shut Off Instance as shown in this image, and confirm your selection.
Step 4. Validate that the instance was indeed shut down by checking that Status = Shutoff and Power State = Shut Down, as shown in this image.
This step ends the CPAR shutdown process.
Once the CPAR VMs are down, the snapshots can be taken in parallel, as they belong to independent computes.
The four QCOW2 files are created in parallel.
Take a snapshot of each AAA instance. This takes 25 minutes to 1 hour: 25 minutes for instances that used a QCOW image as a source, and 1 hour for instances that used a raw image as a source.
3. Click Create Snapshot in order to proceed with the snapshot creation (this needs to be executed on the corresponding AAA instance) as shown in this image.
4. Once the snapshot is executed, navigate to the Images menu and verify that all snapshots finish and report no problems, as shown in this image.
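As an optional alternative to the Horizon workflow, the snapshot can also be triggered from the OSPD CLI on releases where the nova image-create command is available; a sketch with the instance name from the earlier output and a hypothetical snapshot name:

[stack@al03-pod2-ospd ~]$ nova image-create --poll aaa2-21 aaa2-21-snapshot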
5. The next step is to download the snapshot in QCOW2 format and transfer it to a remote entity, in case the OSPD is lost during this process. In order to achieve this, identify the snapshot with the command glance image-list at the OSPD level.
[root@elospd01 stack]# glance image-list
+--------------------------------------+---------------------------+
| ID                                   | Name                      |
+--------------------------------------+---------------------------+
| 80f083cb-66f9-4fcf-8b8a-7d8965e47b1d | AAA-Temporary             |
| 22f8536b-3f3c-4bcc-ae1a-8f2ab0d8b950 | ELP1 cluman 10_09_2017    |
| 70ef5911-208e-4cac-93e2-6fe9033db560 | ELP2 cluman 10_09_2017    |
| e0b57fc9-e5c3-4b51-8b94-56cbccdf5401 | ESC-image                 |
| 92dfe18c-df35-4aa9-8c52-9c663d3f839b | lgnaaa01-sept102017       |
| 1461226b-4362-428b-bc90-0a98cbf33500 | tmobile-pcrf-13.1.1.iso   |
| 98275e15-37cf-4681-9bcc-d6ba18947d7b | tmobile-pcrf-13.1.1.qcow2 |
+--------------------------------------+---------------------------+
6. Once you identify the snapshot to download (in this example, lgnaaa01-sept102017, the one marked in green), you can download it in QCOW2 format with the command glance image-download, as depicted here.
[root@elospd01 stack]# glance image-download 92dfe18c-df35-4aa9-8c52-9c663d3f839b --file /tmp/AAA-CPAR-LGNoct192017.qcow2 &
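Because the trailing & runs the download in the background, you can check on it before you continue; for example:

[root@elospd01 stack]# jobs                                      # shows the background glance image-download
[root@elospd01 stack]# ls -lh /tmp/AAA-CPAR-LGNoct192017.qcow2   # the file size grows until the download completes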
7. Once the download process finishes, a compression process needs to be executed, as the snapshot can be filled with zeroes because of processes, tasks, and temporary files handled by the Operating System (OS). The command to be used for file compression is virt-sparsify.
[root@elospd01 stack]# virt-sparsify AAA-CPAR-LGNoct192017.qcow2 AAA-CPAR-LGNoct192017_compressed.qcow2
This process can take some time (around 10 to 15 minutes). Once finished, the resulting file is the one that needs to be transferred to an external entity, as specified in the next step.
Verification of the file integrity is required. In order to achieve this, run the next command and look for the "corrupt" attribute at the end of its output.
[root@wsospd01 tmp]# qemu-img info AAA-CPAR-LGNoct192017_compressed.qcow2
image: AAA-CPAR-LGNoct192017_compressed.qcow2
file format: qcow2
virtual size: 150G (161061273600 bytes)
disk size: 18G
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
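The transfer to the external entity can then be done with any secure copy tool; a sketch where user@backup-host and /backup/ are placeholders for your remote entity:

[root@wsospd01 tmp]# scp AAA-CPAR-LGNoct192017_compressed.qcow2 user@backup-host:/backup/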
[stack@director ~]$ nova stop aaa2-21
Request to stop server aaa2-21 has been accepted.
[stack@director ~]$ nova list
+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+
| ID                                   | Name                      | Status  | Task State | Power State | Networks                                                                                                     |
+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+
| 46b4b9eb-a1a6-425d-b886-a0ba760e6114 | AAA-CPAR-testing-instance | ACTIVE  | -          | Running     | tb1-mgmt=172.16.181.14, 10.225.247.233; radius-routable1=10.160.132.245; diameter-routable1=10.160.132.231  |
| 3bc14173-876b-4d56-88e7-b890d67a4122 | aaa2-21                   | SHUTOFF | -          | Shutdown    | diameter-routable1=10.160.132.230; radius-routable1=10.160.132.248; tb1-mgmt=172.16.181.7, 10.225.247.234   |
| f404f6ad-34c8-4a5f-a757-14c8ed7fa30e | aaa21june                 | ACTIVE  | -          | Running     | diameter-routable1=10.160.132.233; radius-routable1=10.160.132.244; tb1-mgmt=172.16.181.10                  |
+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+
Power off the specified server. The steps to replace a faulty component on a UCS C240 M4 server can be referred to here:
Replacing the Server Components
Recovery Process
It is possible to redeploy the previous instance with the snapshot taken in the previous steps.
Step 1. (Optional) If there is no previous VM snapshot available, connect to the OSPD node where the backup was sent and SFTP the backup back to its original OSPD node with sftp root@x.x.x.x, where x.x.x.x is the IP of the original OSPD. Save the snapshot file in the /tmp directory.
Step 2. Connect to the OSPD node where the instance can be re-deployed as shown in the image.
Source the environment variables with this command:
# source /home/stack/pod1-stackrc-Core-CPAR
Step 3. In order to use the snapshot as an image, it is necessary to upload it to Horizon as such. Run the next command to do so.
# glance image-create --file AAA-CPAR-Date-snapshot.qcow2 --container-format bare --disk-format qcow2 --name AAA-CPAR-Date-snapshot
The process can be seen in Horizon, as shown in this image.
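The upload can also be verified from the OSPD CLI; for example:

# glance image-list | grep AAA-CPAR-Date-snapshot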
Step 4. In Horizon, navigate to Project > Instances and click on Launch Instance as shown in this image.
Step 5. Enter the Instance Name and choose the Availability Zone as shown in this image.
Step 6. In the Source tab, choose the image in order to create the instance. In the Select Boot Source menu, select Image; a list of images is shown. Choose the one that was previously uploaded by clicking its + sign, as shown in this image.
Step 7. In the Flavor tab, choose the AAA flavor by clicking on the + sign as shown in this image.
Step 8. Finally, navigate to the Network tab and choose the networks that the instance needs by clicking the + sign. For this case, select diameter-routable1, radius-routable1 and tb1-mgmt as shown in this image.
Finally, click on Launch Instance in order to create it. The progress can be monitored in Horizon:
After a few minutes, the instance is completely deployed and ready for use as shown in this image.
A floating IP address is a routable address, which means that it is reachable from outside of the Ultra-M/OpenStack architecture and it is able to communicate with other nodes from the network.
Step 1. In the Horizon top menu, navigate to Admin > Floating IPs.
Step 2. Click Allocate IP to Project.
Step 3. In the Allocate Floating IP window, select the Pool to which the new floating IP belongs, the Project where it is going to be assigned, and the new Floating IP Address itself.
For example:
Step 4. Click the Allocate Floating IP button.
Step 5. In the Horizon top menu, navigate to Project > Instances.
Step 6. In the Action column, click the arrow that points down next to the Create Snapshot button; a menu is displayed. Select the Associate Floating IP option.
Step 7. In the IP Address field, select the corresponding floating IP address intended to be used, and in Port to be associated, choose the management interface (eth0) of the new instance where this floating IP is going to be assigned. Refer to the next image as an example of this procedure.
Step 8. Finally, click Associate.
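On releases where the legacy nova floating-ip commands are still available, the same association can also be done from the CLI; a sketch with a hypothetical instance name and the example floating IP used in this document:

nova floating-ip-associate <instance-name> 10.145.0.249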
Step 1. In the Horizon top menu, navigate to Project > Instances.
Step 2. Click on the name of the instance/VM that was created in section Launch a new instance.
Step 3. Click the Console tab. This displays the CLI of the VM.
Step 4. Once the CLI is displayed, enter the proper login credentials as shown in the image:
Username:root
Password:cisco123
Step 5. In the CLI, run the command vi /etc/ssh/sshd_config in order to edit the SSH configuration.
Step 6. Once the SSH configuration file is open, press i to edit the file. Then look for this section and change the first line from PasswordAuthentication no to PasswordAuthentication yes, as shown in this image.
Step 7. Press ESC and run :wq! in order to save sshd_config file changes.
Step 8. Run the command service sshd restart as shown in the image.
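Steps 5 through 8 can also be done non-interactively; a one-shot sketch that makes the same edit with sed and restarts the service:

sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
service sshd restart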
Step 9. In order to test that the SSH configuration changes have been correctly applied, open any SSH client and try to establish a remote secure connection with the floating IP assigned to the instance (for example, 10.145.0.249) and the user root, as shown in the image.
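For example, with the floating IP from this document:

ssh root@10.145.0.249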
Step 1. Open an SSH session with the IP address of the corresponding VM/server where the application is installed, as shown in the image.
CPAR instance start
Once the activity has been completed, follow these steps in order to re-establish the CPAR services in the site that was shut down.
Step 1. Log back in to Horizon and navigate to Project > Instance > Start Instance.
Step 2. Verify that the status of the instance is Active and the power state is Running as seen in this image.
Post-activity Health Check
Step 1. Run the command /opt/CSCOar/bin/arstatus at the OS level:
[root@wscaaa04 ~]# /opt/CSCOar/bin/arstatus
Cisco Prime AR RADIUS server running       (pid: 24834)
Cisco Prime AR Server Agent running        (pid: 24821)
Cisco Prime AR MCD lock manager running    (pid: 24824)
Cisco Prime AR MCD server running          (pid: 24833)
Cisco Prime AR GUI running                 (pid: 24836)
SNMP Master Agent running                  (pid: 24835)
[root@wscaaa04 ~]#
Step 2. Run the command /opt/CSCOar/bin/aregcmd at the OS level and enter the admin credentials. Verify that CPAR Health is 10 out of 10, and then exit the CPAR CLI.
[root@aaa02 logs]# /opt/CSCOar/bin/aregcmd
Cisco Prime Access Registrar 7.3.0.1 Configuration Utility
Copyright (C) 1995-2017 by Cisco Systems, Inc. All rights reserved.
Cluster:
User: admin
Passphrase:
Logging in to localhost
[ //localhost ]
    LicenseInfo = PAR-NG-TPS 7.2(100TPS:) PAR-ADD-TPS 7.2(2000TPS:) PAR-RDDR-TRX 7.2() PAR-HSS 7.2()
    Radius/
    Administrators/
Server 'Radius' is Running, its health is 10 out of 10
--> exit
Step 3. Run the command netstat | grep diameter and verify that all DRA connections are established.
The output mentioned here is for an environment where Diameter links are expected. If fewer links are displayed, this represents a disconnection from the DRA that needs to be analyzed.
[root@aa02 logs]# netstat | grep diameter
tcp        0      0 aaa02.aaa.epc.:77       mp1.dra01.d:diameter    ESTABLISHED
tcp        0      0 aaa02.aaa.epc.:36       tsa6.dra01:diameter     ESTABLISHED
tcp        0      0 aaa02.aaa.epc.:47       mp2.dra01.d:diameter    ESTABLISHED
tcp        0      0 aaa02.aaa.epc.:07       tsa5.dra01:diameter     ESTABLISHED
tcp        0      0 aaa02.aaa.epc.:08       np2.dra01.d:diameter    ESTABLISHED
Step 4. Check that the TPS log shows requests being processed by CPAR. The highlighted values represent the TPS, and those are the ones you need to pay attention to.
The value of TPS must not exceed 1500.
[root@wscaaa04 ~]# tail -f /opt/CSCOar/logs/tps-11-21-2017.csv
11-21-2017,23:57:35,263,0
11-21-2017,23:57:50,237,0
11-21-2017,23:58:05,237,0
11-21-2017,23:58:20,257,0
11-21-2017,23:58:35,254,0
11-21-2017,23:58:50,248,0
11-21-2017,23:59:05,272,0
11-21-2017,23:59:20,243,0
11-21-2017,23:59:35,244,0
11-21-2017,23:59:50,233,0
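To isolate just the TPS column while the log is followed (assuming the third comma-separated field carries the TPS value, as in the sample above), a sketch:

tail -f /opt/CSCOar/logs/tps-11-21-2017.csv | awk -F, '{print $3}'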
Step 5. Look for any "error" or "alarm" messages in name_radius_1_log.
[root@aaa02 logs]# grep -E "error|alarm" name_radius_1_log
Step 6. Verify the amount of memory that the CPAR process uses by running the command:
top | grep radius
[root@sfraaa02 ~]# top | grep radius
27008 root      20   0 20.228g 2.413g  11408 S 128.3  7.7   1165:41 radius
This highlighted value must be lower than 7 GB, which is the maximum allowed at the application level.
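A point-in-time alternative to top is to query the process table directly; a sketch (RSS is reported in kilobytes):

ps -C radius -o pid,rss,vsz,cmd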
Identify the VMs that are hosted on the OSD-Compute server.
[stack@director ~]$ nova list --field name,host | grep osd-compute-0
| 46b4b9eb-a1a6-425d-b886-a0ba760e6114 | AAA-CPAR-testing-instance | pod2-stack-compute-4.localdomain |
Note: In the output shown here, the first column corresponds to the UUID, the second column is the VM name and the third column is the hostname where the VM is present. The parameters from this output will be used in subsequent sections.
Backup: SNAPSHOT PROCESS
Step 1. Open any SSH client connected to the TMO Production network and connect to the CPAR instance.
It is important not to shut down all four AAA instances within one site at the same time; shut them down one by one.
Step 2. In order to shut down the CPAR application, run the command:
/opt/CSCOar/bin/arserver stop
The message "Cisco Prime Access Registrar Server Agent shutdown complete." must appear.
Note: If a user left a CLI session open, the arserver stop command won’t work and this message is displayed:
ERROR: You cannot shut down Cisco Prime Access Registrar while the CLI is being used. Current list of running CLI with process id is: 2903 /opt/CSCOar/bin/aregcmd –s
In this example, the highlighted process ID 2903 needs to be terminated before CPAR can be stopped. If this is the case, terminate the process with the command:
kill -9 *process_id*
Then repeat Step 2.
Step 3. Verify that the CPAR application was indeed shut down by running the command:
/opt/CSCOar/bin/arstatus
These messages must appear:
Cisco Prime Access Registrar Server Agent not running
Cisco Prime Access Registrar GUI not running
Step 1. Enter the Horizon GUI website that corresponds to the Site (City) currently being worked on.
When you access Horizon, this screen is displayed.
Step 2. Navigate to Project > Instances as shown in this image.
If the user logged in was cpar, then only the four AAA instances appear in this menu.
Step 3. Shut down only one instance at a time and repeat the whole process in this document for each one. In order to shut down the VM, navigate to Actions > Shut Off Instance as shown in the image, and confirm your selection.
Step 4. Validate that the instance was indeed shut down by checking that Status = Shutoff and Power State = Shut Down, as shown in the image.
This step ends the CPAR shutdown process.
Once the CPAR VMs are down, the snapshots can be taken in parallel, as they belong to independent computes.
The four QCOW2 files are created in parallel.
Take a snapshot of each AAA instance. This takes 25 minutes to 1 hour: 25 minutes for instances that used a QCOW image as a source, and 1 hour for instances that used a raw image as a source.
3. Click Create Snapshot in order to proceed with snapshot creation (this needs to be executed on the corresponding AAA instance) as shown in the image.
4. Once the snapshot is executed, navigate to the Images menu and verify that all snapshots finish and report no problems, as seen in this image.
5. The next step is to download the snapshot in QCOW2 format and transfer it to a remote entity, in case the OSPD is lost during this process. In order to achieve this, identify the snapshot with the command glance image-list at the OSPD level.
[root@elospd01 stack]# glance image-list
+--------------------------------------+---------------------------+
| ID                                   | Name                      |
+--------------------------------------+---------------------------+
| 80f083cb-66f9-4fcf-8b8a-7d8965e47b1d | AAA-Temporary             |
| 22f8536b-3f3c-4bcc-ae1a-8f2ab0d8b950 | ELP1 cluman 10_09_2017    |
| 70ef5911-208e-4cac-93e2-6fe9033db560 | ELP2 cluman 10_09_2017    |
| e0b57fc9-e5c3-4b51-8b94-56cbccdf5401 | ESC-image                 |
| 92dfe18c-df35-4aa9-8c52-9c663d3f839b | lgnaaa01-sept102017       |
| 1461226b-4362-428b-bc90-0a98cbf33500 | tmobile-pcrf-13.1.1.iso   |
| 98275e15-37cf-4681-9bcc-d6ba18947d7b | tmobile-pcrf-13.1.1.qcow2 |
+--------------------------------------+---------------------------+
6. Once you identify the snapshot to download (in this example, lgnaaa01-sept102017, the one marked in green), you can download it in QCOW2 format with the command glance image-download, as depicted here.
[root@elospd01 stack]# glance image-download 92dfe18c-df35-4aa9-8c52-9c663d3f839b --file /tmp/AAA-CPAR-LGNoct192017.qcow2 &
7. Once the download process finishes, a compression process needs to be executed, as the snapshot can be filled with zeroes because of processes, tasks, and temporary files handled by the OS. The command to be used for file compression is virt-sparsify.
[root@elospd01 stack]# virt-sparsify AAA-CPAR-LGNoct192017.qcow2 AAA-CPAR-LGNoct192017_compressed.qcow2
This process can take some time (around 10 to 15 minutes). Once finished, the resulting file is the one that needs to be transferred to an external entity, as specified in the next step.
Verification of the file integrity is required. In order to achieve this, run the next command and look for the "corrupt" attribute at the end of its output.
[root@wsospd01 tmp]# qemu-img info AAA-CPAR-LGNoct192017_compressed.qcow2
image: AAA-CPAR-LGNoct192017_compressed.qcow2
file format: qcow2
virtual size: 150G (161061273600 bytes)
disk size: 18G
cluster_size: 65536
Format specific information:
    compat: 1.1
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
Note: If the faulty component is to be replaced on an OSD-Compute node, put Ceph into maintenance on the server before you proceed with the component replacement.
[heat-admin@pod2-stack-osd-compute-0 ~]$ sudo ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 13.07996 root default
-2 4.35999 host pod2-stack-osd-compute-0
0 1.09000 osd.0 up 1.00000 1.00000
3 1.09000 osd.3 up 1.00000 1.00000
6 1.09000 osd.6 up 1.00000 1.00000
9 1.09000 osd.9 up 1.00000 1.00000
-3 4.35999 host pod2-stack-osd-compute-1
1 1.09000 osd.1 up 1.00000 1.00000
4 1.09000 osd.4 up 1.00000 1.00000
7 1.09000 osd.7 up 1.00000 1.00000
10 1.09000 osd.10 up 1.00000 1.00000
-4 4.35999 host pod2-stack-osd-compute-2
2 1.09000 osd.2 up 1.00000 1.00000
5 1.09000 osd.5 up 1.00000 1.00000
8 1.09000 osd.8 up 1.00000 1.00000
11 1.09000 osd.11 up 1.00000 1.00000
[root@pod2-stack-osd-compute-0 ~]# sudo ceph osd set norebalance
[root@pod2-stack-osd-compute-0 ~]# sudo ceph osd set noout
[root@pod2-stack-osd-compute-0 ~]# sudo ceph status
cluster eb2bb192-b1c9-11e6-9205-525400330666
health HEALTH_WARN
noout,norebalance,sortbitwise,require_jewel_osds flag(s) set
monmap e1: 3 mons at {pod2-stack-controller-0=11.118.0.10:6789/0,pod2-stack-controller-1=11.118.0.11:6789/0,pod2-stack-controller-2=11.118.0.12:6789/0}
election epoch 10, quorum 0,1,2 pod2-stack-controller-0,pod2-stack-controller-1,pod2-stack-controller-2
osdmap e79: 12 osds: 12 up, 12 in
flags noout,norebalance,sortbitwise,require_jewel_osds
pgmap v22844323: 704 pgs, 6 pools, 804 GB data, 423 kobjects
2404 GB used, 10989 GB / 13393 GB avail
704 active+clean
client io 3858 kB/s wr, 0 op/s rd, 546 op/s wr
Note: When Ceph is removed, the VNF HD RAID goes into a Degraded state, but the hard disk must still be accessible.
[stack@director ~]$ nova stop aaa2-21
Request to stop server aaa2-21 has been accepted.
[stack@director ~]$ nova list
+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+
| ID                                   | Name                      | Status  | Task State | Power State | Networks                                                                                                     |
+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+
| 46b4b9eb-a1a6-425d-b886-a0ba760e6114 | AAA-CPAR-testing-instance | ACTIVE  | -          | Running     | tb1-mgmt=172.16.181.14, 10.225.247.233; radius-routable1=10.160.132.245; diameter-routable1=10.160.132.231  |
| 3bc14173-876b-4d56-88e7-b890d67a4122 | aaa2-21                   | SHUTOFF | -          | Shutdown    | diameter-routable1=10.160.132.230; radius-routable1=10.160.132.248; tb1-mgmt=172.16.181.7, 10.225.247.234   |
| f404f6ad-34c8-4a5f-a757-14c8ed7fa30e | aaa21june                 | ACTIVE  | -          | Running     | diameter-routable1=10.160.132.233; radius-routable1=10.160.132.244; tb1-mgmt=172.16.181.10                  |
+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+
Power off the specified server. The steps to replace a faulty component on a UCS C240 M4 server can be referred to here:
Replacing the Server Components
[root@pod2-stack-osd-compute-0 ~]# sudo ceph osd unset norebalance
[root@pod2-stack-osd-compute-0 ~]# sudo ceph osd unset noout
[root@pod2-stack-osd-compute-0 ~]# sudo ceph status
cluster eb2bb192-b1c9-11e6-9205-525400330666
health HEALTH_OK
monmap e1: 3 mons at {pod2-stack-controller-0=11.118.0.10:6789/0,pod2-stack-controller-1=11.118.0.11:6789/0,pod2-stack-controller-2=11.118.0.12:6789/0}
election epoch 10, quorum 0,1,2 pod2-stack-controller-0,pod2-stack-controller-1,pod2-stack-controller-2
osdmap e81: 12 osds: 12 up, 12 in
flags sortbitwise,require_jewel_osds
pgmap v22844355: 704 pgs, 6 pools, 804 GB data, 423 kobjects
2404 GB used, 10989 GB / 13393 GB avail
704 active+clean
client io 3658 kB/s wr, 0 op/s rd, 502 op/s wr
Recovery Process
It is possible to redeploy the previous instance with the snapshot taken in previous steps.
Step 1. (Optional) If there is no previous VM snapshot available, connect to the OSPD node where the backup was sent and SFTP the backup back to its original OSPD node with sftp root@x.x.x.x, where x.x.x.x is the IP of the original OSPD. Save the snapshot file in the /tmp directory.
Step 2. Connect to the OSPD node where the instance can be re-deployed.
Source the environment variables with this command:
# source /home/stack/pod1-stackrc-Core-CPAR
Step 3. In order to use the snapshot as an image, it is necessary to upload it to Horizon as such. Run the next command to do so.
# glance image-create --file AAA-CPAR-Date-snapshot.qcow2 --container-format bare --disk-format qcow2 --name AAA-CPAR-Date-snapshot
The process can be seen in Horizon.
Step 4. In Horizon, navigate to Project > Instances and click Launch Instance as shown in this image.
Step 5. Enter the Instance Name and choose the Availability Zone as shown in the image.
Step 6. In the Source tab, choose the image in order to create the instance. In the Select Boot Source menu, select Image; a list of images is shown. Choose the one that was previously uploaded by clicking its + sign.
Step 7. In the Flavor tab, choose the AAA flavor by clicking on the + sign.
Step 8. Finally, navigate to the Networks tab and choose the networks that the instance needs by clicking the + sign. For this case, select diameter-routable1, radius-routable1 and tb1-mgmt as shown in this image.
Finally, click on Launch Instance to create it. The progress can be monitored in Horizon:
After a few minutes, the instance is completely deployed and ready for use.
Create and assign a floating IP address
A floating IP address is a routable address, which means that it is reachable from outside of the Ultra-M/OpenStack architecture and it is able to communicate with other nodes from the network.
Step 1. In the Horizon top menu, navigate to Admin > Floating IPs.
Step 2. Click Allocate IP to Project.
Step 3. In the Allocate Floating IP window, select the Pool to which the new floating IP belongs, the Project where it is going to be assigned, and the new Floating IP Address itself.
For example:
Step 4. Click Allocate Floating IP.
Step 5. In the Horizon top menu, navigate to Project > Instances.
Step 6. In the Action column, click the arrow that points down next to the Create Snapshot button; a menu is displayed. Select the Associate Floating IP option.
Step 7. In the IP Address field, select the corresponding floating IP address intended to be used, and in Port to be associated, choose the management interface (eth0) of the new instance where this floating IP is going to be assigned. Refer to the next image as an example of this procedure.
Step 8. Finally, click Associate.
Enable SSH
Step 1. In the Horizon top menu, navigate to Project > Instances.
Step 2. Click on the name of the instance/VM that was created in section Launch a new instance.
Step 3. Click the Console tab. This displays the command line interface (CLI) of the VM.
Step 4. Once the CLI is displayed, enter the proper login credentials as shown in the image:
Username:root
Password:cisco123
Step 5. In the CLI, run the command vi /etc/ssh/sshd_config in order to edit the SSH configuration.
Step 6. Once the SSH configuration file is open, press i to edit the file. Then look for this section and change the first line from PasswordAuthentication no to PasswordAuthentication yes.
Step 7. Press ESC and run :wq! in order to save the sshd_config file changes.
Step 8. Run the command service sshd restart.
Step 9. In order to test that the SSH configuration changes have been correctly applied, open any SSH client and try to establish a remote secure connection with the floating IP assigned to the instance (for example, 10.145.0.249) and the user root.
Establish SSH session
Step 1. Open an SSH session with the IP address of the corresponding VM/server where the application is installed.
CPAR instance start
Once the activity has been completed, follow these steps in order to re-establish the CPAR services in the site that was shut down.
Step 1. Log back in to Horizon and navigate to Project > Instance > Start Instance.
Step 2. Verify that the status of the instance is Active and the power state is Running as shown in the image.
Post-activity Health Check
Step 1. Run the command /opt/CSCOar/bin/arstatus at the OS level:
[root@wscaaa04 ~]# /opt/CSCOar/bin/arstatus
Cisco Prime AR RADIUS server running       (pid: 24834)
Cisco Prime AR Server Agent running        (pid: 24821)
Cisco Prime AR MCD lock manager running    (pid: 24824)
Cisco Prime AR MCD server running          (pid: 24833)
Cisco Prime AR GUI running                 (pid: 24836)
SNMP Master Agent running                  (pid: 24835)
[root@wscaaa04 ~]#
Step 2. Run the command /opt/CSCOar/bin/aregcmd at the OS level and enter the admin credentials. Verify that CPAR Health is 10 out of 10, and then exit the CPAR CLI.
[root@aaa02 logs]# /opt/CSCOar/bin/aregcmd
Cisco Prime Access Registrar 7.3.0.1 Configuration Utility
Copyright (C) 1995-2017 by Cisco Systems, Inc. All rights reserved.
Cluster:
User: admin
Passphrase:
Logging in to localhost
[ //localhost ]
    LicenseInfo = PAR-NG-TPS 7.2(100TPS:) PAR-ADD-TPS 7.2(2000TPS:) PAR-RDDR-TRX 7.2() PAR-HSS 7.2()
    Radius/
    Administrators/
Server 'Radius' is Running, its health is 10 out of 10
--> exit
Step 3. Run the command netstat | grep diameter and verify that all DRA connections are established.
The output mentioned here is for an environment where Diameter links are expected. If fewer links are displayed, this represents a disconnection from the DRA that needs to be analyzed.
[root@aa02 logs]# netstat | grep diameter
tcp        0      0 aaa02.aaa.epc.:77       mp1.dra01.d:diameter    ESTABLISHED
tcp        0      0 aaa02.aaa.epc.:36       tsa6.dra01:diameter     ESTABLISHED
tcp        0      0 aaa02.aaa.epc.:47       mp2.dra01.d:diameter    ESTABLISHED
tcp        0      0 aaa02.aaa.epc.:07       tsa5.dra01:diameter     ESTABLISHED
tcp        0      0 aaa02.aaa.epc.:08       np2.dra01.d:diameter    ESTABLISHED
Step 4. Check that the TPS log shows requests being processed by CPAR. The highlighted values represent the TPS, and those are the ones you need to pay attention to.
The value of TPS must not exceed 1500.
[root@wscaaa04 ~]# tail -f /opt/CSCOar/logs/tps-11-21-2017.csv
11-21-2017,23:57:35,263,0
11-21-2017,23:57:50,237,0
11-21-2017,23:58:05,237,0
11-21-2017,23:58:20,257,0
11-21-2017,23:58:35,254,0
11-21-2017,23:58:50,248,0
11-21-2017,23:59:05,272,0
11-21-2017,23:59:20,243,0
11-21-2017,23:59:35,244,0
11-21-2017,23:59:50,233,0
Step 5. Look for any "error" or "alarm" messages in name_radius_1_log.
[root@aaa02 logs]# grep -E "error|alarm" name_radius_1_log
Step 6. Verify the amount of memory that the CPAR process uses by running the command:
top | grep radius
[root@sfraaa02 ~]# top | grep radius
27008 root      20   0 20.228g 2.413g  11408 S 128.3  7.7   1165:41 radius
This highlighted value must be lower than 7 GB, which is the maximum allowed at the application level.
Note: A healthy cluster requires two active controllers, so verify that the two remaining controllers are Online and active.
[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod2-stack-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Jul  6 09:03:37 2018
Last change: Fri Jul  6 09:03:35 2018 by root via crm_attribute on pod2-stack-controller-0
3 nodes and 19 resources configured
Online: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
Full list of resources:
ip-11.120.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
Clone Set: haproxy-clone [haproxy]
Started: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
Master/Slave Set: galera-master [galera]
Masters: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
Slaves: [ pod2-stack-controller-0 ]
ip-192.200.0.110 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
ip-11.120.0.44 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
ip-11.118.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
Clone Set: rabbitmq-clone [rabbitmq]
Started: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
Stopped: [ pod2-stack-controller-0 ]
ip-10.225.247.214 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
Master/Slave Set: redis-master [redis]
Masters: [ pod2-stack-controller-2 ]
Slaves: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
ip-11.119.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
openstack-cinder-volume (systemd:openstack-cinder-volume): Started pod2-stack-controller-1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs cluster standby
[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod2-stack-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Jul  6 09:03:10 2018
Last change: Fri Jul  6 09:03:06 2018 by root via crm_attribute on pod2-stack-controller-0
3 nodes and 19 resources configured
Node pod2-stack-controller-0: standby
Online: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
Full list of resources:
ip-11.120.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
Clone Set: haproxy-clone [haproxy]
Started: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
Stopped: [ pod2-stack-controller-0 ]
Master/Slave Set: galera-master [galera]
Masters: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
ip-192.200.0.110 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
ip-11.120.0.44 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
ip-11.118.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
Clone Set: rabbitmq-clone [rabbitmq]
Started: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
ip-10.225.247.214 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
Master/Slave Set: redis-master [redis]
Masters: [ pod2-stack-controller-2 ]
Slaves: [ pod2-stack-controller-1 ]
Stopped: [ pod2-stack-controller-0 ]
ip-11.119.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
openstack-cinder-volume (systemd:openstack-cinder-volume): Started pod2-stack-controller-1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Also, pcs status on the other two controllers must show the node as standby.
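For example, from the OSPD node you can check either of the remaining controllers (heat-admin and the controller hostnames are the ones used throughout this document):

ssh heat-admin@pod2-stack-controller-1 "sudo pcs status | grep -i standby"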
Power off the specified server. The steps to replace a faulty component on a UCS C240 M4 server can be referred to here:
Replacing the Server Components
[stack@director ~]$ source stackrc
[stack@director ~]$ nova list
+--------------------------------------+--------------------------+--------+------------+-------------+------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+--------------------------+--------+------------+-------------+------------------------+
| 03f15071-21aa-4bcf-8fdd-acdbde305168 | pod2-stack-compute-0 | ACTIVE | - | Running | ctlplane=192.200.0.106 |
| 1f725ce3-948d-49e9-aed9-b99e73d82644 | pod2-stack-compute-1 | ACTIVE | - | Running | ctlplane=192.200.0.107 |
| fbc13c78-dc06-4ac9-a3c5-595ccc147adc | pod2-stack-compute-2 | ACTIVE | - | Running | ctlplane=192.200.0.119 |
| 3b94e0b1-47dc-4960-b3eb-d02ffe9ae693 | pod2-stack-compute-3 | ACTIVE | - | Running | ctlplane=192.200.0.112 |
| 5dbac94d-19b9-493e-a366-1e2e2e5e34c5 | pod2-stack-compute-4 | ACTIVE | - | Running | ctlplane=192.200.0.116 |
| b896c73f-d2c8-439c-bc02-7b0a2526dd70 | pod2-stack-controller-0 | ACTIVE | - | Running | ctlplane=192.200.0.113 |
| 2519ce67-d836-4e5f-a672-1a915df75c7c | pod2-stack-controller-1 | ACTIVE | - | Running | ctlplane=192.200.0.105 |
| e19b9625-5635-4a52-a369-44310f3e6a21 | pod2-stack-controller-2 | ACTIVE | - | Running | ctlplane=192.200.0.120 |
| 6810c884-1cb9-4321-9a07-192443920f1f | pod2-stack-osd-compute-0 | ACTIVE | - | Running | ctlplane=192.200.0.109 |
| 26d3f7b1-ba97-431f-aa6e-ba91661db45d | pod2-stack-osd-compute-1 | ACTIVE | - | Running | ctlplane=192.200.0.117 |
| 6e4a8aa9-4870-465a-a7e2-0932ff55e34b | pod2-stack-osd-compute-2 | ACTIVE | - | Running | ctlplane=192.200.0.103 |
+--------------------------------------+--------------------------+--------+------------+-------------+------------------------+
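The ctlplane address from this output is what you use in order to reach the replaced controller; for example, for pod2-stack-controller-0:

[stack@director ~]$ ssh heat-admin@192.200.0.113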
[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs cluster unstandby
[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod2-stack-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Jul  6 09:03:37 2018
Last change: Fri Jul  6 09:03:35 2018 by root via crm_attribute on pod2-stack-controller-0
3 nodes and 19 resources configured
Online: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
Full list of resources:
ip-11.120.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
Clone Set: haproxy-clone [haproxy]
Started: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
Master/Slave Set: galera-master [galera]
Masters: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
Slaves: [ pod2-stack-controller-0 ]
ip-192.200.0.110 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
ip-11.120.0.44 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
ip-11.118.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
Clone Set: rabbitmq-clone [rabbitmq]
Started: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
Stopped: [ pod2-stack-controller-0 ]
ip-10.225.247.214 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
Master/Slave Set: redis-master [redis]
Masters: [ pod2-stack-controller-2 ]
Slaves: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
ip-11.119.0.49 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-2
openstack-cinder-volume (systemd:openstack-cinder-volume): Started pod2-stack-controller-1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[heat-admin@pod2-stack-controller-0 ~]$ sudo ceph -s
cluster eb2bb192-b1c9-11e6-9205-525400330666
health HEALTH_OK
monmap e1: 3 mons at {pod2-stack-controller-0=11.118.0.10:6789/0,pod2-stack-controller-1=11.118.0.11:6789/0,pod2-stack-controller-2=11.118.0.12:6789/0}
election epoch 10, quorum 0,1,2 pod2-stack-controller-0,pod2-stack-controller-1,pod2-stack-controller-2
osdmap e81: 12 osds: 12 up, 12 in
flags sortbitwise,require_jewel_osds
pgmap v22844355: 704 pgs, 6 pools, 804 GB data, 423 kobjects
2404 GB used, 10989 GB / 13393 GB avail
704 active+clean
client io 3658 kB/s wr, 0 op/s rd, 502 op/s wr