Replacement of Controller Server UCS C240 M4 - CPAR

Available Languages

Download Options

PDF (440.7 KB)
View with Adobe Reader on a variety of devices
ePub (1.2 MB)
View in various apps on iPhone, iPad, Android, Sony Reader, or Windows Phone
Mobi (Kindle) (616.7 KB)
View on Kindle device or Kindle app on multiple devices

Updated:September 6, 2018

Document ID:213662

Bias-Free Language

The documentation set for this product strives to use bias-free language. For the purposes of this documentation set, bias-free is defined as language that does not imply discrimination based on age, disability, gender, racial identity, ethnic identity, sexual orientation, socioeconomic status, and intersectionality. Exceptions may be present in the documentation due to language that is hardcoded in the user interfaces of the product software, language used based on RFP documentation, or language that is used by a referenced third-party product. Learn more about how Cisco is using Inclusive Language.

Introduction

This document describes the steps required to replace a faulty controller server in an Ultra-M setup.

This procedure applies for an Openstack environment using NEWTON version where Elastic Services Controller (ESC) does not manage Cisco Prime Access Registrar (CPAR) and CPAR is installed directly on the VM deployed on Openstack.

Background Information

Ultra-M is a pre-packaged and validated virtualized mobile packet core solution that is designed in order to simplify the deployment of VNFs. OpenStack is the Virtualized Infrastructure Manager (VIM) for Ultra-M and consists of these node types:

Compute
Object Storage Disk - Compute (OSD - Compute)
Controller
OpenStack Platform - Director (OSPD)

The high-level architecture of Ultra-M and the components involved are shown in this image:

This document is intended for Cisco personnel who are familiar with Cisco Ultra-M platform and it details the steps that are required to be carried out at OpenStack and Redhat OS.

Note: Ultra M 5.1.x release is considered in order to define the procedures in this document.

Abbreviations

MOP	Method of Procedure
OSD	Object Storage Disks
OSPD	OpenStack Platform - Director
HDD	Hard Disk Drive
SSD	Solid State Drive
VIM	Virtual Infrastructure Manager
VM	Virtual Machine
EM	Element Manager
UAS	Ultra Automation Services
UUID	Universally Unique IDentifier

Workflow of the MoP

This image shows the high level workflow of the replacement procedure.

Prerequisites

Backup

In case of recovery, Cisco recommends to take a backup of the OSPD database (DB) with the use of these steps:

[root@director ~]# mysqldump --opt --all-databases > /root/undercloud-all-databases.sql
[root@director ~]# tar --xattrs -czf undercloud-backup-`date +%F`.tar.gz /root/undercloud-all-databases.sql 
/etc/my.cnf.d/server.cnf /var/lib/glance/images /srv/node /home/stack
tar: Removing leading `/' from member names

Preliminary Status Check

It is important to check the current status of the OpenStack environment and services and ensure that it is healthy before you proceed with the replacement procedure. It can help avoid complications at the time of the Controller replacement process.

Step 1. Check the status of OpenStack and the node list:

[stack@director ~]$ source stackrc 
[stack@director ~]$ openstack stack list --nested
[stack@director ~]$ ironic node-list
[stack@director ~]$ nova list

Step 2. Check the Pacemaker status on controllers:

Login to one of the active controllers and check the pacemaker status. All services should run on the available controllers and stopped on the failed controller.

[stack@pod2-stack-controller-0 ~]# pcs status

<snip>
Online: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
Offline: [ pod2-stack-controller-2 ]

Full list of resources:
 ip-11.120.0.109 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-0
 ip-172.25.22.109 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
 ip-192.200.0.107 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-0

 Clone Set: haproxy-clone [haproxy]
 Started: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
 Stopped: [ pod2-stack-controller-2 ]

 Master/Slave Set: galera-master [galera]
 Masters: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
 Stopped: [ pod2-stack-controller-2 ]
 ip-11.120.0.110 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-0
 ip-11.119.0.110 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1

 Clone Set: rabbitmq-clone [rabbitmq]
 Started: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
 Stopped: [ pod2-stack-controller-2 ]

 Master/Slave Set: redis-master [redis]
 Masters: [ pod2-stack-controller-0 ]
 Slaves: [ pod2-stack-controller-1 ]
 Stopped: [ pod2-stack-controller-2 ]

 ip-11.118.0.104 (ocf::heartbeat:IPaddr2): Started pod2-stack-controller-1
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started pod2-stack-controller-0

 my-ipmilan-for-controller-6 (stonith:fence_ipmilan): Started pod2-stack-controller-1
 my-ipmilan-for-controller-4 (stonith:fence_ipmilan): Started pod2-stack-controller-0
 my-ipmilan-for-controller-7 (stonith:fence_ipmilan): Started pod2-stack-controller-0

Failed Actions:
Daemon Status:

 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled

In this example, Controller-2 is offline. It is, therefore, replaced. Controller-0 and Controller-1 are operational and runs the cluster services.

Step 3. Check MariaDB status in the active controllers:

[stack@director] nova list | grep control
| b896c73f-d2c8-439c-bc02-7b0a2526dd70 | pod2-stack-controller-0 | ACTIVE | - | Running | ctlplane=192.200.0.113 |
| 2519ce67-d836-4e5f-a672-1a915df75c7c | pod2-stack-controller-1 | ACTIVE | - | Running | ctlplane=192.200.0.105 |

[stack@director ~]$ for i in 192.200.0.113 192.200.0.105 ; do echo "*** $i ***" ; ssh heat-admin@$i "sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_local_state_comment'\" ; sudo mysql --exec=\"SHOW STATUS LIKE 'wsrep_cluster_size'\""; done

*** 192.200.0.113 ***
Variable_nameValue
wsrep_local_state_comment   Synced
Variable_nameValue
wsrep_cluster_size          2
*** 192.200.0.105 ***
Variable_nameValue
wsrep_local_state_comment   Synced
Variable_nameValue
wsrep_cluster_size          2

Verify that these lines are present for each active controller:

wsrep_local_state_comment: Synced

wsrep_cluster_size: 2

Steo 4. Check Rabbitmq status in the active controllers. The failed controller should not appear in the list of the nodes that run.

[heat-admin@pod2-stack-controller-0 ~] sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@pod2-stack-controller-0' ...
[{nodes,[{disc,['rabbit@pod2-stack-controller-0','rabbit@pod2-stack-controller-1',
                'rabbit@pod2stack-controller-2']}]},
{running_nodes,['rabbit@pod2-stack-controller-1',
                 'rabbit@pod2-stack-controller-0']},
 {cluster_name,<<"rabbit@pod2-stack-controller-2.localdomain">>},
 {partitions,[]},
 {alarms,[{'rabbit@pod2-stack-controller-1',[]},
          {'rabbit@pod2-stack-controller-0',[]}]}]


[heat-admin@pod2-stack-controller-1 ~] sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@pod2-stack-controller-1' ...
[{nodes,[{disc,['rabbit@pod2-stack-controller-0','rabbit@pod2-stack-controller-1',
                'rabbit@pod2-stack-controller-2']}]},
 {running_nodes,['rabbit@pod2-stack-controller-0',
                 'rabbit@pod2-stack-controller-1']},
 {cluster_name,<<"rabbit@pod2-stack-controller-2.localdomain">>},
 {partitions,[]},
 {alarms,[{'rabbit@pod2-stack-controller-0',[]},
          {'rabbit@pod2-stack-controller-1',[]}]}]

Step 5. Check if all the undercloud services are in loaded, active and running status from the OSP-D node.

[stack@director ~]$ systemctl list-units "openstack*" "neutron*" "openvswitch*"

UNIT                                       LOAD   ACTIVE SUB     DESCRIPTION

neutron-dhcp-agent.service                 loaded active running OpenStack Neutron DHCP Agent
neutron-openvswitch-agent.service          loaded active running OpenStack Neutron Open vSwitch Agent
neutron-ovs-cleanup.service                loaded active exited  OpenStack Neutron Open vSwitch Cleanup Utility
neutron-server.service                     loaded active running OpenStack Neutron Server
openstack-aodh-evaluator.service           loaded active running OpenStack Alarm evaluator service
openstack-aodh-listener.service            loaded active running OpenStack Alarm listener service
openstack-aodh-notifier.service            loaded active running OpenStack Alarm notifier service
openstack-ceilometer-central.service       loaded active running OpenStack ceilometer central agent
openstack-ceilometer-collector.service     loaded active running OpenStack ceilometer collection service
openstack-ceilometer-notification.service  loaded active running OpenStack ceilometer notification agent
openstack-glance-api.service               loaded active running OpenStack Image Service (code-named Glance) API server
openstack-glance-registry.service          loaded active running OpenStack Image Service (code-named Glance) Registry server
openstack-heat-api-cfn.service             loaded active running Openstack Heat CFN-compatible API Service
openstack-heat-api.service                 loaded active running OpenStack Heat API Service
openstack-heat-engine.service              loaded active running Openstack Heat Engine Service
openstack-ironic-api.service               loaded active running OpenStack Ironic API service
openstack-ironic-conductor.service         loaded active running OpenStack Ironic Conductor service
openstack-ironic-inspector-dnsmasq.service loaded active running PXE boot dnsmasq service for Ironic Inspector
openstack-ironic-inspector.service         loaded active running Hardware introspection service for OpenStack Ironic
openstack-mistral-api.service              loaded active running Mistral API Server
openstack-mistral-engine.service           loaded active running Mistral Engine Server
openstack-mistral-executor.service         loaded active running Mistral Executor Server
openstack-nova-api.service                 loaded active running OpenStack Nova API Server
openstack-nova-cert.service                loaded active running OpenStack Nova Cert Server
openstack-nova-compute.service             loaded active running OpenStack Nova Compute Server
openstack-nova-conductor.service           loaded active running OpenStack Nova Conductor Server
openstack-nova-scheduler.service           loaded active running OpenStack Nova Scheduler Server
openstack-swift-account-reaper.service     loaded active running OpenStack Object Storage (swift) - Account Reaper
openstack-swift-account.service            loaded active running OpenStack Object Storage (swift) - Account Server
openstack-swift-container-updater.service  loaded active running OpenStack Object Storage (swift) - Container Updater
openstack-swift-container.service          loaded active running OpenStack Object Storage (swift) - Container Server
openstack-swift-object-updater.service     loaded active running OpenStack Object Storage (swift) - Object Updater
openstack-swift-object.service             loaded active running OpenStack Object Storage (swift) - Object Server
openstack-swift-proxy.service              loaded active running OpenStack Object Storage (swift) - Proxy Server
openstack-zaqar.service                    loaded active running OpenStack Message Queuing Service (code-named Zaqar) Server
openstack-zaqar@1.service                  loaded active running OpenStack Message Queuing Service (code-named Zaqar) Server Instance 1
openvswitch.service                        loaded active exited  Open vSwitch

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

37 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

Disable Fencing in the Controller Cluster

[root@pod2-stack-controller-0 ~]# sudo pcs property set stonith-enabled=false
[root@pod2-stack-controller-0 ~]# pcs property show

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: tripleo_cluster
 dc-version: 1.1.15-11.el7_3.4-e174ec8
 have-watchdog: false
 maintenance-mode: false
 redis_REPL_INFO: pod2-stack-controller-2
 stonith-enabled: false

Node Attributes:
 pod2-stack-controller-0: rmq-node-attr-last-known-rabbitmq=rabbit@pod2-stack-controller-0
 pod2-stack-controller-1: rmq-node-attr-last-known-rabbitmq=rabbit@pod2-stack-controller-1
 pod2-stack-controller-2: rmq-node-attr-last-known-rabbitmq=rabbit@pod2-stack-controller-2

Install the New Controller Node

The steps in order to install a new UCS C240 M4 server and the initial setup steps can be referred from: Cisco UCS C240 M4 Server Installation and Service Guide

Step 1. Login to the server with the use of the CIMC IP.

Step 2. Perform BIOS upgrade if the firmware is not as per the recommended version used previously. Steps for BIOS upgrade are given here: Cisco UCS C-Series Rack-Mount Server BIOS Upgrade Guide

Step 3. In order to verify the status of Physical drives, which is Unconfigured Good, navigate to Storage > Cisco 12G SAS Modular Raid Controller (SLOT-HBA) > Physical Drive Info, as shown in the image.

Step 4. In order to create a virtual drive from the physical drives with RAID Level 1, navigate to Storage > Cisco 12G SAS Modular Raid Controller (SLOT-HBA) > Controller Info > Create Virtual Drive from Unused Physical Drives.

Select the VD and configureSet as Boot Drive:

Step 5. In order to enable IPMI over LAN, navigate to Admin > Communication Services > Communication Services.

Step 6. In order to disable hyperthreading, navigate to Compute > BIOS > Configure BIOS > Advanced > Processor Configuration.

Note: The image shown here and the configuration steps mentioned in this section are with reference to the firmware version 3.0(3e) and there might be slight variations if you work on other versions.

Controller Node Replacement in Overcloud

This section covers the steps that are required in order to replace the faulty controller with the new one in the overcloud. For this, thedeploy.shscript that was used to bring up the stack would be re-used. At the time of the deployment, in the Controller Nodes Post Deployment phase, the update would fail due to some limitations in the Puppet modules. Manual intervention is required before you restart the deployment script.

Prepare to Remove Failed Controller Node

Step 1. Identify the index of the failed controller. The index is the numeric suffix on the controller name in the OpenStack server list output. In this example, the index is 2:

[stack@director ~]$ nova list | grep controller
| b896c73f-d2c8-439c-bc02-7b0a2526dd70 | pod2-stack-controller-0 | ACTIVE | - | Running | ctlplane=192.200.0.113 |
| 2519ce67-d836-4e5f-a672-1a915df75c7c | pod2-stack-controller-1 | ACTIVE | - | Running | ctlplane=192.200.0.105 |
| e19b9625-5635-4a52-a369-44310f3e6a21 | pod2-stack-controller-2 | ACTIVE | - | Running | ctlplane=192.200.0.120 |

Step 2. Create a Yaml file~templates/remove-controller.yamlthat would define the node to delete. Use the index found in the previous step for the entry in the resource list:

[stack@director ~]$ cat templates/remove-controller.yaml

parameters:
  ControllerRemovalPolicies:
    [{'resource_list': [‘2’]}]  

parameter_defaults:
  CorosyncSettleTries: 5

Step 3. Make a copy of the deploy script that is used in order to install the overcloud and insert a line in order to include theremove-controller.yamlfile created previously:

[stack@director ~]$ cp deploy.sh deploy-removeController.sh
[stack@director ~]$ cat deploy-removeController.sh
time openstack overcloud deploy --templates \
-r ~/custom-templates/custom-roles.yaml \
-e /home/stack/templates/remove-controller.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/neutron-sriov.yaml \
-e ~/custom-templates/network.yaml \
-e ~/custom-templates/ceph.yaml \
-e ~/custom-templates/compute.yaml \
-e ~/custom-templates/layout-removeController.yaml \
-e ~/custom-templates/rabbitmq.yaml \
--stack pod2-stack \
--debug \
--log-file overcloudDeploy_$(date +%m_%d_%y__%H_%M_%S).log \
--neutron-flat-networks phys_pcie1_0,phys_pcie1_1,phys_pcie4_0,phys_pcie4_1 \
--neutron-network-vlan-ranges datacentre:101:200 \ 
--neutron-disable-tunneling \ 
--verbose --timeout 180

Step 4. Identify the ID of the controller to be replaced, with the use of the commands mentioned here and move it to the maintenance mode:

[stack@director ~]$ nova list | grep controller

| b896c73f-d2c8-439c-bc02-7b0a2526dd70 | pod2-stack-controller-0 | ACTIVE | - | Running | ctlplane=192.200.0.113 |
| 2519ce67-d836-4e5f-a672-1a915df75c7c | pod2-stack-controller-1 | ACTIVE | - | Running | ctlplane=192.200.0.105 |
| e19b9625-5635-4a52-a369-44310f3e6a21 | pod2-stack-controller-2 | ACTIVE | - | Running | ctlplane=192.200.0.120 |


[stack@director ~]$ openstack baremetal node list | grep e19b9625-5635-4a52-a369-44310f3e6a21

| e7c32170-c7d1-4023-b356-e98564a9b85b | None | e19b9625-5635-4a52-a369-44310f3e6a21 | power off   | active             | False       |


[stack@b10-ospd ~]$ openstack baremetal node maintenance set e7c32170-c7d1-4023-b356-e98564a9b85b


[stack@director~]$ openstack baremetal node list | grep True

| e7c32170-c7d1-4023-b356-e98564a9b85b | None | e19b9625-5635-a369-44310f3e6a21 | power off   | active             | True        |

Step 5. In order to ensure that the DB runs at the time of the replacement procedure, remove Galera from pacemaker control and run this command on one of the active controllers:

[root@pod2-stack-controller-0 ~]# sudo pcs resource unmanage galera

[root@pod2-stack-controller-0 ~]# sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod2-stack-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Thu Nov 16 16:51:18 2017                   Last change: Thu Nov 16 16:51:12 2017 by root via crm_resource on pod2-stack-controller-0
3 nodes and 22 resources configured
Online: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
OFFLINE: [ pod2-stack-controller-2 ]

Full list of resources:

 ip-11.120.0.109          (ocf::heartbeat:IPaddr2):        Started pod2-stack-controller-0
 ip-172.25.22.109        (ocf::heartbeat:IPaddr2):        Started pod2-stack-controller-1
 ip-192.200.0.107        (ocf::heartbeat:IPaddr2):        Started pod2-stack-controller-0

 Clone Set: haproxy-clone [haproxy]
     Started: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
     Stopped: [ pod2-stack-controller-2 ]

 Master/Slave Set: galera-master [galera] (unmanaged)
     galera         (ocf::heartbeat:galera):          Master pod2-stack-controller-0 (unmanaged)
     galera         (ocf::heartbeat:galera):          Master pod2-stack-controller-1 (unmanaged)

     Stopped: [ pod2-stack-controller-2 ]
 ip-11.120.0.110          (ocf::heartbeat:IPaddr2):        Started pod2-stack-controller-0
 ip-11.119.0.110          (ocf::heartbeat:IPaddr2):        Started pod2-stack-controller-1

<snip>

Prepare to Add New Controller Node

Step 1. Create acontrollerRMA.jsonfile with only the new Controller details. Ensure that the index number on the new Controller has not been used before. Typically, increment to the next highest controller number.

Example: Highest prior was Controller-2, so create Controller-3.

Note: Be mindful of the json format.

[stack@director ~]$ cat controllerRMA.json
{
    "nodes": [
        {
            "mac": [
                <MAC_ADDRESS>
            ],
            "capabilities": "node:controller-3,boot_option:local",
            "cpu": "24",
            "memory": "256000",
            "disk": "3000",
            "arch": "x86_64",
            "pm_type": "pxe_ipmitool",
            "pm_user": "admin",
            "pm_password": "<PASSWORD>",
            "pm_addr": "<CIMC_IP>"
        }
    ]
}

Step 2. Import the new node with the use of the json file created in the previous step:

[stack@director ~]$ openstack baremetal import --json controllerRMA.json

Started Mistral Workflow. Execution ID: 67989c8b-1225-48fe-ba52-3a45f366e7a0

Successfully registered node UUID 048ccb59-89df-4f40-82f5-3d90d37ac7dd

Started Mistral Workflow. Execution ID: c6711b5f-fa97-4c86-8de5-b6bc7013b398

Successfully set all nodes to available.

[stack@director ~]$ openstack baremetal node list | grep available

| 048ccb59-89df-4f40-82f5-3d90d37ac7dd | None | None                                 | power off   | available          | False

Step 3. Set the node to manage state:

[stack@director ~]$ openstack baremetal node manage 048ccb59-89df-4f40-82f5-3d90d37ac7dd 
[stack@director ~]$ openstack baremetal node list | grep off 
| 048ccb59-89df-4f40-82f5-3d90d37ac7dd | None | None | power off | manageable | False |

Step 4. Run introspection:

[stack@director ~]$ openstack overcloud node introspect 048ccb59-89df-4f40-82f5-3d90d37ac7dd --provide
Started Mistral Workflow. Execution ID: f73fb275-c90e-45cc-952b-bfc25b9b5727
Waiting for introspection to finish...
Successfully introspected all nodes.
Introspection completed.
Started Mistral Workflow. Execution ID: a892b456-eb15-4c06-b37e-5bc3f6c37c65
Successfully set all nodes to available

[stack@director ~]$ openstack baremetal node list | grep available
| 048ccb59-89df-4f40-82f5-3d90d37ac7dd | None | None                                 | power off   | available          | False       |

Step 5. Mark the available node with the new controller properties. Ensure to use the controller ID designated for the new controller, as used in thecontrollerRMA.jsonfile:

[stack@director ~]$ openstack baremetal node set --property capabilities='node:controller-3,profile:control,boot_option:local' 048ccb59-89df-4f40-82f5-3d90d37ac7dd

Step 6. In the deploy script, there is a custom-template calledlayout.yaml which among other things, specifies what IP addresses are assigned to the controllers for the various interfaces. On a new stack, there are 3 addresses defined for Controller-0, Controller-1, and Controller-2. When you add a new controller, ensure that you add a next IP address in sequence for each subnet:

ControllerIPs:
 internal_api:
 - 11.120.0.10
 - 11.120.0.11
 - 11.120.0.12
 - 11.120.0.13
 tenant:
 - 11.117.0.10
 - 11.117.0.11
 - 11.117.0.12
 - 11.117.0.13
 storage:
 - 11.118.0.10 
 - 11.118.0.11
 - 11.118.0.12
 - 11.118.0.13
 storage_mgmt:
 - 11.119.0.10 
 - 11.119.0.11
 - 11.119.0.12
 - 11.119.0.13

Step 7. Now run the deploy-removecontroller.shthat was previously created, in order to remove the old node and add the new node.

Note: This step is expected to fail in ControllerNodesDeployment_Step1. At that point, manual intervention is required.

[stack@b10-ospd ~]$ ./deploy-addController.sh
START with options: [u'overcloud', u'deploy', u'--templates', u'-r', u'/home/stack/custom-templates/custom-roles.yaml', u'-e', u'/usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml', u'-e', u'/usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml', u'-e', u'/usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml', u'-e', u'/usr/share/openstack-tripleo-heat-templates/environments/neutron-sriov.yaml', u'-e', u'/home/stack/custom-templates/network.yaml', u'-e', u'/home/stack/custom-templates/ceph.yaml', u'-e', u'/home/stack/custom-templates/compute.yaml', u'-e', u'/home/stack/custom-templates/layout-removeController.yaml', u'-e', u'/home/stack/custom-templates/rabbitmq.yaml', u'--stack', u'newtonoc', u'--debug', u'--log-file', u'overcloudDeploy_11_15_17__07_46_35.log', u'--neutron-flat-networks', u'phys_pcie1_0,phys_pcie1_1,phys_pcie4_0,phys_pcie4_1', u'--neutron-network-vlan-ranges', u'datacentre:101:200', u'--neutron-disable-tunneling', u'--verbose', u'--timeout', u'180']
  :
DeploymentError: Heat Stack update failed
END return value: 1

real     42m1.525s
user     0m3.043s
sys      0m0.614s

The progress/status of the deployment can be monitored with these commands:

[stack@director~]$ openstack stack list --nested | grep -iv complete

+--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------+--------------------------------------+

| ID                                   | Stack Name                                                                                                                                                              | Stack Status    | Creation Time        | Updated Time         | Parent                               |

+--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------+--------------------------------------+

| c1e338f2-877e-4817-93b4-9a3f0c0b3d37 | pod2-stack-AllNodesDeploySteps-5psegydpwxij-ComputeDeployment_Step1-swnuzjixac43                                                                                          | UPDATE_FAILED   | 2017-10-08T14:06:07Z | 2017-11-16T18:09:43Z | e90f00ef-2499-4ec3-90b4-d7def6e97c47 |

| 1db4fef4-45d3-4125-bd96-2cc3297a69ff | pod2-stack-AllNodesDeploySteps-5psegydpwxij-ControllerDeployment_Step1-hmn3hpruubcn                                                                                       | UPDATE_FAILED   | 2017-10-08T14:03:05Z | 2017-11-16T18:12:12Z | e90f00ef-2499-4ec3-90b4-d7def6e97c47 |

| e90f00ef-2499-4ec3-90b4-d7def6e97c47 | pod2-stack-AllNodesDeploySteps-5psegydpwxij                                                                                                                               | UPDATE_FAILED   | 2017-10-08T13:59:25Z | 2017-11-16T18:09:25Z | 6c4b604a-55a4-4a19-9141-28c844816c0d |

| 6c4b604a-55a4-4a19-9141-28c844816c0d | pod2-stack                                                                                                                                                              | UPDATE_FAILED   | 2017-10-08T12:37:11Z | 2017-11-16T17:35:35Z | None                                 |

+--------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------+----------------------+--------------------------------------+

Manual Intervention

Step 1. On the OSP-D server, run the OpenStack server list command in order to list the available controllers. The newly added controller should appear in the list:

[stack@director ~]$ openstack server list | grep controller-3
| 3e6c3db8-ba24-48d9-b0e8-1e8a2eb8b5ff | pod2-stack-controller-3  | ACTIVE | ctlplane=192.200.0.103 | overcloud-full |

Step 2. Connect to one of the active controllers (not the newly added controller) and look at the file/etc/corosync/corosycn.conf. Find thenodelistwhich assigns anodeidto each controller. Find the entry for the failed node and note itsnodeid:

[root@pod2-stack-controller-0 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: tripleo_cluster
    transport: udpu
    token: 10000
}

nodelist {
    node {
        ring0_addr: pod2-stack-controller-0
        nodeid: 5
    }
    node {
        ring0_addr: pod2-stack-controller-1
        nodeid: 7
    }
    node {
        ring0_addr: pod2-stack-controller-2
        nodeid: 8
    }
}

Step 3. Login to each of the active controllers. Remove the failed node and restart the service. In this case, removepod2-stack-controller-2. Do not perform this action on the newly added controller:

[root@pod2-stack-controller-0 ~]# sudo pcs cluster localnode remove pod2-stack-controller-2
pod2-stack-controller-2: successfully removed!
[root@pod2-stack-controller-0 ~]# sudo pcs cluster reload corosync
Corosync reloaded

[root@pod2-stack-controller-1 ~]# sudo pcs cluster localnode remove pod2-stack-controller-2
pod2-stack-controller-2: successfully removed!
[root@pod2-stack-controller-1 ~]# sudo pcs cluster reload corosync
Corosync reloaded

Step 4. Run this command from one of the active controllers in order to delete the failed node from the cluster:

[root@pod2-stack-controller-0 ~]# sudo crm_node -R pod2-stack-controller-2 --force

Step 5. Run this command from one of the active controllers in order to delete the failed node from therabbitmqcluster:

[root@pod2-stack-controller-0 ~]# sudo rabbitmqctl forget_cluster_node rabbit@pod2-stack-controller-2
Removing node 'rabbit@newtonoc-controller-2' from cluster ...

Step 6. Delete the failed node from the MongoDB. In order to do this, you need to find the active Mongo node. Usenetstatto find the IP address of the host:

[root@pod2-stack-controller-0 ~]# sudo netstat -tulnp | grep 27017
tcp        0      0 11.120.0.10:27017       0.0.0.0:*               LISTEN      219577/mongod

Step 7. Log in to the node and check in order to see if it is the master with the use of the IP address and port number from the previous command:

[heat-admin@pod2-stack-controller-0 ~]$ echo "db.isMaster()" | mongo --host 11.120.0.10:27017
MongoDB shell version: 2.6.11
connecting to: 11.120.0.10:27017/test
{
         "setName" : "tripleo",
         "setVersion" : 9,
         "ismaster" : true,
         "secondary" : false,
         "hosts" : [
                  "11.120.0.10:27017",
                  "11.120.0.12:27017",
                  "11.120.0.11:27017"
         ],
         "primary" : "11.120.0.10:27017",
         "me" : "11.120.0.10:27017",
         "electionId" : ObjectId("5a0d2661218cb0238b582fb1"),
         "maxBsonObjectSize" : 16777216,
         "maxMessageSizeBytes" : 48000000,
         "maxWriteBatchSize" : 1000,
         "localTime" : ISODate("2017-11-16T18:36:34.473Z"),
         "maxWireVersion" : 2,
         "minWireVersion" : 0,
         "ok" : 1
}

If the node is not the master, log in to the other active controller and perform the same step.

Step 1. From the master, list the available nodes with the use of thers.status()command. Find the old/unresponsive node and identify the mongo node name.

[root@pod2-stack-controller-0 ~]# mongo --host 11.120.0.10
MongoDB shell version: 2.6.11
connecting to: 11.120.0.10:27017/test
<snip>
tripleo:PRIMARY> rs.status()
{
         "set" : "tripleo",
         "date" : ISODate("2017-11-14T13:27:14Z"),
         "myState" : 1,
         "members" : [
                  {
                           "_id" : 0,
                           "name" : "11.120.0.10:27017",
                           "health" : 1,
                           "state" : 1,
                           "stateStr" : "PRIMARY",
                           "uptime" : 418347,
                           "optime" : Timestamp(1510666033, 1),
                           "optimeDate" : ISODate("2017-11-14T13:27:13Z"),
                           "electionTime" : Timestamp(1510247693, 1),
                           "electionDate" : ISODate("2017-11-09T17:14:53Z"),
                           "self" : true
                  },
                  {
                           "_id" : 2,
                           "name" : "11.120.0.12:27017",
                           "health" : 1,
                           "state" : 2,
                           "stateStr" : "SECONDARY",
                           "uptime" : 418347,
                           "optime" : Timestamp(1510666033, 1),
                           "optimeDate" : ISODate("2017-11-14T13:27:13Z"),
                           "lastHeartbeat" : ISODate("2017-11-14T13:27:13Z"),
                           "lastHeartbeatRecv" : ISODate("2017-11-14T13:27:13Z"),
                           "pingMs" : 0,
                           "syncingTo" : "11.120.0.10:27017"
                  },
                  {
                           "_id" : 3,
                           "name" : "11.120.0.11:27017
                           "health" : 0,
                           "state" : 8,
                           "stateStr" : "(not reachable/healthy)",
                           "uptime" : 0,
                           "optime" : Timestamp(1510610580, 1),
                           "optimeDate" : ISODate("2017-11-13T22:03:00Z"),
                           "lastHeartbeat" : ISODate("2017-11-14T13:27:10Z"),
                           "lastHeartbeatRecv" : ISODate("2017-11-13T22:03:01Z"),
                           "pingMs" : 0,
                           "syncingTo" : "11.120.0.10:27017"               
                  }
         ],
         "ok" : 1
}

Step 2. From the master, delete the failed node with the use of thers.removecommand. Some errors are seen when you execute this command, but check the status once more to find that the node has been removed:

[root@pod2-stack-controller-0 ~]$ mongo --host 11.120.0.10
<snip>
tripleo:PRIMARY> rs.remove('11.120.0.12:27017')
2017-11-16T18:41:04.999+0000 DBClientCursor::init call() failed
2017-11-16T18:41:05.000+0000 Error: error doing query: failed at src/mongo/shell/query.js:81
2017-11-16T18:41:05.001+0000 trying reconnect to 11.120.0.10:27017 (11.120.0.10) failed
2017-11-16T18:41:05.003+0000 reconnect 11.120.0.10:27017 (11.120.0.10) ok

tripleo:PRIMARY> rs.status()
{
         "set" : "tripleo",
         "date" : ISODate("2017-11-16T18:44:11Z"),
         "myState" : 1,
         "members" : [
                  {
                           "_id" : 3,
                           "name" : "11.120.0.11:27017",
                           "health" : 1,
                           "state" : 2,
                           "stateStr" : "SECONDARY",
                           "uptime" : 187,
                           "optime" : Timestamp(1510857848, 3),
                           "optimeDate" : ISODate("2017-11-16T18:44:08Z"),
                           "lastHeartbeat" : ISODate("2017-11-16T18:44:11Z"),
                           "lastHeartbeatRecv" : ISODate("2017-11-16T18:44:09Z"),
                           "pingMs" : 0,
                           "syncingTo" : "11.120.0.10:27017"
                  },
                  {
                           "_id" : 4,
                           "name" : "11.120.0.10:27017",
                           "health" : 1,
                           "state" : 1,
                           "stateStr" : "PRIMARY",
                           "uptime" : 89820,
                           "optime" : Timestamp(1510857848, 3),
                           "optimeDate" : ISODate("2017-11-16T18:44:08Z"),
                           "electionTime" : Timestamp(1510811232, 1),
                           "electionDate" : ISODate("2017-11-16T05:47:12Z"),
                           "self" : true
                  }
         ],
         "ok" : 1
}
tripleo:PRIMARY> exit
bye

Step 3. Run this command in order to update the list of active controller nodes. Include the new controller node in this list:

[root@pod2-stack-controller-0 ~]# sudo pcs resource update galera wsrep_cluster_address=gcomm://pod2-stack-controller-0,pod2-stack-controller-1,pod2-stack-controller-2

Step 4. Copy these files from a controller that already exists to the new controller:

/etc/sysconfig/clustercheck

/root/.my.cnf

On existing controller:

[root@pod2-stack-controller-0 ~]# scp /etc/sysconfig/clustercheck stack@192.200.0.1:/tmp/.
[root@pod2-stack-controller-0 ~]# scp /root/.my.cnf stack@192.200.0.1:/tmp/my.cnf

On new controller:

[root@pod2-stack-controller-3 ~]# cd /etc/sysconfig

[root@pod2-stack-controller-3 sysconfig]# scp stack@192.200.0.1:/tmp/clustercheck .

[root@pod2-stack-controller-3 sysconfig]# cd /root

[root@pod2-stack-controller-3 ~]# scp stack@192.200.0.1:/tmp/my.cnf .my.cnf

Step 5. Run thecluster node addcommand from one of the controllers that already exists:

[root@pod2-stack-controller-1 ~]# sudo pcs cluster node add pod2-stack-controller-3

Disabling SBD service...
pod2-stack-controller-3: sbd disabled
pod2-stack-controller-0: Corosync updated
pod2-stack-controller-1: Corosync updated

Setting up corosync...
pod2-stack-controller-3: Succeeded
Synchronizing pcsd certificates on nodes pod2-stack-controller-3...
pod2-stack-controller-3: Success

Restarting pcsd on the nodes in order to reload the certificates...
pod2-stack-controller-3: Success

Step 6. Log in to each controller and view the file/etc/corosync/corosync.conf. Ensure that the new controller is listed and that thenode idassigned to that controller is the next number in the sequence that has not been previously used. Ensure that this change is done on all 3 controllers:

[root@pod2-stack-controller-1 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    cluster_name: tripleo_cluster
    transport: udpu
    token: 10000
}
nodelist {
    node {
        ring0_addr: pod2-stack-controller-0
        nodeid: 5
    }
    node {
        ring0_addr: pod2-stack-controller-1
        nodeid: 7
    }
    node {
        ring0_addr: pod2-stack-controller-3
        nodeid: 6
    }
}
quorum {
    provider: corosync_votequorum
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

For example/etc/corosync/corosync.confafter modification:


totem { 
    version: 2 
    secauth: off 
    cluster_name: tripleo_cluster 
    transport: udpu 
    token: 10000 
}
nodelist {
    node {
        ring0_addr: pod2-stack-controller-0
        nodeid: 5
    }
    node {
        ring0_addr: pod2-stack-controller-1
        nodeid: 7
    }
    node {
        ring0_addr: pod2-stack-controller-3
        nodeid: 9
    }
}
quorum {
    provider: corosync_votequorum
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

Step 7. Restartcorosyncon the active controllers. Do not startcorosyncon the new controller:

[root@pod2-stack-controller-0 ~]# sudo pcs cluster reload corosync
[root@pod2-stack-controller-1 ~]# sudo pcs cluster reload corosync

Step 8. Start the new controller node from one of the acting controllers:

[root@pod2-stack-controller-1 ~]# sudo pcs cluster start pod2-stack-controller-3

Step 9. Restart Galera from one of the acting controllers:

[root@pod2-stack-controller-1 ~]#  sudo pcs cluster start pod2-stack-controller-3

pod2-stack-controller-0: Starting Cluster...

[root@pod2-stack-controller-1 ~]# sudo pcs resource cleanup galera
Cleaning up galera:0 on pod2-stack-controller-0, removing fail-count-galera
Cleaning up galera:0 on pod2-stack-controller-1, removing fail-count-galera
Cleaning up galera:0 on pod2-stack-controller-3, removing fail-count-galera
  * The configuration prevents the cluster from stopping or starting 'galera-master' (unmanaged)

Waiting for 3 replies from the CRMd... OK

[root@pod2-stack-controller-1 ~]#
[root@pod2-stack-controller-1 ~]# sudo pcs resource manage galera

Step 10. The cluster is in maintenance mode. Disable the maintenance mode in order to get the services to start:

[root@pod2-stack-controller-2 ~]# sudo pcs property set maintenance-mode=false --wait

Step 11. Check the PCs status for Galera until all 3 controllers are listed as masters in Galera:

Note: For large setups, it can take some time to sync DBs.

[root@pod2-stack-controller-1 ~]# sudo pcs status | grep galera -A1 

Master/Slave Set: galera-master [galera]
      Masters: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-3 ]

Step 12. Switch the cluster to maintenance mode:

[root@pod2-stack-controller-1~]# sudo pcs property set maintenance-mode=true --wait

[root@pod2-stack-controller-1 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: pod2-stack-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
 Last updated: Thu Nov 16 19:17:01 2017                  Last change: Thu Nov 16 19:16:48 2017 by root via cibadmin on pod2-stack-controller-1
               *** Resource management is DISABLED ***
   The cluster will not attempt to start, stop or recover services

PCSD Status:
  pod2-stack-controller-3: Online
  pod2-stack-controller-0: Online
  pod2-stack-controller-1: Online

Step 13. Re-run the deploy script that you ran previously. This time it should succeed.

[stack@director ~]$ ./deploy-addController.sh
START with options: [u'overcloud', u'deploy', u'--templates', u'-r', u'/home/stack/custom-templates/custom-roles.yaml', u'-e', u'/usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml', u'-e', u'/usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml', u'-e', u'/usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml', u'-e', u'/usr/share/openstack-tripleo-heat-templates/environments/neutron-sriov.yaml', u'-e', u'/home/stack/custom-templates/network.yaml', u'-e', u'/home/stack/custom-templates/ceph.yaml', u'-e', u'/home/stack/custom-templates/compute.yaml', u'-e', u'/home/stack/custom-templates/layout-removeController.yaml', u'--stack', u'newtonoc', u'--debug', u'--log-file', u'overcloudDeploy_11_14_17__13_53_12.log', u'--neutron-flat-networks', u'phys_pcie1_0,phys_pcie1_1,phys_pcie4_0,phys_pcie4_1', u'--neutron-network-vlan-ranges', u'datacentre:101:200', u'--neutron-disable-tunneling', u'--verbose', u'--timeout', u'180']
options: Namespace(access_key='', access_secret='***', access_token='***', access_token_endpoint='', access_token_type='', aodh_endpoint='', auth_type='', auth_url='https://192.200.0.2:13000/v2.0', authorization_code='', cacert=None, cert='', client_id='', client_secret='***', cloud='', consumer_key='', consumer_secret='***', debug=True, default_domain='default', default_domain_id='', default_domain_name='', deferred_help=False, discovery_endpoint='', domain_id='', domain_name='', endpoint='', identity_provider='', identity_provider_url='', insecure=None, inspector_api_version='1', inspector_url=None, interface='', key='', log_file=u'overcloudDeploy_11_14_17__13_53_12.log', murano_url='', old_profile=None, openid_scope='', os_alarming_api_version='2', os_application_catalog_api_version='1', os_baremetal_api_version='1.15', os_beta_command=False, os_compute_api_version='', os_container_infra_api_version='1', os_data_processing_api_version='1.1', os_data_processing_url='', os_dns_api_version='2', os_identity_api_version='', os_image_api_version='1', os_key_manager_api_version='1', os_metrics_api_version='1', os_network_api_version='', os_object_api_version='', os_orchestration_api_version='1', os_project_id=None, os_project_name=None, os_queues_api_version='2', os_tripleoclient_api_version='1', os_volume_api_version='', os_workflow_api_version='2', passcode='', password='***', profile=None, project_domain_id='', project_domain_name='', project_id='', project_name='admin', protocol='', redirect_uri='', region_name='', roles='', timing=False, token='***', trust_id='', url='', user='', user_domain_id='', user_domain_name='', user_id='', username='admin', verbose_level=3, verify=None)
Auth plugin password selected

Starting new HTTPS connection (1): 192.200.0.2
"POST /v2/action_executions HTTP/1.1" 201 1696
HTTP POST https://192.200.0.2:13989/v2/action_executions 201
Overcloud Endpoint: http://172.25.22.109:5000/v2.0
Overcloud Deployed
clean_up DeployOvercloud:
END return value: 0

real     54m17.197s
user     0m3.421s
sys       0m0.670s

Verify Overcloud Services in the Controller

Ensure that all managed services run properly on the controller nodes.

[heat-admin@pod2-stack-controller-2 ~]$ sudo pcs status

Finalize the L3 Agent Routers

Check the routers in order to ensure that L3 agents are properly hosted. Ensure to source the overcloudrc file when you perform this check.

Step 1. Find the router name:

[stack@director~]$ source corerc
[stack@director ~]$ neutron router-list

+--------------------------------------+------+-------------------------------------------------------------------+-------------+------+

| id                                   | name | external_gateway_info                                             | distributed | ha   |

+--------------------------------------+------+-------------------------------------------------------------------+-------------+------+

| d814dc9d-2b2f-496f-8c25-24911e464d02 | main | {"network_id": "18c4250c-e402-428c-87d6-a955157d50b5",            | False       | True |

In this example, the name of the router is main.

Step 2. List all L3 agents in order to find UUID of failed node and the new node:

[stack@director ~]$ neutron agent-list | grep "neutron-l3-agent"

| 70242f5c-43ab-4355-abd6-9277f92e4ce6 | L3 agent           | pod2-stack-controller-0.localdomain  | nova              | :-)   | True           | neutron-l3-agent          |
| 8d2ffbcb-b6ff-42cd-b5b8-da31d8da8a40 | L3 agent           | pod2-stack-controller-2.localdomain  | nova              | xxx   | True           | neutron-l3-agent          |
| a410a491-e271-4938-8a43-458084ffe15d | L3 agent           | pod2-stack-controller-3.localdomain  | nova              | :-)   | True           | neutron-l3-agent          |
| cb4bc1ad-ac50-42e9-ae69-8a256d375136 | L3 agent           | pod2-stack-controller-1.localdomain  | nova              | :-)   | True           | neutron-l3-agent          |

In this example, L3 agent that corresponds to pod2-stack-controller-2.localdomainshould be removed from the router and the one that corresponds to pod2-stack-controller-3.localdomainshould be added to the router:

[stack@director ~]$ neutron l3-agent-router-remove 8d2ffbcb-b6ff-42cd-b5b8-da31d8da8a40  main

Removed router main from L3 agent

[stack@director ~]$ neutron l3-agent-router-add a410a491-e271-4938-8a43-458084ffe15d main

Added router main to L3 agent

Step 3. Check updated list of L3-agents:

[stack@director ~]$ neutron l3-agent-list-hosting-router main

+--------------------------------------+-----------------------------------+----------------+-------+----------+
| id                                   | host                              | admin_state_up | alive | ha_state |
+--------------------------------------+-----------------------------------+----------------+-------+----------+

| 70242f5c-43ab-4355-abd6-9277f92e4ce6 | pod2-stack-controller-0.localdomain | True           | :-)   | standby  |
| a410a491-e271-4938-8a43-458084ffe15d | pod2-stack-controller-3.localdomain | True           | :-)   | standby  |
| cb4bc1ad-ac50-42e9-ae69-8a256d375136 | pod2-stack-controller-1.localdomain | True           | :-)   | active   |
+--------------------------------------+-----------------------------------+----------------+-------+----------+

Step 4. List any services that run from the removed controller node and remove them:

[stack@director ~]$ neutron  agent-list | grep controller-2

| 877314c2-3c8d-4666-a6ec-69513e83042d | Metadata agent     | pod2-stack-controller-2.localdomain  |                   | xxx   | True           | neutron-metadata-agent    |
| 8d2ffbcb-b6ff-42cd-b5b8-da31d8da8a40 | L3 agent           | pod2-stack-controller-2.localdomain  | nova              | xxx   | True           | neutron-l3-agent          |
| 911c43a5-df3a-49ec-99ed-1d722821ec20 | DHCP agent         | pod2-stack-controller-2.localdomain  | nova              | xxx   | True           | neutron-dhcp-agent        |
| a58a3dd3-4cdc-48d4-ab34-612a6cd72768 | Open vSwitch agent | pod2-stack-controller-2.localdomain  |                   | xxx   | True           | neutron-openvswitch-agent |

[stack@director ~]$ neutron agent-delete 877314c2-3c8d-4666-a6ec-69513e83042d 
Deleted agent(s): 877314c2-3c8d-4666-a6ec-69513e83042d 
[stack@director ~]$ neutron agent-delete 8d2ffbcb-b6ff-42cd-b5b8-da31d8da8a40
Deleted agent(s): 8d2ffbcb-b6ff-42cd-b5b8-da31d8da8a40 
[stack@director ~]$ neutron agent-delete 911c43a5-df3a-49ec-99ed-1d722821ec20
Deleted agent(s): 911c43a5-df3a-49ec-99ed-1d722821ec20 
[stack@director ~]$ neutron agent-delete a58a3dd3-4cdc-48d4-ab34-612a6cd72768 
Deleted agent(s): a58a3dd3-4cdc-48d4-ab34-612a6cd72768 

[stack@director ~]$ neutron agent-list | grep controller-2 
[stack@director ~]$

Finalize Compute Services

Step 1. Checknova service-listitems left from the removed node and delete them:

[stack@director ~]$ nova service-list | grep controller-2

| 615 | nova-consoleauth | pod2-stack-controller-2.localdomain  | internal          | enabled | down  | 2017-11-16T16:08:14.000000 | -               |
| 618 | nova-scheduler   | pod2-stack-controller-2.localdomain  | internal          | enabled | down  | 2017-11-16T16:08:13.000000 | -               |
| 621 | nova-conductor   | pod2-stack-controller-2.localdomain  | internal          | enabled | down  | 2017-11-16T16:08:14.000000 | -            

[stack@director ~]$ nova service-delete 615
[stack@director ~]$ nova service-delete 618
[stack@director ~]$ nova service-delete 621

stack@director ~]$ nova service-list | grep controller-2

Step 2. Ensure that theconsoleauthprocess runs on all the controllers or restart it with the use of this command:pcs resource restart openstack-nova-consoleauth:

[stack@director ~]$ nova service-list | grep consoleauth

| 601 | nova-consoleauth | pod2-stack-controller-0.localdomain  | internal          | enabled | up    | 2017-11-16T21:00:10.000000 | -               |
| 608 | nova-consoleauth | pod2-stack-controller-1.localdomain  | internal          | enabled | up    | 2017-11-16T21:00:13.000000 | -               |
| 622 | nova-consoleauth | pod2-stack-controller-3.localdomain  | internal          | enabled | up    | 2017-11-16T21:00:13.000000 | -

Restart Fencing on the Controller Nodes

Step 1. Check all controllers for IP route to the undercloud 192.0.0.0/8:

[root@pod2-stack-controller-3 ~]# ip route
default via 10.225.247.203 dev vlan101
10.225.247.128/25 dev vlan101 proto kernel scope link src 10.225.247.212
11.117.0.0/24 dev vlan17 proto kernel scope link src 11.117.0.10
11.118.0.0/24 dev vlan18 proto kernel scope link src 11.118.0.10
11.119.0.0/24 dev vlan19 proto kernel scope link src 11.119.0.10
11.120.0.0/24 dev vlan20 proto kernel scope link src 11.120.0.10
169.254.169.254 via 192.200.0.1 dev eno1
192.200.0.0/24 dev eno1 proto kernel scope link src 192.200.0.113

Step 2. Check the currentstonithconfiguration. Remove any reference to the old controller node:

[root@pod2-stack-controller-3 ~]# sudo pcs stonith show --full
 Resource: my-ipmilan-for-controller-6 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=pod2-stack-controller-1 ipaddr=192.100.0.1 login=admin passwd=Csco@123Starent lanplus=1
  Operations: monitor interval=60s (my-ipmilan-for-controller-6-monitor-interval-60s)
 Resource: my-ipmilan-for-controller-4 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=pod2-stack-controller-0 ipaddr=192.100.0.14 login=admin passwd=Csco@123Starent lanplus=1
  Operations: monitor interval=60s (my-ipmilan-for-controller-4-monitor-interval-60s)
Resource: my-ipmilan-for-controller-7 (class=stonith type=fence_ipmilan)
  Attributes: pcmk_host_list=pod2-stack-controller-2 ipaddr=192.100.0.15 login=admin passwd=Csco@123Starent lanplus=1
  Operations: monitor interval=60s (my-ipmilan-for-controller-7-monitor-interval-60s)

[root@pod2-stack-controller-3 ~]# pcs stonith delete my-ipmilan-for-controller-7
Attempting to stop: my-ipmilan-for-controller-7...Stopped

Step 3. Addstonithconfiguration for new controller:

[root@pod2-stack-controller-3 ~]sudo pcs stonith create my-ipmilan-for-controller-8 fence_ipmilan pcmk_host_list=pod2-stack-controller-3 ipaddr=<CIMC_IP> login=admin passwd=<PASSWORD> lanplus=1 op monitor interval=60s

Step 4. Restart fencing from any controller and verify the status:

[root@pod2-stack-controller-1 ~]# sudo pcs property set stonith-enabled=true 
[root@pod2-stack-controller-3 ~]# pcs status

<snip>
my-ipmilan-for-controller-1 (stonith:fence_ipmilan): Started pod2-stack-controller-3 
my-ipmilan-for-controller-0 (stonith:fence_ipmilan): Started pod2-stack-controller-3 
my-ipmilan-for-controller-3 (stonith:fence_ipmilan): Started pod2-stack-controller-3

Contributed by Cisco Engineers

Karthikeyan Dachanamoorthy
Cisco Advance Services
Harshita Bhardwaj
Cisco Advance Services

Was this Document Helpful?

Feedback

Contact Cisco

Open a Support Case
(Requires a Cisco Service Contract)

This Document Applies to These Products

Prime Access Registrar