Monitoring, Best Practices, Caveats, and Troubleshooting
This chapter includes the following major topics:
•InMage Resource Monitoring
•Implementation Best Practices
•Caveats
•Troubleshooting
InMage Resource Monitoring
InMage components are deployed at both the SP and the enterprise. Depending on where the components reside, resource utilization can be monitored with a combination of tools. The VMDC CLSA team has done extensive work in the area of service assurance in an SP VMDC cloud; interested readers should refer to the VMDC CLSA documentation for additional details. On the Enterprise side, if an Enterprise already operates a comprehensive, well-managed infrastructure and systems monitoring program, one that enables proactive monitoring and better infrastructure planning decisions based on historical trending and detailed usage analysis, we recommend simply incorporating the InMage components into that existing monitoring framework. This section does not intend to repeat previous CLSA recommendations or provide guidance on Enterprise end-to-end monitoring; instead, it focuses on specific metrics that an Enterprise or SP can gather based on our lab implementation.
Storage IOPS, I/O size, WAN bandwidth utilization, CPU usage on the primary server, CPU usage on the processing server, and RPO are all important metrics to monitor for performance and capacity planning. Most of these statistics are available directly from InMage or can be accessed through the environment that InMage connects to:
•WAN Bandwidth: InMage CX-CS server, vCenter statistics (V2V), NetFlow (P2V)
•LAN Bandwidth Per Virtual/Physical Machine: InMage CX-CS server, vCenter statistics (V2V), NetFlow (P2V)
•CPU (PS): InMage CX-CS server, vCenter statistics (V2V), SNMP/SSH
•CPU (Agent): vCenter (V2V), Windows PerfMon, SNMP/SSH
•RPO: InMage
•I/O: PerfMon, iostat, and the drvutil utility from InMage
This section focuses on statistics monitoring using InMage. vCenter monitoring is well documented; refer to vSphere Performance Monitoring for monitoring details and operations. NetFlow monitoring can be deployed at key observation points such as the server access layer, FabricPath domains, and the WAN to gain visibility into LAN/WAN bandwidth and application performance. The Cisco NetFlow Generation Appliance (NGA) 3240 introduces a highly scalable, cost-effective architecture for cross-device flow generation in today's high-performance data centers. Refer to the NGA 3240 data sheet for details.
Monitoring topics covered in this section:
•Bandwidth Monitoring
•Scout Server Health Monitoring
•RPO and Health Monitoring
•I/O Monitoring
Bandwidth Monitoring
LAN/WAN bandwidth reports, although in different formats, can be generated directly from the CX-CS server or from the RX server for a particular customer/tenant. From the CX-CS UI, statistics are grouped based on roles:
Network traffic statistics for all ScoutOS-based devices (the PS and CX server) are available from the CX by accessing Monitor > Network Traffic Report. Statistics are stored in RRD databases maintained by RRDtool and are available in 24-hour, week, month, and year intervals. As in most RRD implementations, statistics are more granular for the 24-hour interval and less granular for older statistics. Figure 5-1 is an example of the network traffic rate for a PS server.
Figure 5-1 Sample Network Traffic Rate for a PS Server
Network traffic statistics for Unified Agent-based devices (the source and MT server) are available from the CX by accessing Monitor > Bandwidth Report. Similar to the ScoutOS statistics, data are stored in RRD databases maintained by RRDtool. Daily aggregate statistics are also available. Figure 5-2 is a sample daily bandwidth report.
Figure 5-2 Sample Daily Bandwidth Report
The RX UI reports only aggregate network traffic from Unified Agents based on a time interval. Traffic reports for the PS and CX are not available from the RX.
Although the majority of the performance statistics are not directly exportable from the GUI, fetching data directly from the RRD database is simple and straightforward. The following is an example of fetching bandwidth data from Jul 24 2013 07:46:18 to 08:03:48:
[root@sp-t10-ps-1 052E4A5E-8195-1341-90A49C18364A0532]# pwd
/home/svsystems/052E4A5E-8195-1341-90A49C18364A0532
[root@sp-t10-ps-1 052E4A5E-8195-1341-90A49C18364A0532]# rrdtool fetch bandwidth.rrd AVERAGE --start 1374651978 --end 1374653028
in out
1374652200: 0.0000000000e+00 8.1258306408e+08
1374652500: 0.0000000000e+00 2.2297785464e+08
1374652800: 0.0000000000e+00 7.3103023488e+08
1374653100: 0.0000000000e+00 4.4805078912e+08
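The --start and --end arguments are Unix epoch timestamps. Assuming GNU date is available on the ScoutOS shell (standard on CentOS), human-readable times can be converted inline rather than computed by hand; a minimal sketch (note that date -d interprets the time in the server's local timezone):

rrdtool fetch bandwidth.rrd AVERAGE --start $(date -d "Jul 24 2013 07:46:18" +%s) --end $(date -d "Jul 24 2013 08:03:48" +%s)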
Customized graphs can be generated easily as well using the RRDtool:
[root@sp-t10-ps-1 052E4A5E-8195-1341-90A49C18364A0532]# rrdtool graph xgao.png --start 1374651978 --end 1374653028 DEF:myxgao=bandwidth.rrd:out:AVERAGE LINE2:myxgao#FF0000
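If the data needs to feed an external tool rather than a PNG, rrdtool xport emits the same series as structured XML. A minimal sketch against the same database and time range:

rrdtool xport --start 1374651978 --end 1374653028 DEF:out=bandwidth.rrd:out:AVERAGE XPORT:out:"bytes out"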
Scout Server Health Monitoring
Scout Server statistics for CPU, memory, disk, and free space are directly available from the CX-CS portal. Statistics are displayed in near real time on the CX-CS UI dashboard. Historical performance data are kept in round-robin databases (RRD) maintained by RRDtool, similar to the bandwidth reports.
In a single-core system, the system load (load average) should always be below 1.0, meaning that when a process asks for CPU time it gets it without having to wait. Keep in mind that PS/CS servers are typically deployed with multiple cores. When monitoring such a system, the rule of thumb is that the maximum load should not exceed the number of cores. Use the following command to determine the number of cores on your system:
[root@sp-t10-ps-1 052E4A5E-8195-1341-90A49C18364A0532]# cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
processor : 2
processor : 3
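On systems with a reasonably recent coreutils (8.1 or later), nproc returns the same count directly:

[root@sp-t10-ps-1 052E4A5E-8195-1341-90A49C18364A0532]# nproc
4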
CPU load, or CPU percent, is the amount of time in an interval that the system's processes were found to be active on the CPU. While it can be a good indicator of overall utilization when used in conjunction with system load, remember that CPU percent is only a snapshot of usage at the time of measurement; this statistic alone is not a good indication of overall utilization.
Monitoring memory usage on a Linux system can be tricky because RAM stores not only user application data, but also kernel data and cached copies of data stored on disk for fast access (page cache). The page cache can consume a large amount of memory: any time a file is read, its data goes into memory in the form of page cache. The inode and buffer caches are kernel data cached in memory. In a typical system it is perfectly normal to see memory usage increase linearly over time, as in Figure 5-3:
Figure 5-3 Memory Usage
When memory usage reaches a watermark, the kernel starts to reclaim memory from the different caches described above. Swap statistics can be a good indication of overall memory health. vmstat can be used to monitor swap usage:
[root@sp-t10-ps-1 052E4A5E-8195-1341-90A49C18364A0532]# vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
"si" and "so" are abbreviations for "swap in" and "swap out", the size of the swap can indicate the health of the system:
1. small "si" and "so" is normal, available memory is sufficient to deal with new allocation requests.
2. large "si" and small "so" is normal as well.
3. small "si" and large "so" indicates that when additional memory is needed, system is able to reclaim / allocate sufficient memory.
4. large "si" and "so" indicates system is "thrashing" and we are running out on memory Large "so" and "si" should be avoided. Small swap in and large swap out considered to be normal.
Dirty writes are another metric that indicates whether the system is running low on memory. This information can be obtained from the /proc/meminfo file:
[root@sp-t10-ps-1 052E4A5E-8195-1341-90A49C18364A0532]# more /proc/meminfo | grep Dirty
Dirty: 704 kB
[root@sp-t10-ps-1 052E4A5E-8195-1341-90A49C18364A0532]#
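To observe the trend over time rather than a single snapshot, the same counters can be sampled continuously with watch (part of the procps package):

watch -n 5 'grep -E "Dirty|Writeback" /proc/meminfo'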
Disk activity should align with the underlying storage configuration. Refer to the "Implementation Best Practices" section.
RPO and Health Monitoring
Recovery Point Objective (RPO) is the maximum amount of data loss tolerated during disaster recovery. Depending on the data change rate, WAN, and storage, RPO values can fluctuate. InMage, by default, attempts to maintain close to zero RPO through CDP. In reality, the achievable RPO is a function of available resources at both the primary and secondary data centers. Monitoring real-time RPO is simple using InMage CX-CS or RX. Achievable RPO is reported per volume for each virtual/physical server under protection. In addition to real-time reporting, InMage also provides historical trending as part of the Health Report.
Real-time RPO can be monitored directly from the CX-CS Dashboard under protection details on the Plan Summary > Protection Details page, as shown in Figure 5-4.
Figure 5-4 RPO Protection Details
If the reported RPO is worse than the target RPO, it is important to find out where the bottleneck may reside. The "Differential Data in Transit" column in Figure 5-4 is designed to quickly identify potential bottlenecks:
•If the majority of the differential data resides on the Primary Server, the bottleneck is most likely within the primary site. Health of the Primary server, campus LAN connectivity, and Processing server health should be investigated.
•If the majority of the differential data resides on the CX-PS server, this could be an indication of WAN congestion or CX-PS server processing delay.
•If the majority of the differential data resides on the secondary server, then the health of the MT needs to be investigated. It is possible the MT cannot process the incoming data fast enough, or the recovery site storage is not sufficient to keep up with the change rate.
•If the majority of the differential data resides on both the CX-PS and secondary server, then the issue may be due to small I/O size, low compression ratio, or slow data draining at the target. Refer to "I/O Monitoring" section for details.
The Health Report provides a historical view of all volumes under protection. Data are sorted by date and by server. It displays health details such as data changes with and without compression, available retention window in days, RPO threshold and violations, available consistency points, and so on. Refer to Figure 5-5 for a complete list of options.
Figure 5-5 RPO Health Report
Key parameters to monitor are:
•Retention Window Available: If the available value is less than the configured value, this may be an indication that the actual change rate exceeds the allocated retention journal space. Refer to Retention Volume Sizing, page 3-3 for additional details. For a time-based retention policy, increasing the size of the MT retention volume may provide temporary relief.
•Maximum RPO: If the maximum RPO reached exceeds the RPO threshold, this could be an indication of a WAN or storage delay/bottleneck. Refer to the previous discussion on real-time RPO monitoring for details.
•Data Flow Controlled: This occurs when the incoming change rate from the primary server exceeds the 8 GB buffer allocated on the PS server. When in the data flow-controlled state, instead of sending all change blocks to the process server, the primary server caches all changes locally in memory (up to 250 MB). Once the PS buffer utilization drops below 8 GB, the normal exchange of data between the primary and PS server resumes. The flow-controlled state is also known as the thrashing state; it is intended to give the PS server sufficient time to process the data it already has by backing off the primary server. This condition occurs if there is insufficient WAN bandwidth or insufficient storage resources at the CX-CS / MT. Refer to the "Scout Server Storage and Compute Implementation" section on page 3-6 for details.
•Available Consistency Points: The available consistency points should align with the retention window available. If the retention window has issues, there will be similar issues for available consistency points. Under rare circumstances where the retention window is normal but available consistency points are low or do not exist, confirm that the proper version of VSS is installed with all required OS patches, remove any third-party backup software, and finally confirm that sufficient disk space is available on the primary server; VSS requires at least 650 MB of free disk space. The CX-CS log also keeps track of failure reasons; refer to the "Configuration and Processing Server Logging" section for additional details.
The Health Report is available from both the CX-CS UI and the RX.
I/O Monitoring
Applications generally vary greatly in their I/O, which can also change over time as the workload changes. Understanding the I/O is critical to guaranteeing application performance. The first things to understand are how much I/O the application under protection is doing, the total I/Os per second (IOPS), and the consistency of the I/O load:
•Is it constant, or bursty?
•If the load is bursty, what is the size of the largest spike, and what are the duration and frequency of the spikes?
•Do the spikes occur predictably, or at random?
•What is the growth trend of existing workloads under protection?
•Does IOPS growth correlate with capacity growth, or is performance growth out-pacing capacity growth?
Understanding the I/O size is also critical: some applications do small-block I/O (less than 64 KB), while others do large streaming I/Os, sequentially writing MBs of data to disk. There can be differences even within the same application. Take MS-SQL for example: online transaction processing (OLTP) workloads tend to select a small number of rows at a time, so these transfers are fairly small in size. Data warehouse applications tend to access large portions of the data at a time, resulting in larger I/O sizes than OLTP workloads. I/O size is important because it impacts storage bandwidth and latency: 1K IOPS with a 4 KB I/O size results in 4 MB/s of storage throughput, while a 64 KB I/O size drives 16x the bandwidth. As I/O size increases, the amount of time to write the data to disk (latency) also increases.
The information in Table 5-2 needs to be obtained before onboarding a new customer and should be constantly monitored for existing customers.
In a Windows-based system, the counters above are available in PerfMon. For more information about performance monitoring on your specific version of Windows, refer to the Microsoft support sites. Linux hosts can use iostat to gather similar performance statistics; refer to Monitoring Storage with Iostat for additional details.
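As a starting point on Linux, the extended iostat output covers most of the I/O counters discussed above. A sample invocation, assuming the sysstat package is installed (5-second interval, 3 samples):

iostat -xk 5 3

In the per-device output, r/s and w/s give read/write IOPS, rkB/s and wkB/s give throughput, avgrq-sz gives the average I/O size in 512-byte sectors, and await gives the average I/O latency in milliseconds.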
Implementation Best Practices
Master Target Total VMDK
A maximum of 60 VMDKs/RDMs can be attached to a MT, of which at least three are already in use (MT OS, cache, and retention VMDK). That leaves 57 slots free. The following factors should be considered when determining the maximum number of VMDKs a single MT should protect:
•DR Drill: When performing a drill, InMage attaches all the disk snapshots of a VM to the MT. That is, if one of the VMs being protected to the MT has three disks, then you will need at least three SCSI IDs open to be able to perform a drill for that VM.
•Future Growth: Application growth; for example, a move from a medium SharePoint deployment to a large deployment. The total allocated SCSI slots should not exceed 25-30 (MT OS and retention included), leaving roughly thirty open for DR drills if the desire is to perform a DR drill for all VMs mapped to a MT at once, as opposed to a few at a time. As a best practice, InMage recommends not exceeding forty SCSI slots on a MT; the remaining slots are reserved for disk/VM additions or DR drills.
Master Target Cache
As a design guideline, 500 MB of disk space per VMDK under protection should be reserved on the cache volume. The total size of the cache volume is a function of the total number of volumes under protection:
Size of Cache Volume = (Total number of volumes under protection) * (500MB per volume)
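For example, under this guideline a MT protecting 40 volumes would need a cache volume of 40 * 500 MB, or roughly 20 GB.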
Recovery Plan
There is one recovery plan per protection plan. To achieve RTO SLAs for large environments, you can have pre-defined recovery plans created ahead of time with the "Recover later" option and trigger them all at the same time from within vContinuum or the Scout server UI.
Scout Server
To properly tune the system, monitor the data in transit per volume and the hourly data change history. If a particular server is generating a large change rate, or if there is a WAN/storage bottleneck, it is recommended to proactively increase the cache disk on the processing server. The main intent behind increasing the cache thresholds is to buffer data on the process server instead of on the sources.
Retention Policy
At the time of protection you can specify a space-based or time-based retention policy, or both. You can change this policy whenever needed from the CX GUI. Best practice for the consistency interval and retention length depends on the following factors:
•Disk space available for retention.
•Number of servers and change rate from those servers assigned to retention drive.
•Nearest point you can go back to and recover data from, using bookmarks or tags, for application servers.
Change Rate | Retention Size | Disk Speed (RPM) | Number of Disks | RAID Type
---|---|---|---|---
300GB | 390GB | 10K/15K | 8 | RAID 1+0
700GB | 790GB | 10K/15K | 12 | RAID 1+0
1TB | 790GB | 10K/15K | 24 | RAID 1+0
Storage Array Sizing
Profile the I/O characteristics of all customer workloads to determine the type of storage configuration required in the SP's cloud. Classify the workload into the following:
•Steady State
•Customer Onboard
•DR Recovery / Drill
For each use case, characterize the worst-case number of VMs, the read/write ratio, and the average I/O size. As an example:
•Workload 1: Normal Operation
–Maximum of 250 VMs
–Each VM requires 96 IOPS on SP-provided storage average
–Storage is characterized as 33% read / 33% random write / 34% sequential write
–Average read/write size is 20KB
–75GB of space required per VM
•Workload 2: Onboarding
–Maximum of 3 VMs at any given time
–Each VM requires 80 IOPS average
–100% sequential write
–Average read/write size is greater than 16KB
–75GB of space required per VM
•Workload 3: DR Operation / Recovery
–Maximum of 250 VMs
–Each VM requires 96 IOPS on SP-provided storage average
–Storage is characterized as 67% read / 33% random write
–Average block size is 20KB
–75GB space required per VM
•Workload 4: Mixed Normal Operation and Recovery
–Maximum of 250 VMs
–Each VM requires 96 IOPS on SP-provided storage average
–Storage is characterized as 50% read / 25% random write / 25% random read
–Average block size is 20KB
–75GB space required per VM
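To translate such a profile into raw array requirements, aggregate the IOPS and apply the RAID write penalty. For Workload 1 above: 250 VMs * 96 IOPS = 24,000 front-end IOPS, of which roughly 33% (about 7,900) are reads and 67% (about 16,100) are writes; assuming a RAID 1+0 write penalty of 2, the back end must sustain approximately 7,900 + 2 * 16,100, or about 40,000 disk IOPS.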
Based on each workload characteristic and the storage vendor, determine whether combinations of SSD / SAS / SATA could fit into your environment. Both FlexPod and Vblock offer auto-tiering, but there are some major differences in terms of implementation.
•VMAX:
–Performance Time Window: 24/7/365 for continuous analysis
–Data Movement Window: 24/7/365 for continuous data movement
–Workload Analysis Period: 24/7/365 for continuous analysis
–Initial Analysis Period: Can be configured to be between 2 hours and 4 weeks; the default is 8 hours.
–FAST-VP Relocation Rate: 1 to 10, 768KB chunks
–Promotion to Cache: Immediate
•VNX:
–Performance Time Window: 24/7/365 for continuous analysis
–Data Movement Window: Once every 24 hours
–Workload Analysis Period: Once an hour
–Initial Analysis Period: Once an hour
–FAST-VP Relocation Rate: Low/Medium/High, 1GB chunk
–Promotion to Cache: Immediate
•NetApp:
–Performance Time Window: 24/7/365 for continuous analysis
–Data Movement Window: 24/7/365 for continuous data movement
–Workload Analysis Period: 24/7/365 for continuous analysis
–Promotion to Cache: Immediate
–Random Write < 16K IO size - SSD
–Random Write > 16K IO Size - SAS /SATA
–Sequential Writes - SAS / SATA
Depending on the storage platform and vendor, I/O size can influence whether the write cache can be optimized. The data movement window can influence whether a special onboarding strategy needs to be implemented. Check with the storage vendor to determine the optimal storage configuration.
InMage Interaction with MS SQL
Replication and backup products truncate application logs through VSS. However, the VSS writer implementation for MS SQL does not expose the ability to truncate logs. This differs from the behavior of the MS Exchange VSS writer, for example, which exposes an API to truncate logs. As a result, InMage does not have a way to truncate MS SQL logs. For database applications, the recommendation is to continue native application backups on a regular basis to keep log sizes under control.
Application Consistency
In situations where Windows 2003 (Base, SP1) source machines have applications that require application quiesce, it is strongly suggested to upgrade to Windows 2003 SP2 to overcome the VSS-related errors.
vSphere Version
To provide failover from enterprise to SP, the secondary (SP) vSphere version should be the same as or higher than the source (enterprise) vSphere server. To perform a failback from SP to Enterprise, the enterprise vSphere version should be the same as or higher than the SP vSphere. The vSphere server may need to be upgraded if failback is required.
OS Recommendations
For new installations, InMage recommends:
•Secondary ESXi Platform: ESXi 5.1
•MT Platform for Windows: Windows 2008 R2 Enterprise Edition
•MT Platform for Linux: CentOS 6.2 64-bit
•CX Scout OS: CentOS 6.2 64-bit
•vContinuum: Windows 2008 R2
To protect a Windows 2012 VM with the ReFS filesystem, a matching Windows 2012 MT is required.
SSH Client
Linux bulk agent install requires SSH access originating from the primary server. Ensure an SSH client is installed on all Linux primary servers.
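On a CentOS/RHEL-based primary server, a quick way to verify the client is present, and install it if it is not, is:

which ssh || yum install -y openssh-clients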
Tenant Self Service
One way to achieve complete self-service capability is to allow tenants to have access to the tenant-specific vContinuum server. For security reasons, you may prefer to create a non-administrator role on the vCenter for each tenant vContinuum user. The following privileges should be selected:
–Datacenter
–Datastore
–Folder
–Host
–Network
–Resource
–Storage views
–Virtual machine
–vSphere Distributed Switch
Using vSphere RBAC, assign tenant-specific vSphere resources to the newly created tenant user/role.
Upgrade Sequence
General upgrade sequence follows:
•Upgrade the CX-CS.
•Upgrade the processing server.
•Upgrade the MT.
•Upgrade the agent on source physical or virtual server. Use the CX-CS UI instead of vContinuum to upgrade the agents.
Always refer to the release notes for the exact sequence.
Caveats
Refer to the following InMage documents:
•InMage RX Release Notes and Compatibility Matrix Documents
http://support.inmage.net/partner/poc_blds/14_May_2013/Docs/RX/
•InMage CX Release Notes and Compatibility Matrix Documents
http://support.inmage.net/partner/poc_blds/14_May_2013/Docs/Scout/
•InMage vContinuum Release Notes and Compatibility Matrix Documents
http://support.inmage.net/partner/poc_blds/14_May_2013/Docs/vContinuum/
Disk Removal
vContinuum does not support disk removal. To remove a disk, first remove the VM from the protection plan and then add the VM, without the removed disk, back into the protection plan. This operation requires a complete resync between the enterprise and the service provider.
Storage Over Allocation
vContinuum does not check for available capacity on the MT retention drive at the time of initial protection, so it is possible to reserve capacity beyond what is available. No alarms are generated for this condition; use the following procedure to confirm whether you are running into it:
1. Create a new protection plan from the vContinuum.
2. Select any random Primary VM.
3. Select secondary ESX host.
4. When selecting data stores, confirm if the retention drive is available as an option.
If the retention drive is available, the capacity has not been over-subscribed. If the OS volume is the only available option, it is strongly recommended to manually reduce the retention size on the protected VMs from the CX UI.
Protection Plan
A protection plan can only map to a single MT. No mechanisms exist to migrate a protection plan between MTs.
Continuous Disk Error
Linux MT continuously generates the following error:
Feb 26 11:08:32 mtarget-lx-22 multipathd: 36000c29f5decda4811ca2c34f29a9fdc: sdf - directio checker reports path is down
Feb 26 11:08:32 mtarget-lx-22 kernel: sd 4:0:0:0: [sdf] Unhandled error code
Feb 26 11:08:32 mtarget-lx-22 kernel: sd 4:0:0:0: [sdf] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Feb 26 11:08:32 mtarget-lx-22 kernel: sd 4:0:0:0: [sdf] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
•Root Cause: Each MT can have up to four SCSI controllers, and each controller can have up to 15 disks. When adding a new SCSI controller, VMware requires a disk to be associated with the controller. Since disks are only added at protection time, the InMage workaround is to associate a dummy disk with the SCSI controller and delete the dummy disk once the controller is added. When the OS later attempts to acquire locks on disks that no longer exist, it generates the continuous error log shown above. The error is cosmetic in nature and can safely be ignored.
CX Integration with RX
If a CX-CS server is configured with dual NICs (one to communicate with the primary server and a second to communicate with the RX), use the push method instead of pull when adding the CX-CS to the RX. This is a known limitation.
VMware Tools
For virtual-to-virtual protection, up-to-date VMware Tools are required for InMage. vContinuum will not add a server to a protection plan if VMware Tools is not running or is out of date.
Statistics Export
The network traffic rates cannot be exported from the CX-CS UI; they are only available from the dashboard as daily, monthly, and yearly graphs. However, it is possible to access the raw data (RRD format) directly on the CS/PS server.
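If the raw data needs to be moved off the server for offline processing, the entire RRD archive can be serialized to XML with rrdtool dump, for example:

rrdtool dump bandwidth.rrd > bandwidth.xml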
Agent Status not reflected in CX Alerts and Notifications Window
The CX-CS server does not raise any alarms in the CX Alerts and Notifications window when a source machine falls into bitmap mode. A machine in bitmap mode cannot be recovered in the SP VPC.
Recovery Plan
1. Tenant-created recovery plans are not visible from vContinuum until execution. Once executed, recovery status can be monitored from vContinuum.
2. SP-created recovery plans (recover later) from vContinuum are intended to be SP managed and not visible to the tenant via multi-tenant portal.
3. Protect > Manage Protected Files/Folders is the single source of truth; both tenant-created and SP-created recovery plans can be viewed there.
For recover-later plans, outside of the Multi-Tenant Portal there is no mechanism to view the VMs included in a recovery plan. The assumption is that a recovery plan maps exactly to a protection plan; if the protection plan changes, a new recovery plan should be created.
Protection Plan Span Across Multiple Master Targets
All disks of the protected VMs have to reside on a single MT and cannot be spanned over multiple MTs.
vMotion
vMotion of a MT from one ESX host to another is supported. Storage vMotion of a MT will not work due to UUID changes.
Multi-tenant Portal
The multi-tenant portal does not support protection plan modifications. The following are InMage roadmap items:
•Add disk to a protected VM (Windows / Linux).
•Add VM (Windows / Linux) to an existing plan.
•Add physical servers (Windows / Linux) to an existing protection plan.
Portal Update
The default sync interval between the CX-CS and RX is 10 minutes, so the tenant portal can lag behind the CX-CS UI by up to 10 minutes.
vContinuum Agent Install
The vContinuum agent install wizard lacks the following capabilities:
•Indicating whether a server already has agents installed.
•Selecting which network interface to use to push the agent.
•NAT configuration for servers with multiple IPs.
vContinuum agent install may fail on servers with multiple NICs; the workaround is to install the agent manually.
VM Name
During Windows failback, the VM name configured in the primary vCenter may be modified to match the VM name in the secondary site. This issue is not observed on Linux machines.
Recovery Plan
If a disaster recovery is launched from the CX portal, depending on the sequence of execution, a successful recovery job may be reported as failed. This is a UI reporting issue. To ensure accurate recovery job reporting, use vContinuum to initiate the recovery job.
Failback Readiness Check
When performing bulk failback recovery from the secondary site to the primary site using tag consistency, it is possible that the readiness check passes even when a portion of the VM(s) have not reached tag-consistent status. The workaround is to visually inspect the vContinuum log and ensure there are no errors before proceeding with the live failback.
Recovery Readiness Check from the RX Portal
RX does not validate vCenter access as part of recovery readiness check.
Documented Procedure for Extending a Linux Volume/Disk
The procedure for extending a Linux volume/disk is missing in the online documentation. The procedure is documented in Appendix B Extending a Linux Volume.
vContinuum V2P Recover Does Not Complete
The vContinuum wizard V2P recovery plan fails to reboot the physical server after the recovery is complete. At this point in the recovery plan the recovery is complete, yet the recovery script never finishes. Reboot the physical server from the console to complete the V2P recovery plan.
Linux P2V and V2P Recover Fails to Bring Up Ethernet Interface
The P2V and V2P recovery plans for a Linux physical server fail to restore the network configuration. The interface network configuration can be recovered by entering the service network restart command. However, this workaround is not persistent; the service network restart command must be reentered after each reboot.
V2P Failback Plan Requires Resync
The V2P failback for a Linux physical server requires a resync immediately after volume replication is complete. Once the volume replication achieves differential sync status, use the CX UI to manually resync the volume replications.
Recovery Plan Readiness Check
The vContinuum wizard does not fail the recovery plan when a primary VM in the recovery plan fails the readiness check. Instead, the vContinuum wizard removes any primary VMs that fail the readiness check from the recovery plan and allows the user to proceed.
In situations where a large number of primary VMs are being recovered, the user may not be aware that some primary VMs in the recovery plan failed the readiness check. This can happen when the primary VM that fails the readiness check is not visible within the current vContinuum window and the user must scroll down to see the error.
Troubleshooting
Troubleshooting of DR events, both successful and failed, is critical to maintaining business continuity. In an environment where business service level agreements (SLAs) are tied to speed of recovery, effective troubleshooting is an integral part of accomplishing those goals. Successful troubleshooting starts with the ability to trace an event from beginning to end, starting from the highest level and drilling down into each component. The ability to correlate multiple events, and the presentation of those events, is a fundamental requirement.
Configuring and setting up a protection plan is a multi-step process; it starts with tenant-side discovery and reaches steady state with Differential Sync. Table 5-4 provides a brief workflow description, the software components involved, and the relevant log files to review when a failure occurs.
This section introduces important log files and dependencies for InMage components:
•VMware Tools
•vContinuum Logging
•Configuration and Processing Server Logging
•InMage Tools
VMware Tools
For V2V protection, VMware Tools is required for source-side discovery. To confirm that VMware Tools is running on Windows, log onto the source server and, under Services, confirm that the VMware Tools service is started, as shown in Figure 5-6.
Figure 5-6 VMware Tools
On Linux, issue:
ps -ef | grep vmtoolsd
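To keep the grep process itself out of the match, pgrep (part of the procps package) can be used instead:

pgrep -l vmtoolsd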
vContinuum Logging
vContinuum is a stateless application that communicates with the CX-CS server to create and manage various protection and recovery plans. Based on user inputs, it pulls/pushes configuration changes from/to the CX-CS server. Detailed exchanges between vContinuum and the CX-CS server are logged on the vContinuum server under:
C:\Program Files (x86)\InMage Systems\vContinuum\logs
C:\Program Files (x86)\InMage Systems\vContinuum\Latest
Important vContinuum log files are:
1. vContinuum.log - Use this log file to follow detailed events for jobs triggered from vContinuum. The following is an example from a recovery plan:
7/30/2013 10:58:53 AM  Parameter grp id: Task1
7/30/2013 10:58:53 AM  Name: Initializing Recovery Plan
7/30/2013 10:58:53 AM  Description: This will initialize the Recovery Plan. It starts the EsxUtil.exe binary for Recovery.
7/30/2013 10:58:53 AM  TaskStatus: Completed
7/30/2013 10:58:53 AM  Logpath: /home/svsystems/vcon/Demo_recovery_35790/EsxUtil.log
7/30/2013 10:58:53 AM  Parameter grp id: Task2
7/30/2013 10:58:53 AM  Name: Downloading Configuration Files
7/30/2013 10:58:53 AM  Description: The files which are going to download from CX are 1. Recovery.xml
7/30/2013 10:58:53 AM  TaskStatus: Completed
7/30/2013 10:58:53 AM  Logpath: /home/svsystems/vcon/Demo_recovery_35790/EsxUtil.log
7/30/2013 10:58:53 AM  Parameter grp id: Task3
7/30/2013 10:58:53 AM  Name: Starting Recovery For Selected VM(s)
7/30/2013 10:58:53 AM  Description: The following operations going to perform in this task: 1. Remove pairs for all the selected VMs 2. Completes network related changes for all VMs 3. Deploys the source disk layout on respective target disk (in case of windows)
7/30/2013 10:58:53 AM  TaskStatus: Completed
7/30/2013 10:58:53 AM  Logpath: /home/svsystems/vcon/Demo_recovery_35790/EsxUtil.log
7/30/2013 10:58:53 AM  Parameter grp id: Task4
7/30/2013 10:58:53 AM  Name: Powering on the recovered VM(s)
7/30/2013 10:58:53 AM  Description: This will power-on all the recovered VMs 1. It will detach all the recovered disks from MT 2. Power-on the recovered VMs
7/30/2013 10:58:53 AM  TaskStatus: InProgress
7/30/2013 10:58:53 AM  Logpath: /home/svsystems/vcon/Demo_recovery_35790/EsxUtil.log
7/30/2013 10:59:18 AM paragroup <FunctionRequest Name="MonitorESXProtectionStatus" Id="" include="No"><Parameter Name="HostIdentification" Value="0451173A-C182-8D46-9C18A7E2E844E42D"/ ><Parameter Name="StepName" Value="Recovery"/><Parameter Name="PlanId" Value="9"/></FunctionRequest>
7/30/2013 10:59:18 AM Count of parametergroup 2
Files in the "C:\Program Files (x86)\InMage Systems\vContinuum\Latest" are XML files that includes inventory information regarding:
•Primary Server under protection.
•Secondary Service Provider ESXi Environment.
•Detailed Storage information in Primary and Secondary site.
•Detailed network profile information in Primary and Secondary site.
Figure 5-7 is an example of a MasterConfigFile (detailed protection plan).
Figure 5-7 MasterConfigFile Example
Configuration and Processing Server Logging
Logs required to troubleshoot the processing server and CX-CS can be accessed directly from the CX-CS UI. A summary of available logs can be found by navigating to Monitor > CX Logs. Table 5-6 shows the logs that are most relevant for troubleshooting.
Troubleshooting directly from Monitor > CX Logs can be overwhelming, since it is an aggregation point for all logs. The best way to retrieve and view logs is to have some context around the failure. Having the right log file corresponding to the failure event simplifies root cause isolation and avoids going down the wrong path due to conflicting error messages across different log files. To simplify the debugging process:
1. Always start from Monitor > File Replications.
2. Filter replication logs based on Status. Status > Failed > Search.
3. Find the corresponding failed job based on application name and then click on view details.
4. From the details window, click on the log to open the corresponding log file relating to the failure event. Figure 5-8 is a screen capture:
Figure 5-8 View Log
InMage Tools
The following InMage tools can be used for troubleshooting:
1. DrvUtil - Primary Server. The DrvUtil utility can be used to inspect and modify InMage DataTap driver settings on the primary server. For monitoring and troubleshooting purposes, the drvutil --ps and --pas options can be useful to monitor I/O size, the number of dirty blocks in queue, the size of the change blocks, and so on. Other features of drvutil should not be used unless instructed by the InMage support team.
2. ConfiguratorAPITestBed - Primary Server. This tool is most useful when troubleshooting initial data protection issues (resync). First confirm that the "dataprotection" process is running; if it is not, use ConfiguratorAPITestBed.exe to check whether svagent is receiving configuration settings from the CX-CS correctly.
C:\Program Files (x86)\InMage Systems>ConfiguratorAPITestBed.exe --default
C:\Program Files (x86)\InMage Systems>ConfiguratorAPITestBed.exe --custom --ip 8.24.81.101 --port 80 --hostid 3F8E5834-AA0C-F246-B915D07CFB5D49CC
This is a Windows-based utility; there is no equivalent Linux version. To find the configuration settings for a Linux primary server, use any available Windows DataTap agent and run the ConfiguratorAPITestBed command using the --hostid option, where the hostid is the ID of the Linux host.
3. cdpcli - Master Target. Use this utility on the MT to gather information regarding replication statistics, I/O patterns, and protected volumes.
c:\Program Files (x86)\InMage Systems>cdpcli.exe --listtargetvolumes
C:\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_C
C:\ESX\4FC26A25-683F-DC48-86F0C31180D4A5C0_C
C:\ESX\85F7759A-2EC9-9348-B8C0AD1E9A3F0331_C
C:\ESX\B6688D89-7471-EA43-AC76FBA7EDB948C6_C
C:\ESX\E08CD805-888E-944D-B60496095D3914DD_C
C:\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_C__SRV
C:\ESX\4FC26A25-683F-DC48-86F0C31180D4A5C0_C__SRV
C:\ESX\85F7759A-2EC9-9348-B8C0AD1E9A3F0331_C__SRV
C:\ESX\B6688D89-7471-EA43-AC76FBA7EDB948C6_C__SRV
C:\ESX\E08CD805-888E-944D-B60496095D3914DD_C__SRV
C:\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_K
C:\ESX\B6688D89-7471-EA43-AC76FBA7EDB948C6_K
C:\ESX\E08CD805-888E-944D-B60496095D3914DD_K
c:\Program Files (x86)\InMage Systems>cdpcli.exe --displaystatistics --vol="C:\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_C"
C:\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_C\ is a symbolic link to C:\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_C
c:\Program Files (x86)\InMage Systems>cdpcli.exe --showsummary --vol="C:\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_C"
Database: E:\Retention_Logs\catalogue\2460F4D5-7C71-5745-9804B2FFB039366A\C\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_C\ef118abbc9\cdpv3.db
Version: 3
Revision: 2
Log Type: Roll-Backward
Disk Space (app): 235020800 bytes
Total Data Files: 10
Recovery Time Range(GMT): 2013/7/31 13:37:8:730:417:7 to 2013/7/31 13:49:13:5:533:1
c:\Program Files (x86)\InMage Systems>
c:\Program Files (x86)\InMage Systems>cdpcli.exe --iopattern --vol="C:\ESX\3F8E5834-AA0C-F246-B915D07CFB5D49CC_C"
Io Profile: