Introduction
This document describes the meaning of a Punctured Block on a hard drive. It also describes how a Punctured Block occurs and the remediation steps.
What is a Punctured Block?
When a Patrol Read or a Rebuild operation encounters a media error on the source drive, it punctures a block on the target drive to prevent the use of the data with the invalid parity. Any subsequent read operation to the punctured block completes, but with an error. Consequently, the puncturing of a block prevents any invalid parity generation later while using this block.
Source: 12Gb/s MegaRAID® SAS Software User Guide, Rev. F, August 2014
How do Punctured Blocks Happen?
In RAID 5, data and parity are striped across all the member disks. If one of the drives goes bad, its contents can be rebuilt by combining the data and parity on the remaining drives. Several things can cause a puncture, but it usually starts with an array that has a single failed drive while another member drive has many medium errors or is in a Predictive Failure state.
The following link walks through a scenario that explains how an array can become punctured:
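The rebuild arithmetic can be illustrated with a small sketch. It assumes simple XOR parity over one stripe and models a medium error as an unreadable (None) block; the function names are hypothetical and this is not how any particular controller firmware is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(surviving):
    """Rebuild the missing block of a stripe from the surviving blocks.

    `surviving` holds one entry per remaining drive. None models a medium
    error (an unreadable block) on that drive. Returns None when the stripe
    cannot be rebuilt, which is the case where a controller would puncture
    the corresponding block on the target drive.
    """
    if any(block is None for block in surviving):
        return None
    return xor_blocks(surviving)


# One stripe: three data blocks plus their parity block.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xaa\xbb"
parity = xor_blocks([d0, d1, d2])

# The drive holding d1 fails and every other drive is healthy.
recovered = rebuild_block([d0, d2, parity])

# Same failure, but the drive holding d2 returns a medium error.
punctured = rebuild_block([d0, None, parity])

print(recovered == d1)    # the lost block is recovered
print(punctured is None)  # the block cannot be rebuilt
```

With one failed drive the stripe is fully recoverable. A second unreadable block in the same stripe leaves nothing to compute the missing data from, which is the situation that leads to a puncture.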
http://www.theprojectbot.com/what-is-a-punctured-raid-array
After reading it, it should be clear that when a hard disk is replaced without checking the other disks, bad logical blocks or medium errors on those disks can be carried into the rebuild, and any of the other disks may then show up as failed.
A punctured block can potentially occur on multiple drives, with only one drive officially reported as failed. The punctures can then be replicated to replacement disks, which further compounds the issue.
Punctured Block Symptoms
The server may report multiple hard drive failures. Simply replacing the hard drive will NOT fix the issue. In addition, I/O performance may be degraded.
Evidence of a Punctured Block
The logs may contain entries similar to the lines below.
6:2014 Jul 27 00:36:06:BMC:storage:-: SLOT-5: Unexpected sense: PD 0c(e0x12/s5) Path 500000e11986c502, CDB: 28 00 0e 71 66 e7 00 00 19 00, Sense: 3/11/01
6:2014 Jul 27 00:36:06:BMC:storage:-: SLOT-5: Unexpected sense: PD 13(e0x12/s7) Path 50000395083063f6, CDB: 28 00 0e 71 66 eb 00 00 15 00, Sense: 3/11/14
In the above output, e0x12/s5 indicates that the entry relates to HDD5. The following link describes the meaning of the sense codes (for example, Sense: 3/11/14):
http://en.wikipedia.org/wiki/Key_Code_Qualifier
Sense key 3 is Medium Error, so these sense codes indicate medium errors on the drives.
The following events could also be present in the logs:
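The fields in an "Unexpected sense" entry can also be decoded programmatically. The sketch below assumes the log format shown above and decodes only 10-byte READ(10) commands (opcode 0x28); the regular expression and the function name decode_sense_line are illustrative and not part of any vendor tool.

```python
import re

# Matches the PD, enclosure/slot, CDB bytes and sense triplet of an
# "Unexpected sense" entry in the format shown above (an assumption).
SENSE_LINE = re.compile(
    r"PD (?P<pd>[0-9a-f]+)\(e0x(?P<enc>[0-9a-f]+)/s(?P<slot>\d+)\).*?"
    r"CDB: (?P<cdb>(?:[0-9a-f]{2} ?)+), "
    r"Sense: (?P<key>[0-9a-f]+)/(?P<asc>[0-9a-f]+)/(?P<ascq>[0-9a-f]+)"
)


def decode_sense_line(line):
    """Return a dict of decoded fields, or None if the line does not match."""
    m = SENSE_LINE.search(line)
    if m is None:
        return None
    cdb = bytes.fromhex(m.group("cdb"))
    info = {
        "slot": int(m.group("slot")),
        "sense_key": int(m.group("key"), 16),
        "asc": int(m.group("asc"), 16),
        "ascq": int(m.group("ascq"), 16),
    }
    # READ(10): bytes 2-5 hold the starting LBA, bytes 7-8 the block count.
    if len(cdb) == 10 and cdb[0] == 0x28:
        info["lba"] = int.from_bytes(cdb[2:6], "big")
        info["blocks"] = int.from_bytes(cdb[7:9], "big")
    # Sense key 3 is MEDIUM ERROR.
    info["medium_error"] = info["sense_key"] == 0x3
    return info


line = ("6:2014 Jul 27 00:36:06:BMC:storage:-: SLOT-5: Unexpected sense: "
        "PD 0c(e0x12/s5) Path 500000e11986c502, "
        "CDB: 28 00 0e 71 66 e7 00 00 19 00, Sense: 3/11/01")
decoded = decode_sense_line(line)
print(decoded["slot"], hex(decoded["lba"]), decoded["blocks"], decoded["medium_error"])
```

For the first sample entry, the decoded starting LBA is 0xe7166e7, the same address that appears in the "Puncturing bad block" event for that drive.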
1:2014 Jul 16 10:42:43:BMC:storage:-: SLOT-5: Unrecoverable medium error during recovery on PD 0c(e0x12/s5) at e7166e7
1:2014 Jul 16 10:42:43:BMC:storage:-: SLOT-5: Puncturing bad block on PD 0c(e0x12/s5) at e7166e7
1:2014 Jul 19 03:46:22:BMC:storage:-: SLOT-5: Consistency Check detected uncorrectable multiple medium errors (PD 13(e0x12/s7) at e7166d9 on (null))
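When reviewing a large log, it can help to count these events per drive slot. The following is a minimal sketch that assumes the message wording shown above; the event names and the summarize function are hypothetical, and the patterns may need adjusting for other firmware versions.

```python
import re
from collections import defaultdict

# Message fragments taken from the sample events above.
PATTERNS = {
    "medium_error": re.compile(r"Unrecoverable medium error"),
    "puncture": re.compile(r"Puncturing bad block"),
    "cc_uncorrectable": re.compile(r"Consistency Check detected uncorrectable"),
}
# Extracts the slot number from identifiers such as "PD 0c(e0x12/s5)".
PD_SLOT = re.compile(r"PD [0-9a-f]+\(e0x[0-9a-f]+/s(\d+)\)")


def summarize(lines):
    """Count puncture-related events per drive slot."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        slot = PD_SLOT.search(line)
        if slot is None:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[int(slot.group(1))][name] += 1
    return {slot: dict(events) for slot, events in counts.items()}


sample = [
    "1:2014 Jul 16 10:42:43:BMC:storage:-: SLOT-5: Unrecoverable medium error "
    "during recovery on PD 0c(e0x12/s5) at e7166e7",
    "1:2014 Jul 16 10:42:43:BMC:storage:-: SLOT-5: Puncturing bad block on "
    "PD 0c(e0x12/s5) at e7166e7",
    "1:2014 Jul 19 03:46:22:BMC:storage:-: SLOT-5: Consistency Check detected "
    "uncorrectable multiple medium errors (PD 13(e0x12/s7) at e7166d9 on (null))",
]
summary = summarize(sample)
print(summary)
```

Events that appear on more than one slot suggest that the problem is not limited to a single failing drive.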
Possible Remediation
Whenever punctured blocks appear, a data backup is highly recommended. When presented with the messages mentioned above, the inclination may be to look for the actual failing hard drive and replace it. However, there is a chance that multiple bad logical blocks have spread across the array. Although failed or failing hard drive(s) may have been the cause, punctured blocks are resolved only by re-creating the affected virtual drive(s):
- Create a data backup
- Erase the RAID array configuration
- Create a new array from scratch
Note: While creating the VD (Virtual Drive), select FULL/SLOW initialization instead of FAST initialization.
- Reinstall the operating system
- Restore the data backup.
Note: Replacing hard drives will NOT fix punctured blocks by itself. If there is a failed drive, it should be replaced, but the array still needs to be re-created as described above.
Preventing Punctured Blocks
- Monitor RAIDs and the health of their member drives.
- Prior to replacing any hard drives, review controller logs.
- Ensure Patrol Reads and Consistency Checks are turned on and running (check against bug CSCul22968).