HP 3PAR disk replacement

How to deal with a failed drive on an HP 3PAR array.

This article covers disk replacement on 3PAR, for administrators who want to know a little more about the background of the procedure.

3PAR logical layer

When talking about disk replacement on 3PAR, the logical layer cannot be omitted, as it is the fundamental concern of the drive replacement procedure. The logical layer of 3PAR consists of a few levels. Overall, the structure is not complicated, starting with physical disks and ending with Virtual Volumes.

physical disks (PD) → logical disks (LD) → Common Provisioning Groups (CPGs) → Virtual Volumes (VV)

Physical disks are divided into chunklets; starting with the 7000 series, chunklets have a fixed size of 1 GB. 3PAR then uses the chunklets to build LDs. All of this happens without any administrator involvement. The chunklet is the basic logical unit in 3PAR terminology. Thanks to this approach we get nicely virtualized storage with a virtual RAID approach, which gives a lot more flexibility, also in terms of redundancy. On the other hand, when some blocks within a specific chunklet are unreadable, the whole chunklet (1 GB) is marked as failed.
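
Each level of this stack has its own show command in the CLI, so the layers can be inspected from the bottom up. A minimal sketch (the prompt matches the examples used throughout this article):

3PAR-cluster cli% showpd     (physical disks and their chunklet usage)
3PAR-cluster cli% showld     (logical disks built from chunklets)
3PAR-cluster cli% showcpg    (CPGs drawing space from the LDs)
3PAR-cluster cli% showvv     (virtual volumes presented to hosts)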

3PAR and RAID protection

3PAR offers a virtualized approach to RAID creation. RAID is defined during CPG creation, and its behavior can be adjusted by the administrator according to needs. RAID is based on chunklets, not on physical disks. Thanks to that, we can create a CPG tuned for performance, or we can use the slower sectors of a physical disk, for example for backup destaging, where performance is not so important. This gives us full control over shaping our storage resources and the environment underneath.

For example, to see the details of already created CPGs, use the showcpg command with suitable parameters.

3PAR-cluster cli% showcpg -sdg
------(MB)------
Id Name Warn Limit Grow Args
0 CPG_FC_R5 - - 86304 -t r5 -ha cage -ssz 4 -ss 128 -ch first -p -devtype FC
1 CPG_SSD_R5 - - 23546 -t r5 -ha cage -ssz 4 -ss 64 -ch first -p -devtype SSD
2 CPG_DESTAGING - - 86304 -t r5 -ssz 4 -ss 128 -ch last -p -devtype FC
3 FC_r1 - - 65536 -ssz 2 -ha cage -t r1 -p -devtype FC
4 FC_r6 - - 65536 -ssz 8 -ha cage -t r6 -p -devtype FC
5 SSD_r1 - - 16384 -ssz 2 -ha cage -t r1 -p -devtype SSD
6 CPG_SSD_DEDUP - - 65536 -t r5 -ha cage -ssz 4 -ss 64 -ch first -p -devtype SSD

The exact working principle will be explained in another article. For now, it is good to know that 3PAR allows you to shape RAID groups with the following options (a short example follows this list):

  • -t: type of RAID (r0, r1, r5, r6).
  • -ha: specifies how the members of a RAID set are distributed for high availability. The policy can be based on cage (the default), magazine, or back-end port.
  • -ssz: stands for set size in terms of chunklets. The default depends on the RAID level, for example 2 chunklets for RAID 1, 4 chunklets for RAID 5, 8 chunklets for RAID 6.
  • -ss: sets the step size in kilobytes, i.e. the amount of contiguous data written to one chunklet of the set before the stripe moves to the next one.
  • -ch: the type of chunklets preferred for building the stripe, from the lowest or highest available numbering (the outer/inner zones of the disks).
  • -p: a pattern used when creating LDs, for example by disk type (FC, SSD, NL).
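
As a minimal sketch, the RAID 6 CPG from the listing above could have been created with the createcpg command and the same options; the CPG name is just an example:

3PAR-cluster cli% createcpg -t r6 -ha cage -ssz 8 -p -devtype FC FC_r6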

How 3PAR deals with spares

Some chunklets are promoted to spares during the first setup of a system. The 3PAR algorithms build LDs from chunklets in a way that maximizes the usage of the outer zones of each disk. Since spare chunklets should be used only temporarily, in emergency situations, 3PAR decided to assign spare chunklets to the inner zones of the disks. The details of the spare chunklets visible on your 3PAR can be shown with the showspare command.

3PAR-cluster cli% showspare

Deep investigation is not needed to see that the spare chunklets on each drive have high numbers; hence the chunklets designated as spares come from the inner zones of the disks, which is obviously good.

But that’s not all. The number of chunklets that are candidates for spares is determined by a policy, which can follow the vendor-recommended presets or be chosen by us. We can distinguish the following sparing algorithms (see the sketch after this list):

  • default: the amount of one full disk for every 40 disks, with a required minimum of 4 disks’ worth.
  • minimum: the same as default, but without the required minimum.
  • maximum: the amount of one full disk per drive cage, so-called “cage level high availability”.
  • custom: defined by the user, but the administrator should remember to add spares when adding new drives.
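
The active policy is kept as a system parameter. A minimal sketch of checking and changing it, assuming the parameter is named SparingAlgorithm with values such as Default, Minimal, Maximal and Custom, as on InForm OS 3.x:

3PAR-cluster cli% showsys -param                     (look for the SparingAlgorithm line)
3PAR-cluster cli% setsys SparingAlgorithm Maximal    (switch to one disk's worth per cage)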

Remember the vendor recommendation to create spare chunklets during the first system initialization, as this is when the layout of the system is established and spare chunklets can be distributed evenly among all physical disks.

Logging logical disks and spares

In case of a physical disk failure, all new writes that would have been committed to the failed drive are redirected to a logging logical disk. When the drive comes back online, or when the time limit for logging is reached, relocation is performed to free chunklets marked as spare chunklets.

To see the Logical Disks that are marked as logging disks, grep for log in the showld output as shown below.

ssh 3PAR-cluster showld |grep log
Id Name          RAID -Detailed_State- Own       SizeMB   UsedMB Use  Lgct LgId WThru MapV
5 log0.0           1 normal           0/-/-/-    20480        0 log     0  ---     Y    N
6 log1.0           1 normal           1/-/-/-    20480        0 log     0  ---     Y    N
7 log2.0           1 normal           2/-/-/-    20480        0 log     0  ---     Y    N
8 log3.0           1 normal           3/-/-/-    20480        0 log     0  ---     Y    N

Following the official guide, we can interpret the information related to logging disks as follows:

  • Column Use: log in a cell under this column means that the logical disk is used as a logging logical disk.
  • Column Lgct: the number of chunklets that are in logging mode in the logical disk.
  • Column LgId: the ID of the logging disk that is being used for logging by the logical disk.

Important information

A logging logical disk is an entity that is entirely created and managed by the 3PAR system.

Replacing failed disk

The most common task on any storage array is dealing with failed drives. Storage arrays do tremendous work with our data, especially if the cache hit ratio is not at a remarkable level. The question is how to deal with a failed drive, and what we should pay attention to.

At some point 3PAR starts spitting out many alerts regarding one of the disks, signalling that the situation is becoming serious.

2015-12-06 04:36:38 GMT 0 Informational Disk event hw_disk:5000C50075EB86C4 pd 7 port b0 on 0:0:1: cmdstat:0x00 (TE_PASS -- Success), scsistat:0x02 (Check condition), snskey:0x01 (Recovered error), asc/ascq:0x5d/0x0 (Failure prediction threshold exceeded), info:0x0, cmd_spec:0x0, sns_spec:0x50000, host:0x0, abort:0, CDB:2A00562EE38800001800 (Write10), blk:0x562ee388, blkcnt 0x18, fru_cd:0x32, LUN:0, LUN_WWN:0000000000000000 after 0.007s, toterr:1808, deverr:1138
2015-12-06 04:37:41 GMT 0 Degraded Disk abort hw_disk:5000C50075EB86C4;sw_pd:7 pd 7 port b0 on 0:0:1: scsi abort/sick/hwerr status TE_SMART_THRESH
2015-12-06 04:41:49 GMT 0 Informational Disk event hw_disk:5000C50075EB86C4 pd 7 port b0 on 0:0:1: cmdstat:0x00 (TE_PASS -- Success), scsistat:0x02 (Check condition), snskey:0x01 (Recovered error), asc/ascq:0x5d/0x0 (Failure prediction threshold exceeded), info:0x0, cmd_spec:0x0, sns_spec:0x50000, host:0x0, abort:0, CDB:2A000021D7C000004000 (Write10), blk:0x21d7c0, blkcnt 0x40, fru_cd:0x32, LUN:0, LUN_WWN:0000000000000000 after 0.008s, toterr:1813, deverr:1143

After that, it was only a matter of time before the disk crashed completely.

2015-12-06 04:47:42 GMT 1 Informational Disk state change sw_pd:7 pd 7 wwn 5000C50075EB86C4 changed state from valid to missing because disk gone event was received for this disk.
2015-12-06 04:47:42 GMT 1 Informational Disk state change hw_disk:5000C50075EB86C4 pd wwn 5000C50075EB86C45000C50075EB86C4 changed state from valid to missing because disk gone event was received for this disk.

Let’s check the situation with the disk marked as failed. If you are uncertain which drive failed, you can list the failed PDs using the showpd command with the -failed option. The showpd command displays information about the system’s physical disks.

3PAR-cluster cli% showpd -failed
                           -Size(MB)-- ----Ports----
Id CagePos Type RPM State   Total Free A      B      Capacity(GB)
 7 0:7:0?  FC    10 failed 838656    0 -----  -----           900
-----------------------------------------------------------------
 1 total                   838656    0

Use showpd again with the -c parameter, which gives visibility into the chunklets.

3PAR-cluster cli% showpd -c 7
------- Normal Chunklets -------- ---- Spare Chunklets ----
- Used - -------- Unused -------- - Used - ---- Unused ----
Id CagePos Type State  Total OK  Fail Free Uninit Unavail Fail OK  Fail Free Uninit Fail
7 0:7:0?  FC   failed   819  0     0    0      0     667  152  0     0    0      0    0
----------------------------------------------------------------------------------------
1 total                 819  0     0    0      0     667  152  0     0    0      0    0

To see detailed information about the chunklets within a disk, you can use the showpdch command.

3PAR-cluster cli% showpdch 7

The -i parameter shows the disk inquiry details (manufacturer, model, serial number, firmware revision).

3PAR-cluster cli% showpd -i 7
Id CagePos State  ----Node_WWN---- --MFR-- -----Model------ -Serial- -FW_Rev- Protocol MediaType -----AdmissionTime-----
7 0:7:0?  failed 5000C50075EB86C4 SEAGATE SLTN0900S5xnN010 S0N1L0LN 3P01     SAS      Magnetic  2014-07-15 12:40:41 IST
------------------------------------------------------------------------------------------------------------------------
1 total

After the new disk is inserted, the event log will record it.

2015-12-08 12:03:06 GMT 1 Informational Disk state change sw_pd:160 pd 160 wwn 5000CCA0714AC3D3 changed state from new to valid because disk was admitted successfully.
2015-12-08 12:03:06 GMT 1 Informational Disk state change hw_disk:5000CCA0714AC3D3 pd wwn 5000CCA0714AC3D35000CCA0714AC3D3 changed state from new to valid because disk was admitted successfully.
2015-12-08 12:03:06 GMT 1 Informational Object added hw_disk:5000CCA0714AC3D3 Disk 5000CCA0714AC3D3 added

Important information

  • Remember that the pd_id of the replacement drive will be different from that of the failed drive. In this example the pd_id of the failed drive is 7, and the replacement drive has been assigned the first available ID, which is 160.
  • The array will still signal a degraded status until the relocation is completed (see the check after this list).
  • After the relocation, the pd_id assigned to the new drive disappears and the replacement disk becomes visible under the pd_id previously assigned to the failed drive.
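
To confirm when the array has returned to a normal state, the overall condition can be checked from the CLI; the checkhealth command prints a summary of any remaining issues:

3PAR-cluster cli% checkhealth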

However, all chunklets that resided on the failed drive must be relocated from the logging drive and the spare area back to the replacement disk. To see the progress, use the servicemag command or monitor the chunklets.

3PAR-cluster cli% servicemag status
Cage 0, magazine 7:
The magazine is being brought online due to a servicemag resume.
The last status update was at Tue Dec  8 12:04:07 2015.
Chunklets relocated: 404 in 3 hours, 17 minutes and 25 seconds
Chunklets remaining: 762
Chunklets marked for moving: 762
Estimated time for relocation completion based on 29 seconds per chunklet is: 6 hours, 8 minutes and 18 seconds
servicemag resume 0 7 -- is in Progress

Unfortunately, the above method is not 100% reliable, and from time to time the number of chunklets and the time estimate for the operation are displayed incorrectly. To check what the process really looks like, you can use the showpdch command with the -mov parameter. At the end of the output, the sum of chunklets remaining to be moved is shown.

3PAR-cluster cli%  showpdch -mov 160

What if I want to replace the disk manually?

If you see that it is only a matter of time before your disk fails, and you have a spare drive in your closet, then you are free to do the replacement on your own.

You can do this in at least two ways. The most common is to use the servicemag utility, but from time to time the command fails for some reason and disk replacement that way is not possible.

Let’s say that our disk layout looks like in the image presented below.

[Image: showpd output]

The -log parameter determines that write operations to the chunklets of the specified drive are committed to the logging disk. The -pdid parameter obviously stands for the disk ID. You could also add the -wait parameter to servicemag start; the task will then not run in the background, and you will have visibility into the whole process.

The command to use is servicemag start -log -pdid 8, as shown in the image below.

[Image: servicemag start output]
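
Putting it together, a typical servicemag flow looks roughly like this; the resume form matches the status output shown earlier, and the cage and magazine numbers depend on where the drive sits:

3PAR-cluster cli% servicemag start -log -wait -pdid 8    (log writes, prepare the magazine for service)
(physically replace the drive)
3PAR-cluster cli% servicemag resume <cage> <mag>         (bring the magazine back and relocate chunklets)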

Check whether the command finished successfully with the servicemag status command.

3PAR-cluster cli% servicemag status -d

According to documentation:

Any I/O on the chunklets marked normal,smag changes their state to logging, and the I/O is written to the logging logical disks.

Manual disk replacement procedure

In case the servicemag command fails for some reason, you are forced to do the replacement manually, using a whole set of commands (a condensed sketch of the full sequence follows the list).

  1. The first thing to do is to stop the disk from being used. To achieve this, 3PAR has a special command.

    3PAR-cluster cli% setpd ldalloc off <pd_id>

    To see the detailed state of the disk, use the showpd -s command.

    3PAR-cluster cli% showpd -s <pd_id>

    [Image: setpd ldalloc off and showpd -s output]

  2. Now you can initiate the process of moving data from the specified physical disk to locations chosen by the system, which is one of the main steps of the disk replacement. The suitable command is movepdtospare with the -vacate option. The vacate option makes the moves permanent and removes the source tags after relocation. The -f parameter means that no confirmation is required.
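
    A minimal sketch of the call, with <pd_id> as a placeholder:

    3PAR-cluster cli% movepdtospare -vacate -f <pd_id>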
    [Image: movepdtospare output]

    In case this command fails, you will be forced to do it manually, chunklet by chunklet.

    3PAR-cluster cli% movech -perm -ovrd <pd_id>:<chunklet_location>

    where:
    -perm: chunklets are moved permanently and the original location is forgotten.
    -ovrd: allows moving a chunklet to a destination even if it impacts quality. This option is necessary with the -perm parameter.

  3. Time to see whether we have any spare chunklets on the disk designated for removal, as the previous step only moved data chunklets. To display the chunklets marked as spare, use the showpdch -spr command.

    3PAR-cluster cli% showpdch -spr <pd_id>

    [Image: showpdch -spr output]

  4. Now the spare chunklets have to be removed as well. The command designated for that kind of task is shown below; it removes all spare chunklets from the disk. After execution, check again whether any spares remain.

    3PAR-cluster cli% removespare <pd_id>:a

    [Image: removespare output]

  5. After all the previous steps you can safely remove the physical disk definition from the system. Hold off on the physical disk replacement at this step.

    3PAR-cluster cli% dismisspd <pd_id>

    [Image: dismisspd output]

  6. Check whether the dismissed disk shows up as new. If yes, then it can be safely removed from the magazine.

    [Image: dismissed disk shown as new]

  7. In case you put in a new disk and it is not automatically added to the system, you have to do it manually. The first thing is to determine the WWN of the disk. Check this with the showpd -i command.

    3PAR-cluster cli% showpd -i <pd_id>

    [Image: showpd -i output with the new disk's WWN]

    After that, use the admitpd command to make the new disk operational in the system.

    3PAR-cluster cli% admitpd <disk_wwn>

    [Image: admitpd output]
    At the end, run tunesys to create the proper layout of chunklets within the CPGs.

    3PAR-cluster cli% tunesys

    [Image: tunesys output]
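
Putting the whole manual procedure together, here is a condensed sketch of the sequence; <pd_id> and <disk_wwn> are placeholders, and every command is the one used in the steps above:

3PAR-cluster cli% setpd ldalloc off <pd_id>          (stop allocating LD space on the disk)
3PAR-cluster cli% movepdtospare -vacate -f <pd_id>   (permanently move data chunklets off)
3PAR-cluster cli% showpdch -spr <pd_id>              (list remaining spare chunklets)
3PAR-cluster cli% removespare <pd_id>:a              (remove all spare chunklets)
3PAR-cluster cli% dismisspd <pd_id>                  (remove the PD definition)
(physically swap the drive)
3PAR-cluster cli% admitpd <disk_wwn>                 (admit the new disk, if not automatic)
3PAR-cluster cli% tunesys                            (rebalance chunklets within the CPGs)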

6 thoughts on “HP 3PAR disk replacement”

  1. Regmen
    Thanks a lot for your nice explanation.
    I have another question: what happens if the failed disk is so damaged that no chunklets can be moved
    (“chunklets – move_error, disk_relocating, will retry”)? How do you proceed? If those chunklets are lost,
    how and when do they get rebuilt?

    1. If some chunklets are not readable, the data will be rebuilt from parity (unless you built the CPG using RAID 0). Anyway, the whole process starts when the disk fails, and it uses spare space (or even free chunklets) from the other drives.

  2. Hi
    I am using an HPE 3PAR StoreServ 7200 with 16 disks and 8 empty HDD bays, so I am going to add 4 new disks. I inserted the new disks and did every single step right, but somehow the status says “degraded” for a while and then “failed”. What can I do?
    Thanks a lot, and sorry for my bad English.

  3. Thank you for a great article! I am new to 3PAR, and I have some questions.
    Assume pd12 failed; we run “servicemag start” to move the chunklets off the failed pd12, and then replace the failed pd12 with a new pd in the same slot.
    The new pd is now pd160, and we run “servicemag resume”, which starts moving the chunklets of pd12 back to pd160. About half of the total chunklets have been moved back when pd160 itself fails, causing “servicemag resume” to fail.
    Questions:
    1) What do you recommend to be done?
    2) What happened to the chunklets already moved back onto the new pd160? Are they still in the spare area?
    3) In the manual steps, should we run “setpd ldalloc on” again after “dismisspd 12” (assuming pd12 is the failed pd)?
    4) In the manual steps, we run “movepdtospare -f -vacate -dr 12” to move the chunklets off the failed pd12 to spare chunklets. How can I move them back to the new pd160 (assuming pd160 is the new, good drive)?

    Thank you.
    4) … running “tunesys”
