Hardik's DBA Blog: How to Replace Exadata Cell Disks

When a disk in one of your Exadata cell servers has a problem, you should receive an alert specific to the cell disk failure, and an automatic SR should be created.

1) Modify DISK_REPAIR_TIME, if required

The DISK_REPAIR_TIME attribute of the disk group controls the maximum acceptable outage duration.

Once one or more disks become unavailable to ASM, it will wait for up to the interval specified for DISK_REPAIR_TIME for the disk(s) to come online.

If the disk(s) come back online within this interval, a resync operation will occur, where only the extents that were modified while the disks were offline are written to the disks once back online.

If the disk(s) do not come back within this interval, ASM will initiate a forced drop of the disk(s), which will trigger a rebalance, in order to restore redundancy using the surviving disks.

Once the disk(s) are back online, they will be added to the diskgroup, with all existing extents on those disks being ignored/discarded, and another rebalance will begin.

In other words, the DISK_REPAIR_TIME value is the length of time you have to fix the failure.

This setting is also the countdown timer of ASM to drop the disk(s) that have been taken offline.

The default setting for DISK_REPAIR_TIME is 3.6 hours.
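To make the timer logic concrete, here is a minimal Python sketch of the behaviour described above. The helper names (repair_time_to_minutes, outcome_for_outage) are hypothetical, and the code only models the decision; it does not talk to ASM. It assumes the attribute value carries an h or m suffix (for example 3.6h or 30m), and that a bare number means hours.

```python
def repair_time_to_minutes(value: str) -> float:
    """Convert a DISK_REPAIR_TIME string such as '3.6h' or '30m' to minutes."""
    text = value.strip().lower()
    if text.endswith("m"):
        return float(text[:-1])
    if text.endswith("h"):
        return float(text[:-1]) * 60
    # Assumption: a bare number is interpreted as hours.
    return float(text) * 60


def outcome_for_outage(outage_minutes: float, disk_repair_time: str = "3.6h") -> str:
    """Model what happens to a disk that was offline for outage_minutes."""
    if outage_minutes <= repair_time_to_minutes(disk_repair_time):
        # Disk came back in time: only extents changed while offline are rewritten.
        return "resync"
    # Timer expired: the disk is force-dropped and a rebalance restores redundancy.
    return "forced drop"


print(outcome_for_outage(90))    # resync
print(outcome_for_outage(300))   # forced drop
```

With the 3.6 hour default, a 90 minute outage falls inside the window, while a 300 minute outage does not.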

2) Check grid disk status

list griddisk attributes name, asmmodestatus, asmdeactivationoutcome
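If you capture the output of this command to a file, a short script can flag grid disks that are not safe to take offline. The sketch below is only an illustration: the function names are hypothetical, the sample text is made up rather than captured from a real cell, and it assumes each output line is "name asmmodestatus asmdeactivationoutcome" with "Yes" meaning the disk can be deactivated safely.

```python
def parse_griddisk_line(line: str) -> dict:
    """Split one output line into name, asmmodestatus and asmdeactivationoutcome."""
    # The outcome may contain spaces, so split on whitespace at most twice.
    name, asm_mode, outcome = line.split(None, 2)
    return {
        "name": name,
        "asmmodestatus": asm_mode,
        "asmdeactivationoutcome": outcome.strip(),
    }


def unsafe_to_offline(output: str) -> list:
    """Return grid disks whose asmdeactivationoutcome is anything other than 'Yes'."""
    rows = [parse_griddisk_line(line) for line in output.splitlines() if line.strip()]
    return [row["name"] for row in rows if row["asmdeactivationoutcome"] != "Yes"]


# Illustrative sample only, not real CellCLI output.
sample = """\
DATAC1_CD_07_cellserver  ONLINE  Yes
DATAC1_CD_08_cellserver  ONLINE  Cannot de-activate due to other offline disks in the diskgroup
"""
print(unsafe_to_offline(sample))  # ['DATAC1_CD_08_cellserver']
```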

3) Check the cell alert history to see why the disk is showing offline

list alerthistory where severity = 'critical'

4) List all physical disks with a status like ".*failure.*"

   [root@cellserver ~]# cellcli
   CellCLI: Release 22.1.7.0.0 - Production on Thu Jun 22 10:53:30 CDT 2023
   Copyright © 2007, 2023, Oracle and/or its affiliates.

CellCLI> list physicaldisk where disktype=harddisk and status like ".*failure.*" detail
name: 252:8
deviceId: 18
deviceName: /dev/sdi
diskType: HardDisk
enclosureDeviceId: 252
errOtherCount: 0
luns: 0_8
makeModel: "HGST H7222A525SUN010T"
physicalFirmware: A680
physicalInsertTime: 2018-08-27T13:57:22-05:00
physicalInterface: sas
physicalSerial: R7D85N
physicalSize: 8.91015625T
slotNumber: 8
status: warning - predictive failure

CellCLI> list celldisk where lun=0_8
CD_08_cellserver proactive failure

CellCLI> list griddisk where celldisk=CD_08_cellserver attributes name, size, status, asmmodestatus
DATAC1_CD_08_cellserver 6.9296875T proactive failure DROPPED
RECOC1_CD_08_cellserver 1.97967529296875T proactive failure DROPPED
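The detail output of list physicaldisk is a series of "key: value" lines, which makes it easy to pick out the fields you need for the SR (name, slot, serial, status). Below is a small Python sketch, using a few of the lines from the output above; parse_detail is a hypothetical helper name, not part of any Oracle tool.

```python
def parse_detail(output: str) -> dict:
    """Turn 'key: value' lines from a CellCLI detail listing into a dict."""
    attrs = {}
    for line in output.splitlines():
        # Split on the first colon only, since values such as 252:8 contain colons.
        key, sep, value = line.partition(":")
        if sep:
            attrs[key.strip()] = value.strip().strip('"')
    return attrs


detail = """\
name: 252:8
diskType: HardDisk
luns: 0_8
physicalSerial: R7D85N
slotNumber: 8
status: warning - predictive failure
"""

disk = parse_detail(detail)
needs_replacement = "failure" in disk["status"]
print(disk["name"], disk["slotNumber"], needs_replacement)  # 252:8 8 True
```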

** If the physical disk's status is not yet "dropped for replacement", drop it for replacement **

CellCLI> alter physicaldisk 252:8 drop for replacement
Physical disk 252:8 was dropped for replacement.

Note: The command performs checks to determine whether it is safe to remove the disk, then prepares it for replacement.

Upon its successful completion, the blue LED on the disk will be turned on.

A rough overview of the checks and actions that the command performs in the background:

- Checks ASM to see if the partner disks associated with the disk to be replaced are online.
  These partner disks reside in different storage cells.
  The command will fail if any of the partner disks are offline in ASM.
  This is equivalent to running "list griddisk attributes name, asmdeactivationoutcome".

- If run against one of the two disks that the cell's operating system resides on (slots 0 and 1),
  it will check if the MD partner disks are in sync. If they are not, the command will fail.

- If the above checks are successful, the Management Server (MS) will request that the griddisks residing on the physicaldisk be marked as inactive.
  This results in the griddisks being offlined in ASM.

- The celldisk is dropped (from memory only) and the blue LED is turned on. The disk is now ready for physical replacement.
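The order of these checks can be summarised in a few lines of Python. This is only a model of the sequence described above, under the assumption that the checks run in the order listed; the class and function names are hypothetical and this is not how the Management Server is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class DropRequest:
    partner_disks_online: bool        # ASM partner disks (in other cells) are all online
    is_system_disk: bool              # disk is in slot 0 or 1 and holds the cell OS
    md_partners_in_sync: bool = True  # only relevant for system disks


def drop_for_replacement_precheck(req: DropRequest) -> tuple:
    """Return (ok, message) following the checks listed above."""
    if not req.partner_disks_online:
        return (False, "failed: a partner disk is offline in ASM")
    if req.is_system_disk and not req.md_partners_in_sync:
        return (False, "failed: MD partner disks are not in sync")
    # Checks passed: griddisks go inactive, celldisk is dropped, blue LED turns on.
    return (True, "ready for physical replacement")


print(drop_for_replacement_precheck(DropRequest(True, False)))
print(drop_for_replacement_precheck(DropRequest(False, False)))
```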

CellCLI> list physicaldisk 252:8 detail
name: 252:8
deviceId: 18
diskType: HardDisk
enclosureDeviceId: 252
errOtherCount: 0
makeModel: "HGST H7222A525SUN010T"
physicalFirmware: A680
physicalInsertTime: 2018-08-27T13:57:22-05:00
physicalInterface: sas
physicalSerial: R7D85N
physicalSize: 8.91015625T
slotNumber: 8
status: warning - predictive failure - dropped for replacement

CellCLI> list griddisk where celldisk=CD_08_cellserver attributes name, size, status, asmmodestatus
DATAC1_CD_08_cellserver 6.9296875T not present DROPPED
RECOC1_CD_08_cellserver 1.97967529296875T not present DROPPED

5) Create the SR if it was not created automatically for the disk replacement, and provide the disk details above

Once the SR is created, Oracle will create an internal task and assign an Oracle Field Engineer to replace the disk.

You need to schedule a visit with the Oracle Field Engineer and ask them to bring the replacement part.

6) The Oracle FE will replace the hot-swappable disk

7) Once the disk is replaced,

** Validate the status of the disk rebalance **

SQL> set linesize 132
SQL> col path format a50
SQL> select group_number,path,header_status,mount_status,name from V$ASM_DISK where path like '%CD_03_abcxcel01';

GROUP_NUMBER PATH                                  HEADER_STATU MOUNT_S NAME
------------ ------------------------------------- ------------ ------- ------------------------------
           1 o/192.168.9.9/DATA_Q1_CD_03_abcxcel01 MEMBER       CACHED  DATA_Q1_CD_03_abcxcel01
           2 o/192.168.9.9/DBFS_DG_CD_03_abcxcel01 MEMBER       CACHED  DBFS_DG_CD_03_abcxcel01
           3 o/192.168.9.9/RECO_Q1_CD_03_abcxcel01 MEMBER       CACHED  RECO_Q1_CD_03_abcxcel01

SQL> select * from gv$asm_operation;

   INST_ID GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE
---------- ------------ ----- ---- ---------- ---------- ---------- ---------- ---------- ----------- --------------------------------------------
         2            3 REBAL WAIT         10
         1            3 REBAL RUN          10         10       1541       2422       7298           0
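The SOFAR, EST_WORK and EST_RATE columns can be combined into a rough progress figure. The Python sketch below uses the values from the REBAL RUN row above; the function names are hypothetical, and it assumes SOFAR and EST_WORK are counts of allocation units and EST_RATE is allocation units moved per minute.

```python
def rebalance_progress(sofar: int, est_work: int) -> float:
    """Percentage of the estimated rebalance work already done."""
    if est_work <= 0:
        return 0.0
    return 100.0 * sofar / est_work


def minutes_remaining(sofar: int, est_work: int, est_rate: int) -> float:
    """Rough time left, assuming the current rate holds."""
    if est_rate <= 0:
        raise ValueError("est_rate must be positive")
    return max(est_work - sofar, 0) / est_rate


# Values from the REBAL RUN row: SOFAR=1541, EST_WORK=2422, EST_RATE=7298.
print(f"{rebalance_progress(1541, 2422):.1f}% done")              # 63.6% done
print(f"{minutes_remaining(1541, 2422, 7298):.2f} minutes left")  # 0.12 minutes left
```

The result is well under a minute, which is consistent with the EST_MINUTES value of 0 in that row.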

** Check physical and grid disk details **

CellCLI> list griddisk where celldisk=CD_08_cellserver attributes name, size, status, asmmodestatus
DATAC1_CD_08_cellserver 6.9296875T active ONLINE
RECOC1_CD_08_cellserver 1.97967529296875T active ONLINE

Note: status and asmmodestatus should be active and ONLINE, respectively.
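This final check is easy to script. The Python sketch below parses the "name size status asmmodestatus" lines shown in this post and reports whether every grid disk is active and ONLINE. The function names are hypothetical; the parsing assumes the first two fields and the last field are single words, while status may contain spaces (for example "proactive failure" or "not present").

```python
def parse_griddisk_status(line: str) -> dict:
    """Split a 'name size status asmmodestatus' line; status may be several words."""
    parts = line.split()
    return {
        "name": parts[0],
        "size": parts[1],
        "status": " ".join(parts[2:-1]),
        "asmmodestatus": parts[-1],
    }


def all_healthy(output: str) -> bool:
    """True only if every listed grid disk is active in the cell and ONLINE in ASM."""
    rows = [parse_griddisk_status(line) for line in output.splitlines() if line.strip()]
    return bool(rows) and all(
        row["status"] == "active" and row["asmmodestatus"] == "ONLINE" for row in rows
    )


before = """\
DATAC1_CD_08_cellserver 6.9296875T not present DROPPED
RECOC1_CD_08_cellserver 1.97967529296875T not present DROPPED
"""
after = """\
DATAC1_CD_08_cellserver 6.9296875T active ONLINE
RECOC1_CD_08_cellserver 1.97967529296875T active ONLINE
"""
print(all_healthy(before), all_healthy(after))  # False True
```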

8) Check the status of all physical disks on all cell nodes to verify that no other disk has failed on any of the cell servers

dcli -l root -g /root/cell_group "cellcli -e list physicaldisk where diskType=HardDisk and status!=normal detail"
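dcli prefixes each output line with the name of the cell it came from, so the combined output can be grouped per cell. The Python sketch below shows one way to do that; the cell names and sample lines are made up for illustration, and group_by_cell is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_cell(dcli_output: str) -> dict:
    """Group dcli output lines by the 'cellname: ' prefix on each line."""
    by_cell = defaultdict(list)
    for line in dcli_output.splitlines():
        host, sep, rest = line.partition(": ")
        if sep:
            by_cell[host].append(rest.strip())
    return dict(by_cell)


# Illustrative sample only; cell01 and cell02 are hypothetical cell names.
sample = """\
cell01: name: 252:8
cell01: status: warning - predictive failure
cell02: name: 252:3
cell02: status: failed
"""

for cell, lines in group_by_cell(sample).items():
    statuses = [l for l in lines if l.startswith("status:")]
    print(cell, statuses)
```

Any cell that appears in the result has at least one hard disk that is not in normal status and needs a closer look.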

Supporting Oracle MOS Doc IDs

How to Replace a Hard Drive in an Exadata Storage Cell Server (Hard Failure) ( Doc ID 1386147.1 )

How to Replace a Hard Drive in an Exadata Storage Cell Server (Predictive Failure) ( Doc ID 1390836.1 )

Things to Check in ASM When Replacing an ONLINE disk from Exadata Storage Cell (Doc ID 1326611.1)

Saturday, 6 January 2024