Saturday, 6 January 2024

How to Replace Exadata Cell Disks

 When a disk fails in an Exadata cell server, you will normally receive an alert specific to the cell disk failure, and a service request (SR) is usually created automatically.

1) Modify DISK_REPAIR_TIME, if required

    The DISK_REPAIR_TIME attribute of the disk group controls the maximum acceptable outage duration. 

    Once one or more disks become unavailable to ASM, it will wait for up to the interval specified for DISK_REPAIR_TIME for the disk(s) to come online. 

    If the disk(s) come back online within this interval, a resync operation will occur, where only the extents that were modified while the disks were offline are written to the disks once back online. 

    If the disk(s) do not come back within this interval, ASM will initiate a forced drop of the disk(s), which will trigger a rebalance, in order to restore redundancy using the surviving disks. 

    Once the disk(s) are back online, they will be added to the diskgroup, with all existing extents on those disks being ignored/discarded, and another rebalance will begin. 

    In other words, the DISK_REPAIR_TIME value is the window of time you have to fix the failure before ASM gives up on the disk. 

    This setting is also the countdown timer of ASM to drop the disk(s) that have been taken offline. 

    The default setting for DISK_REPAIR_TIME is 3.6 hours.
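
    The step above can be sketched in SQL, run as SYSASM on the ASM instance. This is a minimal example, not a prescribed procedure: the disk group name DATAC1 is borrowed from the grid disk names later in this post, and 8.5h is an arbitrary illustration value, so substitute your own. To my understanding a changed attribute only affects disks that go offline after the change, which is worth confirming in the ASM documentation for your release.

```sql
-- Show the current repair window for every mounted disk group.
SELECT dg.name  AS diskgroup,
       a.value  AS disk_repair_time
FROM   v$asm_diskgroup dg
JOIN   v$asm_attribute a
       ON a.group_number = dg.group_number
WHERE  a.name = 'disk_repair_time';

-- Extend the window for one disk group.
-- DATAC1 and 8.5h are example values (assumptions), not recommendations.
ALTER DISKGROUP DATAC1 SET ATTRIBUTE 'disk_repair_time' = '8.5h';

-- For disks that are already offline, REPAIR_TIMER shows the seconds
-- remaining before ASM force-drops them.
SELECT name, failgroup, mode_status, repair_timer
FROM   v$asm_disk
WHERE  mode_status = 'OFFLINE';
```

    If the last query returns no rows, no disk is currently counting down towards a forced drop.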

2) Check grid disk status 

    list griddisk attributes name, asmmodestatus, asmdeactivationoutcome   

3) Check the cell alert history to see why the disk is showing offline

    list alerthistory where severity = 'critical'   

4) List all physical disks with a status like ".*failure.*"

    [root@cellserver ~]# cellcli
    CellCLI: Release 22.1.7.0.0 - Production on Thu Jun 22 10:53:30 CDT 2023
    Copyright © 2007, 2023, Oracle and/or its affiliates.

    CellCLI> list physicaldisk where disktype=harddisk and status like ".*failure.*" detail
        name:                   252:8
        deviceId:               18
        deviceName:             /dev/sdi
        diskType:               HardDisk
        enclosureDeviceId:      252
        errOtherCount:          0
        luns:                   0_8
        makeModel:              "HGST    H7222A525SUN010T"
        physicalFirmware:       A680
        physicalInsertTime:     2018-08-27T13:57:22-05:00
        physicalInterface:      sas
        physicalSerial:         R7D85N
        physicalSize:           8.91015625T
        slotNumber:             8
        status:                 warning - predictive failure

    CellCLI> list celldisk where lun=0_8
        CD_08_cellserver    proactive failure

    CellCLI> list griddisk where celldisk=CD_08_cellserver attributes name, size, status, asmmodestatus
        DATAC1_CD_08_cellserver     6.9296875T              proactive failure       DROPPED
        RECOC1_CD_08_cellserver     1.97967529296875T       proactive failure       DROPPED

    ** If the physical disk's status is not yet "dropped for replacement", drop it manually **

    CellCLI> alter physicaldisk 252:8 drop for replacement
        Physical disk 252:8 was dropped for replacement.

    Note: The command checks whether it is safe to remove the disk and then prepares it for replacement. 

    When it completes successfully, the blue LED on the disk is turned on.

    A rough overview of the checks and actions that the command performs in the background:

    - Checks ASM to see whether the partner disks associated with the disk to be replaced are online. 
        These partner disks reside in different storage cells. 
        The command will fail if any of the partner disks are offline in ASM. 
        This is equivalent to running "list griddisk attributes name, asmdeactivationoutcome"

    - If run against one of the two disks that the cell's operating system resides on (slots 0 and 1), 
        it checks whether the MD partner disks are in sync. If they are not, the command will fail.

    - If the above checks are successful, the Management Server (MS) requests that the griddisks residing on the physicaldisk be marked as Inactive. 
        This results in the griddisks being offlined in ASM.

    - The celldisk is dropped (from memory only) and the blue LED is turned on. The disk is now ready for physical replacement.
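
    Before running the drop, the ASM side of the first check can be approximated with a query like the one below. This is only a coarse sketch: it lists every disk that is not online in any disk group rather than just the partners of the failing disk, and it is not the check the cell itself performs. It assumes you are connected to the ASM instance.

```sql
-- List disks that ASM does not currently see as ONLINE, grouped by
-- failure group (on Exadata, one failure group per storage cell).
SELECT dg.name      AS diskgroup,
       d.failgroup,
       d.name       AS disk_name,
       d.mode_status,
       d.state
FROM   v$asm_disk d
JOIN   v$asm_diskgroup dg
       ON dg.group_number = d.group_number
WHERE  d.mode_status <> 'ONLINE'
ORDER  BY dg.name, d.failgroup, d.name;
```

    Offline disks in a failure group other than the cell you are working on are a reason to stop and investigate before pulling the drive.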

    CellCLI> list physicaldisk 252:8 detail
        name:                   252:8
        deviceId:               18
        diskType:               HardDisk
        enclosureDeviceId:      252
        errOtherCount:          0
        makeModel:              "HGST    H7222A525SUN010T"
        physicalFirmware:       A680
        physicalInsertTime:     2018-08-27T13:57:22-05:00
        physicalInterface:      sas
        physicalSerial:         R7D85N
        physicalSize:           8.91015625T
        slotNumber:             8
        status:                 warning - predictive failure - dropped for replacement

    CellCLI> list griddisk where celldisk=CD_08_cellserver attributes name, size, status, asmmodestatus
        DATAC1_CD_08_cellserver     6.9296875T              not present     DROPPED
        RECOC1_CD_08_cellserver     1.97967529296875T       not present     DROPPED

5) If an SR was not created automatically for the disk replacement, create one and provide the disk details above

    Once the SR is created, Oracle creates an internal task and assigns an Oracle Field Engineer (FE) to replace the disk.

    You then schedule a visit with the Oracle Field Engineer and ask them to bring the replacement part.

6) The Oracle FE replaces the hot-swappable disk

7) Once the disk is replaced, verify the following:

    ** Validate the status of the disk re-balancing **

    SQL> set linesize 132
    SQL> col path format a50
    SQL> select group_number,path,header_status,mount_status,name from V$ASM_DISK where path like '%CD_03_abcxcel01';
    GROUP_NUMBER PATH                                     HEADER_STATU MOUNT_S NAME
    ------------ ---------------------------------------- ------------ ------- ------------------------------
               1 o/192.168.9.9/DATA_Q1_CD_03_abcxcel01    MEMBER       CACHED  DATA_Q1_CD_03_abcxcel01
               2 o/192.168.9.9/DBFS_DG_CD_03_abcxcel01    MEMBER       CACHED  DBFS_DG_CD_03_abcxcel01
               3 o/192.168.9.9/RECO_Q1_CD_03_abcxcel01    MEMBER       CACHED  RECO_Q1_CD_03_abcxcel01

    SQL> select * from gv$asm_operation;
       INST_ID GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE
    ---------- ------------ ----- ---- ---------- ---------- ---------- ---------- ---------- ----------- --------------------------------------------
             2            3 REBAL WAIT         10
             1            3 REBAL RUN          10         10       1541       2422       7298           0

    ** Check physical and grid disk details **

    [root@cellserver ~]# cellcli
    CellCLI: Release 22.1.7.0.0 - Production on Thu Jun 22 11:20:21 CDT 2023
    Copyright © 2007, 2023, Oracle and/or its affiliates.

    CellCLI> list griddisk where celldisk=CD_08_cellserver attributes name, size, status, asmmodestatus
        DATAC1_CD_08_cellserver     6.9296875T              active  ONLINE
        RECOC1_CD_08_cellserver     1.97967529296875T       active  ONLINE

    Note: status should be active and asmmodestatus should be ONLINE.
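
    As a final ASM-side confirmation, a query along these lines can show whether any disk group still reports offline disks or a running rebalance. It is a sketch that assumes a connection to the ASM instance; the column selection is just one reasonable choice.

```sql
-- Every disk group should be MOUNTED with OFFLINE_DISKS = 0.
SELECT name, state, type, total_mb, free_mb, offline_disks
FROM   v$asm_diskgroup;

-- No rows here means no rebalance is pending or running on any instance.
SELECT inst_id, group_number, operation, state, est_minutes
FROM   gv$asm_operation;
```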

8) Check the status of all physical disks on all cell nodes to verify that no other disk has failed on any of the cell servers

    dcli -l root -g /root/cell_group "cellcli -e LIST PHYSICALDISK WHERE diskType=HardDisk AND status not like normal DETAIL"

Supporting Oracle MOS Doc IDs

How to Replace a Hard Drive in an Exadata Storage Cell Server (Hard Failure) ( Doc ID 1386147.1 )

How to Replace a Hard Drive in an Exadata Storage Cell Server (Predictive Failure) ( Doc ID 1390836.1 )

Things to Check in ASM When Replacing an ONLINE disk from Exadata Storage Cell (Doc ID 1326611.1)