When a disk in one of your Exadata cell servers has a problem, you should receive an alert specific to the cell disk failure, and an automatic SR should be created.
1) Modify DISK_REPAIR_TIME, if required
The DISK_REPAIR_TIME attribute of the disk group controls the maximum acceptable outage duration.
Once one or more disks become unavailable to ASM, it will wait for up to the interval specified for DISK_REPAIR_TIME for the disk(s) to come online.
If the disk(s) come back online within this interval, a resync operation will occur, where only the extents that were modified while the disks were offline are written to the disks once back online.
If the disk(s) do not come back within this interval, ASM will initiate a forced drop of the disk(s), which will trigger a rebalance, in order to restore redundancy using the surviving disks.
Once the disk(s) are back online, they will be added to the diskgroup, with all existing extents on those disks being ignored/discarded, and another rebalance will begin.
In other words, DISK_REPAIR_TIME is the window of time you have to fix the failure.
It is also ASM's countdown timer for dropping the disk(s) that have been taken offline.
The default setting for DISK_REPAIR_TIME is 3.6 hours.
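To make the countdown concrete, here is a minimal Python sketch that turns a DISK_REPAIR_TIME value into the latest time by which the disk must be back online. The function names and the sample timestamp are hypothetical, and the sketch assumes the value carries an `h` or `m` suffix (a bare number is treated as hours).

```python
from datetime import datetime, timedelta


def parse_repair_time(value: str) -> timedelta:
    """Convert a DISK_REPAIR_TIME string such as '3.6h' or '216m' to a timedelta."""
    text = value.strip().lower()
    if text.endswith("m"):
        return timedelta(minutes=float(text[:-1]))
    if text.endswith("h"):
        return timedelta(hours=float(text[:-1]))
    # Assumption: a value without a unit suffix is interpreted as hours.
    return timedelta(hours=float(text))


def drop_deadline(offline_at: datetime, repair_time: str) -> datetime:
    """Return the moment after which ASM would force-drop the offline disk."""
    return offline_at + parse_repair_time(repair_time)


if __name__ == "__main__":
    # Hypothetical time at which the disk went offline.
    offline_at = datetime(2023, 6, 22, 10, 0)
    print("Disk must be back online before:", drop_deadline(offline_at, "3.6h"))
```

If the replacement cannot realistically happen before that deadline, that is the signal to raise DISK_REPAIR_TIME before the forced drop starts.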
2) Check grid disk status
list griddisk attributes name, asmmodestatus, asmdeactivationoutcome
3) Check the cell alert history to see why the disk is showing offline
list alerthistory where severity = 'critical'
4) List all physical disks with status like ".*failure.*"
CellCLI: Release 22.1.7.0.0 - Production on Thu Jun 22 10:53:30 CDT 2023
Copyright © 2007, 2023, Oracle and/or its affiliates.
name: 252:8
deviceId: 18
deviceName: /dev/sdi
diskType: HardDisk
enclosureDeviceId: 252
errOtherCount: 0
luns: 0_8
makeModel: "HGST H7222A525SUN010T"
physicalFirmware: A680
physicalInsertTime: 2018-08-27T13:57:22-05:00
physicalInterface: sas
physicalSerial: R7D85N
physicalSize: 8.91015625T
slotNumber: 8
status: warning - predictive failure
CD_08_cellserver proactive failure
DATAC1_CD_08_cellserver 6.9296875T proactive failure DROPPED
RECOC1_CD_08_cellserver 1.97967529296875T proactive failure DROPPED
** If the physical disk's status is not yet "dropped for replacement", drop it with the CellCLI command ALTER PHYSICALDISK 252:8 DROP FOR REPLACEMENT **
Physical disk 252:8 was dropped for replacement.
Note: The command checks whether it is safe to remove the disk, then prepares it for replacement.
Upon its successful completion, the blue LED on the disk will be turned on.
A rough overview of the checks and actions that the command performs in the background:
- For a data disk, the command checks the ASM partner disks. These partner disks reside in different storage cells. The command will fail if any of the partner disks are offline in ASM.
- For a system disk, it will check if the MD partner disks are in sync. If they are not, the command will fail.
- The griddisks are inactivated. This results in the griddisks being offlined in ASM.
- The celldisk is dropped (from memory only) and the blue LED is turned on. The disk is now ready for physical replacement.
name: 252:8
deviceId: 18
diskType: HardDisk
enclosureDeviceId: 252
errOtherCount: 0
makeModel: "HGST H7222A525SUN010T"
physicalFirmware: A680
physicalInsertTime: 2018-08-27T13:57:22-05:00
physicalInterface: sas
physicalSerial: R7D85N
physicalSize: 8.91015625T
slotNumber: 8
status: warning - predictive failure - dropped for replacement
DATAC1_CD_08_cellserver 6.9296875T not present DROPPED
RECOC1_CD_08_cellserver 1.97967529296875T not present DROPPED
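The DETAIL output above is a simple list of `attribute: value` lines, so the "is it dropped for replacement yet?" check can be scripted. The following is a minimal Python sketch, not an Oracle-provided tool: the function names are hypothetical, and it assumes you pass in only the attribute lines (without the CellCLI banner).

```python
def parse_detail(text: str) -> dict:
    """Parse 'attribute: value' lines from LIST PHYSICALDISK ... DETAIL output."""
    attrs = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as '252:8' stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not attribute lines
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def is_dropped_for_replacement(attrs: dict) -> bool:
    """True when the status text says the disk was dropped for replacement."""
    return "dropped for replacement" in attrs.get("status", "").lower()


# Attribute lines copied from the sample output above.
SAMPLE = """\
name: 252:8
deviceId: 18
makeModel: "HGST H7222A525SUN010T"
physicalInsertTime: 2018-08-27T13:57:22-05:00
slotNumber: 8
status: warning - predictive failure - dropped for replacement
"""

if __name__ == "__main__":
    disk = parse_detail(SAMPLE)
    print(disk["name"], "slot", disk["slotNumber"],
          "ready for replacement:", is_dropped_for_replacement(disk))
```

The name, slot number, and serial parsed this way are also the details to paste into the SR in the next step.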
5) Create the SR if it was not created automatically for the disk replacement, and provide the above disk details
Once the SR is created, Oracle will create an internal task and assign an Oracle Field Engineer to replace the disk.
Schedule a visit with the Oracle Field Engineer and ask them to bring the replacement part.
6) The Oracle FE will replace the hot-swappable disk
7) Once the disk is replaced, validate the rebalance and the disk status
** Validate the status of the disk rebalancing **
SQL> col path format a50
SQL> select group_number,path,header_status,mount_status,name from V$ASM_DISK where path like '%CD_03_abcxcel01';
GROUP_NUMBER PATH HEADER_STATU MOUNT_S NAME
   INST_ID GROUP_NUMBER OPERA STAT      POWER     ACTUAL      SOFAR   EST_WORK   EST_RATE EST_MINUTES ERROR_CODE
---------- ------------ ----- ---- ---------- ---------- ---------- ---------- ---------- ----------- --------------------------------------------
         2            3 REBAL WAIT         10
         1            3 REBAL RUN          10         10       1541       2422       7298           0
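As a rough way to read the rebalance row, the share of work completed can be estimated from the SOFAR and EST_WORK columns. The Python sketch below is only an illustration; the function name is hypothetical, and EST_WORK is itself an estimate that ASM may revise while the rebalance runs.

```python
def rebalance_progress(sofar: int, est_work: int) -> float:
    """Return the estimated percentage of rebalance work completed."""
    if est_work <= 0:
        # No estimate available yet (for example, the operation is still waiting).
        return 0.0
    return 100.0 * sofar / est_work


if __name__ == "__main__":
    # SOFAR and EST_WORK taken from the RUN row in the output above.
    pct = rebalance_progress(1541, 2422)
    print(f"Rebalance is about {pct:.1f}% complete")
```

The rebalance is finished once the operation no longer appears in the view for that disk group.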
** Check physical and grid disk details **
CellCLI: Release 22.1.7.0.0 - Production on Thu Jun 22 11:20:21 CDT 2023
Copyright © 2007, 2023, Oracle and/or its affiliates.
DATAC1_CD_08_cellserver 6.9296875T active ONLINE
RECOC1_CD_08_cellserver 1.97967529296875T active ONLINE
Note: status should be active and asmmodestatus should be ONLINE for every grid disk on the replaced disk.
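This check is easy to automate when several grid disks are involved. The Python sketch below is a hypothetical helper, assuming the listing has the same column order as the sample output (name, size, status, asmmodestatus); it returns the names of grid disks that are not yet active and ONLINE.

```python
def griddisks_not_ready(listing: str) -> list:
    """Return names of grid disks whose last two columns are not 'active ONLINE'."""
    not_ready = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank lines and anything that is not a grid disk row
        name, status, asm_mode = fields[0], fields[-2], fields[-1]
        if status != "active" or asm_mode != "ONLINE":
            not_ready.append(name)
    return not_ready


# Rows copied from the sample outputs before and after the replacement.
BEFORE = """\
DATAC1_CD_08_cellserver 6.9296875T not present DROPPED
RECOC1_CD_08_cellserver 1.97967529296875T not present DROPPED
"""
AFTER = """\
DATAC1_CD_08_cellserver 6.9296875T active ONLINE
RECOC1_CD_08_cellserver 1.97967529296875T active ONLINE
"""

if __name__ == "__main__":
    print("Not ready before replacement:", griddisks_not_ready(BEFORE))
    print("Not ready after replacement:", griddisks_not_ready(AFTER))
```

An empty list after the replacement means every listed grid disk is back in service.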
8) Check the status of all physical disks on all cell nodes to verify that no other disk has failed on any of the cell servers
dcli -l root -g /root/cell_group cellcli -e LIST PHYSICALDISK WHERE diskType=HardDisk AND status not like normal DETAIL
Supporting Oracle MOS documents
How to Replace a Hard Drive in an Exadata Storage Cell Server (Hard Failure) ( Doc ID 1386147.1 )
How to Replace a Hard Drive in an Exadata Storage Cell Server (Predictive Failure) ( Doc ID 1390836.1 )
Things to Check in ASM When Replacing an ONLINE disk from Exadata Storage Cell (Doc ID 1326611.1)