Saturday, 6 January 2024

HALRT-02007: Database Node Hard Disk Failure (System Hard Disk of Size 1.2TB in Slot X Failed) in Exadata

Terminology:

FIRMWARE: Firmware is software that provides basic machine instructions that allow the hardware to function and communicate with other software running on a device. (Reference: FAQs About System Firmware, oracle.com)

Hot-Swappable Devices: 

Hot-swappable devices are those devices that can be removed and installed while the server is running without requiring any administrative tasks (for example, fan modules and power supplies).

Compute/database node hard disks are configured in a RAID 5 array.
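The configured RAID level can be confirmed from the controller itself. A minimal sketch, assuming the same MegaCli-legacy syntax used by the commands in this post, and that the "RAID Level" output wording matches your controller firmware:

```shell
# Query the RAID level of the virtual drive(s) on adapter 0 (run as root).
STORCLI=/opt/MegaRAID/storcli/storcli64
[ -x "$STORCLI" ] && "$STORCLI" -LDInfo -Lall -a0 | grep -i "RAID Level" \
  || echo "storcli64 not present on this host"

# Parsing sketch against a sample output line (the exact wording is an
# assumption and may differ between firmware revisions):
lvl='RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3'
echo "$lvl" | grep -q 'Primary-5' && echo "virtual drive is RAID 5"
```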

Disk Status:

OPTIMAL--> The RAID array is running correctly, without errors.

DEGRADED--> A hard drive in the RAID array is corrupted or damaged; the system continues to function, but with reduced redundancy and performance.

FAILED--> The RAID array has a serious problem; the affected drive is no longer usable.

PREDICTIVE FAILURE--> The hard disk is expected to fail soon and should be replaced at the earliest opportunity. The predictive failure count reflects accumulating media errors (for example, bad sectors) reported by the drive.

 A) Pre-Replacement

    1) When a system disk fails on an Exadata compute node, an alert specific to the failure is raised and a Service Request (SR) is created automatically, for example:

SR 3-12345678: HALRT-02007: Database node hard disk failure.

Once the SR is created, Oracle opens an internal task and assigns an Oracle Field Engineer to replace the disk.

We need to schedule a visit with the Oracle Field Engineer and confirm that they will bring the replacement part.

    2) Check the current cache policy with the following command. The Current Cache Policy should be WriteBack, not WriteThrough:

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -ldpdinfo -a0 | grep -i "cache policy"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy: Disabled
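If the Current Cache Policy shows WriteThrough instead, a common cause is a failed or recharging BBU (note the "No Write Cache if Bad BBU" clause above); investigate the battery before proceeding. A minimal decision sketch, assuming the policy line wording shown above:

```shell
# Pull the live policy line (run as root); falls through quietly when
# storcli64 is not installed on the current host.
STORCLI=/opt/MegaRAID/storcli/storcli64
[ -x "$STORCLI" ] && "$STORCLI" -ldpdinfo -a0 | grep -i "Current Cache Policy" || true

# Decision sketch on the sample policy line from the output above:
line='Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU'
case "$line" in
  *WriteThrough*) echo "WriteThrough - check the BBU before disk replacement" ;;
  *WriteBack*)    echo "cache policy OK" ;;
esac
```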

    3) Query the MegaRAID card to get the enclosure ID:

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -encinfo -a0 | grep ID
Device ID: 252

     4) Check for the failed disk

Note: "Failed" or "Unconfigured(bad)" is the expected state for the faulted disk.
In this example, it is located in physical slot 2.

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -pdlist -a0 | grep -iE "slot|firmware"

Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Slot Number: 2
Firmware state: Failed
Device Firmware Level: ORAB
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: ORAB

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -pdlist -a0 | grep -iE "slot|predictive|firmware"

Slot Number: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Slot Number: 1
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Slot Number: 2
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Failed
Device Firmware Level: ORAB
Slot Number: 3
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB

     5) Verify that the virtual drive state is Optimal or Degraded, with the remaining good disk(s) Online, before hot-swap removing the failed disk:

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -LdPdInfo -a0 | grep -iE "target|state|slot"
Virtual Drive: 0 (Target Id: 0)
State: Degraded
Slot Number: 0
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 1
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 2
Firmware state: Failed
Foreign State: None
Slot Number: 3
Firmware state: Online, Spun Up
Foreign State: None

    6) If the disk's service LED is not lit, use the locate command to identify the failed disk:

# /opt/MegaRAID/storcli/storcli64 -PdLocate -start -physdrv[E#:S#] -a0

where E# is the enclosure ID identified in step 3, and S# is the slot number of the failed disk identified in step 4.

In the example above, the command would be:

/opt/MegaRAID/storcli/storcli64 -PdLocate -start -physdrv[252:2] -a0
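Once the disk has been identified, the locator LED can be turned off again with the matching -stop form (an assumption mirroring the -start syntax above; verify against your storcli64 build):

```shell
# Stop blinking the locator LED on enclosure 252, slot 2 (run as root).
STORCLI=/opt/MegaRAID/storcli/storcli64
[ -x "$STORCLI" ] && "$STORCLI" -PdLocate -stop -physdrv[252:2] -a0 || true
```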

 B) Replace Disk

    The Oracle Field Engineer will visit the data center and replace the disk.

 C) Post-Replacement

    1) Check the disk status after it has been physically replaced; the new disk should show a firmware state of Rebuild:

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -LdPdInfo -a0 | grep -iE "target|state|slot"
Virtual Drive: 0 (Target Id: 0)
State: Degraded
Slot Number: 0
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 1
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 2
Firmware state: Rebuild
Foreign State: None
Slot Number: 3
Firmware state: Online, Spun Up
Foreign State: None

        [root@servername ~]# /opt/MegaRAID/storcli/storcli64 -pdlist -a0 | grep -iE "slot|firmware|target|state|predictive"

Slot Number: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Foreign State: None
Slot Number: 1
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Foreign State: None
Slot Number: 2
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Rebuild
Device Firmware Level: ORAB
Foreign State: None
Slot Number: 3
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Foreign State: None

    2) Check the full details of the replaced disk:

     [root@servername ~]# /opt/MegaRAID/storcli/storcli64 -PdInfo -physdrv[252:2] -a0
Enclosure Device ID: 252
Slot Number: 2
Drive's position: DiskGroup: 0, Span: 0, Arm: 2
Enclosure position: 0
Device Id: 4
WWN: 5111E567D934A118
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.090 TB [0x8bba0cb0 Sectors]
Non Coerced Size: 1.090 TB [0x8baa0cb0 Sectors]
Coerced Size: 1.089 TB [0x8b94f800 Sectors]
Logical Sector Size: 512
Physical Sector Size:  512
Firmware state: Rebuild
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: ORAB
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5111E567d934a119
SAS Address(1): 0x0
Connected Port Number: 9(path0)
Inquiry Data: SEAGATE ST1307IN9SUN1.2TORAB2215LC2M5A
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 12.0Gb/s
Link Speed: 12.0Gb/s
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :30C (86.00 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 12.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: 12.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
Exit Code: 0x00

    3) Monitor disk rebuild progress

        [root@servername ~]# /opt/MegaRAID/storcli/storcli64 -pdrbld -showprog -physdrv [252:2] -a0
Rebuild Progress on Device at Enclosure 252, Slot 2 Completed 21% in 2 Minutes.
Estimated time left is 2 Hours 38 Minutes.
Exit Code: 0x00
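Rather than re-running the command by hand, the progress check can be wrapped in a small polling loop. This is a sketch; the 5-minute interval and the message texts matched below are assumptions based on the outputs shown in this post:

```shell
# Poll rebuild progress every 5 minutes until the controller reports the
# rebuild is finished (run as root).
STORCLI=/opt/MegaRAID/storcli/storcli64
if [ -x "$STORCLI" ]; then
  while :; do
    out=$("$STORCLI" -pdrbld -showprog -physdrv [252:2] -a0)
    echo "$out" | grep -q "not in rebuild process" && break
    echo "$out" | grep -o '[0-9]\+%'      # current completion percentage
    sleep 300
  done
fi

# Percentage extraction, demonstrated on the progress line shown above:
sample='Rebuild Progress on Device at Enclosure 252, Slot 2 Completed 21% in 2 Minutes.'
pct=$(echo "$sample" | grep -o '[0-9]\+%' | tr -d '%')
echo "$pct"    # prints 21
```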

    4) Output after completion of disk rebuild 

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -pdrbld -showprog -physdrv [252:2] -a0
Device(Encl-252 Slot-2) is not in rebuild process
Exit Code: 0x00

    5) Check the disk status; it should now be "Online" for all slots, and the virtual drive state should be "Optimal":

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -LdPdInfo -a0 | grep -iE "target|state|slot"
Virtual Drive: 0 (Target Id: 0)
State: Optimal
Slot Number: 0
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 1
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 2
Firmware state: Online, Spun Up
Foreign State: None
Slot Number: 3
Firmware state: Online, Spun Up
Foreign State: None

[root@servername ~]# /opt/MegaRAID/storcli/storcli64 -pdlist -a0 | grep -iE "slot|firmware|target|state|predictive"

Slot Number: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Foreign State: None
Slot Number: 1
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Foreign State: None
Slot Number: 2
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Foreign State: None
Slot Number: 3
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: ORAB
Foreign State: None
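The per-slot checks above can be condensed into a quick pass/fail count: the number of physical drives whose firmware state is anything other than Online should be zero on a healthy node. A sketch using the message wording from the outputs in this post:

```shell
# Count physical drives not reporting "Online" (run as root); expect 0.
STORCLI=/opt/MegaRAID/storcli/storcli64
[ -x "$STORCLI" ] \
  && echo "disks not Online: $("$STORCLI" -pdlist -a0 | grep -i 'Firmware state' | grep -vic 'Online')" \
  || true

# The same count on sample state lines, one disk still rebuilding:
states='Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Rebuild
Firmware state: Online, Spun Up'
echo "$states" | grep -vic "Online"    # prints 1
```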

Useful Oracle MOS Doc IDs:

NOTE:1967510.1 - How to Replace an Exadata X4-8/X5-2 (or later) Compute Node Server HDD (Predictive or Hard Failure)

NOTE:1360343.1 - INTERNAL Exadata Database Machine Hardware Current Product Issues

NOTE:1360360.1 - INTERNAL Exadata Database Machine Hardware Troubleshooting

NOTE:1416303.1 - How to identify which Exadata disk FRU part number to order, based on image, vendor and mixed disk support status

NOTE:1113034.1 - HALRT-02007: Database node hard disk failure

NOTE:1113014.1 - HALRT-02008: Database node hard disk predictive failure

NOTE:1084360.1 - Bare Metal Restore Procedure for Compute Nodes on an Exadata Environment

NOTE:1071220.1 - Oracle Sun Database Machine V2 Diagnosability and Troubleshooting Best Practices

NOTE:1452325.1 - Determining when Disks should be replaced on Oracle Exadata Database Machine

NOTE:1274324.1 - Oracle Sun Database Machine X2-2/X2-8 Diagnosability and Troubleshooting Best Practices

