Diagnosing memory errors with IPMI
Newer Unitrends DPU platforms use IPMI firmware that can log memory errors.
Use IPMI commands to see memory errors in the firmware log.
- Download an updated ipmiutil. Skip this step if ipmiutil-3.0.0 or later is already installed.
- For CentOS 6:
- For CentOS 5:
- Update the RPM package:
rpm -U ipmiutil-3.0.0*.rpm
- Look for any recent memory events:
ipmiutil sel -e
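The version check behind "skip this step if ipmiutil-3.0.0 or later is already installed" can be sketched in shell. This is a minimal sketch, not part of the original procedure: the function names are hypothetical, and the package name ipmiutil is an assumption inferred from the RPM file name above.

```shell
# Return 0 if dotted version $1 is greater than or equal to dotted version $2.
# Only numeric, dot-separated fields are handled (for example 3.0.0).
version_at_least() {
    have=$1 want=$2
    while [ -n "$have" ] || [ -n "$want" ]; do
        # Take the leading field of each version string.
        h=${have%%.*}; w=${want%%.*}
        # Drop the field just read; an exhausted string becomes empty.
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $want in *.*) want=${want#*.} ;; *) want= ;; esac
        # Missing fields count as 0, so 3.0 equals 3.0.0.
        h=${h:-0}; w=${w:-0}
        [ "$h" -gt "$w" ] && return 0
        [ "$h" -lt "$w" ] && return 1
    done
    return 0
}

# Hypothetical helper: succeeds when the installed ipmiutil package
# (assumed package name) is already at 3.0.0 or later.
ipmiutil_is_current() {
    installed=$(rpm -q --qf '%{VERSION}' ipmiutil 2>/dev/null) || return 1
    version_at_least "$installed" 3.0.0
}
```

Running `ipmiutil_is_current && echo "update not needed"` before the download step would report whether the update can be skipped.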
Below is sample output of a CPLD error, which is usually caused by a memory fault.
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
000a 04/10/13 15:03:41 CRT BMC #ff CPLD CATERR Asserted 6f [a0 1c ff]
Below is sample output of a memory ECC error. In this case, run an offline memory test with a minimum of four clean passes.
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
7840 08/09/11 15:10:47 MIN BMC Memory #08 Uncorrectable ECC, DIMM6/CPU1 6f [20 ff 10]
The DIMM identification should be more accurate and easier to interpret in ipmiutil 3.0.0, as shown below. This error is typically not a memory fault but rather bad data being passed to memory. Review the operating system logs (messages), dmesg output, and other application logs (/usr/bp/logs.dir) to determine the source of these errors.
ipmiutil ver 3.00
ievents version 3.00
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
7840 08/09/11 15:10:47 MIN BMC Memory #08 Correctable ECC, P1_DIMMF1 6f [20 ff 50]
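The log review mentioned above can be started with a simple keyword search. The following is a rough sketch only: the function name is hypothetical, the search terms (EDAC, ECC, machine check, mce) are assumptions about how a Linux kernel typically reports memory problems, and /var/log/messages is assumed to be the location of the messages log.

```shell
# Print lines, with line numbers, that mention common memory-error keywords.
# Reads the named files, or standard input when no file is given.
search_memory_hints() {
    grep -inE 'edac|ecc|machine check|mce:' "$@"
}

# Possible usage (not run here):
#   search_memory_hints /var/log/messages
#   dmesg | search_memory_hints
#   search_memory_hints -r /usr/bp/logs.dir
```

The pattern is deliberately broad, so expect some unrelated matches; narrow it once the source of the errors becomes clearer.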
CPLD events are not DIMM-specific. An ECC error event, however, may identify the faulty DIMM; in that case, replace the specified DIMM.
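On a system with a long event log, the distinction between CPLD and ECC events can be summarized with a small filter. This is a sketch that assumes the record layout matches the sample output in this article; the function name is hypothetical and other firmware versions may format events differently.

```shell
# Summarize memory-related SEL records read from standard input.
# Prints one "event-type<TAB>DIMM-label" line per matching record.
summarize_memory_events() {
    while IFS= read -r line; do
        case $line in
            *CPLD*)
                # CPLD events do not identify a DIMM.
                printf 'CPLD\t-\n'
                continue ;;
            *"Uncorrectable ECC"*) kind="Uncorrectable ECC" ;;
            *"Correctable ECC"*)   kind="Correctable ECC" ;;
            *)
                # Header lines and unrelated events are skipped.
                continue ;;
        esac
        # The DIMM label is taken as the token that follows "ECC, ".
        dimm=$(printf '%s\n' "$line" | sed -n 's/.*ECC, \([^ ]*\).*/\1/p')
        printf '%s\t%s\n' "$kind" "${dimm:-unknown}"
    done
}

# Possible usage (not run here):
#   ipmiutil sel -e | summarize_memory_events
```

A CPLD line in the summary carries no DIMM label, while each ECC line names the DIMM reported by the event.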
The BIOS detects a memory error, either with ECC or with CPLD, and logs it to the IPMI firmware system event log (SEL).
See http://ipmiutil.sourceforge.net for a UserGuide and other files.
For more information, see Using IPMI LAN for remote access