Diagnosing memory errors with IPMI

SUMMARY

Use ipmiutil to read memory errors from the IPMI firmware system event log (SEL) on Unitrends DPU platforms.

ISSUE

Newer Unitrends DPU platforms use IPMI firmware that can log memory errors, for example:

    • Recovery-712
    • Recovery-713
    • Recovery-813
    • Recovery-822
    • Recovery-823
    • Recovery-833-100
    • Recovery-833-200
    • Recovery-943

Use IPMI commands to see memory errors in the firmware log.

RESOLUTION

  1. Download an updated ipmiutil. Skip this step if ipmiutil-3.0.0 or later is already installed.
    • For CentOS 6:
      wget ftp://ftp.unitrends.com/support/Hotfixes/ipmiutil-3.0.0-1_el6.x86_64.rpm
    • For CentOS 5:
      wget ftp://ftp.unitrends.com/support/Hotfixes/ipmiutil-3.0.0-1_el5.x86_64.rpm
  2. Update the RPM package:
    rpm -U ipmiutil-3.0.0*.rpm
  3. Look for any recent memory events:
    ipmiutil sel -e
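
Step 1 can be skipped when ipmiutil 3.0.0 or later is already installed. The following is a minimal sketch of that check, not part of the original procedure. The helper name version_at_least is hypothetical, and the usage comment assumes the package is installed under the RPM name "ipmiutil".

```shell
# Hypothetical helper: succeed (exit 0) when dotted version $1 >= $2.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, "."); m = split(want, w, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = (i <= n) ? h[i] + 0 : 0   # missing fields count as 0
            b = (i <= m) ? w[i] + 0 : 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0                            # versions are equal
    }'
}

# Example use on the appliance (assumes the RPM is named "ipmiutil"):
#   installed=$(rpm -q --qf '%{VERSION}' ipmiutil) &&
#     version_at_least "$installed" 3.0.0 &&
#     echo "ipmiutil $installed is current; skip the download"
```

If rpm reports that the package is not installed, the chain stops and the download in step 1 is still needed.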
    

Below is sample output of a CPLD error, which is usually caused by a memory fault.
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
000a 04/10/13 15:03:41 CRT BMC   #ff CPLD CATERR Asserted 6f [a0 1c ff]
 

Below is sample output of a memory ECC error. When this event appears, run an offline memory test and confirm a minimum of four clean passes.

RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
7840 08/09/11 15:10:47 MIN BMC  Memory #08 Uncorrectable ECC, DIMM6/CPU1 6f [20 ff 10]
 

The DIMM identifier should be more accurate and easier to interpret in ipmiutil 3.0.0, as shown below. The error shown here is typically not a memory fault but rather bad data being passed to memory. Review the operating system logs (messages), dmesg, and other application logs (/usr/bp/logs.dir) to determine the source of these errors.

ipmiutil ver 3.00
ievents version 3.00
RecId Date/Time_______ SEV Src_ Evt_Type___ Sens# Evt_detail - Trig [Evt_data]
7840 08/09/11 15:10:47 MIN BMC  Memory #08 Correctable ECC, P1_DIMMF1 6f [20 ff 50]
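
The log review suggested above (messages, dmesg, and /usr/bp/logs.dir) can be narrowed with a keyword filter. This is only a sketch: the function name memory_log_lines is hypothetical, and the keyword list is an assumption rather than something this article specifies, so adjust it to what the logs actually contain.

```shell
# Hypothetical filter: keep log lines that mention common memory-error
# keywords as whole words. The keyword list is an assumption; extend it
# as needed.
memory_log_lines() {
    grep -iwE 'ecc|edac|mce|machine check'
}

# Example use on the appliance:
#   dmesg | memory_log_lines
#   memory_log_lines < /var/log/messages
#   grep -riwE 'ecc|edac|mce|machine check' /usr/bp/logs.dir
```

Whole-word matching (-w) keeps unrelated words that merely contain one of the keywords out of the results.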
 

CPLD events are not DIMM-specific. An ECC error event, however, may identify the faulty DIMM; if it does, replace the specified DIMM.
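
The guidance above can be sketched as a small filter over the SEL output. The function name triage_sel_events is hypothetical, and the parsing assumes event lines shaped like the samples in this article (record ID first, DIMM label right after "ECC, "); other firmware may format events differently.

```shell
# Hypothetical filter: read `ipmiutil sel -e` output on stdin and print one
# suggested action per memory-related event, following the guidance above.
triage_sel_events() {
    awk '
        /CPLD/ {
            print $1 ": CPLD error, not DIMM-specific; suspect a memory fault"
            next
        }
        /Uncorrectable ECC/ {
            print $1 ": uncorrectable ECC on " dimm($0) "; run an offline memory test"
            next
        }
        /Correctable ECC/ {
            print $1 ": correctable ECC on " dimm($0) "; review OS and application logs"
        }
        # Return the word that follows "ECC, ", e.g. DIMM6/CPU1.
        function dimm(line,    s) {
            s = line
            sub(/.*ECC, /, "", s)
            sub(/ .*/, "", s)
            return s
        }
    '
}

# Example use on the appliance:
#   ipmiutil sel -e | triage_sel_events
```

Header lines and unrelated events match none of the patterns, so they are silently dropped.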

CAUSE

The BIOS detects a memory error, either through ECC or through CPLD, and logs it to the IPMI firmware system event log (SEL).

NOTES

See http://ipmiutil.sourceforge.net for a UserGuide and other files.
For more information, see Using IPMI LAN for remote access 