How seriously should I take ECC correctable error warnings?

I have a bunch of Sun X2200-M2 servers. These servers have ECC memory.

On some of these servers, I get warnings in the eLOM about "correctable ECC errors detected", e.g.:

# ssh regress11 ipmitool sel elist
   1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
   2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted

...on some servers more frequently than on others.

The kernel on this particular system is also throwing EDAC errors, although at a much higher frequency than the eLOM is logging ECC events:

EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
MC0: CE page 0x48cb94, offset 0x10, grain 8, syndrome 0xf654, row 5, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC0: extended error code: ECC chipkill x4 error
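One way to see whether kernel reports like these cluster on a single row/channel is to tally the "CE page" lines. A minimal Python sketch, assuming the log lines have exactly the shape shown above; the name `tally_ce` and the regular expression are illustrative only and are not part of EDAC or any existing tool:

```python
import re
from collections import Counter

# Matches only the detailed "CE page ..." lines; the "no information
# available" and "general bus error" lines are deliberately skipped.
CE_LINE = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain \d+, "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

def tally_ce(log_text):
    """Count correctable errors per (controller, row, channel)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CE_LINE.search(line)
        if m:
            key = (int(m["mc"]), int(m["row"]), int(m["channel"]))
            counts[key] += 1
    return counts

# Sample input copied from the kernel log in the question.
SAMPLE = """\
MC0: CE page 0x42a194, offset 0x60, grain 8, syndrome 0xf654, row 4, channel 1, label "": k8_edac
MC0: CE - no information available: k8_edac Error Overflow set
MC0: CE page 0x48cb94, offset 0x10, grain 8, syndrome 0xf654, row 5, channel 1, label "": k8_edac
"""

if __name__ == "__main__":
    for (mc, row, channel), n in sorted(tally_ce(SAMPLE).items()):
        print(f"MC{mc} row {row} channel {channel}: {n} CE")
```

Note that the "Error Overflow set" lines mean some events were reported without details, so a tally like this can only undercount.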

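Now, if the server detects an uncorrectable ECC error, the system resets, so it's pretty clear that this is bad and that removing/replacing the identified stick or pair will correct the problem.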

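But I assume that if the error is correctable, there is no immediate problem. Can I treat this as a warning and be prepared to pull the stick/pair if and when uncorrectable errors occur?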

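It depends on how often you're getting the errors. For various reasons, ECC should be correcting an error about once a year on average. If you're getting them significantly more often than that, or they're multi-bit errors, you should be worried (and I'd replace the RAM ASAP).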
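To put a number on "how often", the SEL output can be grouped per DIMM. A rough Python sketch, assuming the pipe-separated `ipmitool sel elist` layout and the MM/DD/YYYY date format seen in the question's output; the helper names `correctable_by_dimm` and `summarize` are made up for illustration:

```python
from collections import defaultdict
from datetime import datetime

def correctable_by_dimm(sel_text):
    """Group 'Correctable ECC' SEL entries by sensor name (e.g. the DIMM)."""
    events = defaultdict(list)
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Expected fields: id, date, time, sensor, event, direction.
        if len(fields) < 6 or fields[4] != "Correctable ECC":
            continue
        when = datetime.strptime(f"{fields[1]} {fields[2]}",
                                 "%m/%d/%Y %H:%M:%S")
        events[fields[3]].append(when)
    return events

def summarize(events):
    """Return {sensor: (count, first_seen, last_seen)}."""
    return {s: (len(t), min(t), max(t)) for s, t in events.items()}

# Sample input copied from the SEL listing in the question.
SAMPLE = """\
   1 | 05/20/2010 | 14:20:27 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
   2 | 05/20/2010 | 14:33:47 | Memory CPU0 DIMM2 | Correctable ECC | Asserted
"""

if __name__ == "__main__":
    summary = summarize(correctable_by_dimm(SAMPLE))
    for sensor, (count, first, last) in summary.items():
        print(f"{sensor}: {count} correctable events, {first} to {last}")
```

Feeding it the saved output of `ipmitool sel elist` from each server gives a per-DIMM count and time window to compare against the rate you consider acceptable.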

Additionally, ECC isn't perfect. Accumulated errors can get past ECC; this would show up as OS crashes or similar problems.

Source: https://serverfault.com/questions/144151/how-seriously-should-i-take-ecc-correctable-error-warnings
