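Diagnosing Memory Errors on Linux
Some concepts first
DRAM (Dynamic Random Access Memory) is the most common type of system memory. ECC stands for "Error Checking and Correcting"; ECC memory is a memory module that implements this error checking and correcting technology. EDAC stands for Error Detection And Correction.
Memory errors come in two types, CE and UE. CE stands for Correctable Error: the error was corrected and does not, for the time being, affect normal operation of the system, so the module can be replaced at a convenient maintenance window. UE stands for Uncorrectable Error: a memory error that cannot be corrected and usually crashes the machine.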
System messages log
[root@my-host mg4a]# grep kernel /var/log/messages
Jan 14 19:01:11 my-host kernel: mce: [Hardware Error]: Machine check events logged
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socke
[root@my-host mg4a]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:1
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch5_ce_count:0
[root@my-host mg4a]# dmidecode -t 1
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0044, DMI type 1, 27 bytes
System Information
        Manufacturer: LENOVO
        Product Name: Lenovo System x3750 M4 -[8753IH5]-
        Version: 03
        Serial Number: 06FF367
        UUID: C4EF8080-7926-11E5-8B14-6C0B849B418E
        Wake-up Type: Other
        SKU Number: XxXxXxX
        Family: System X
This is the messages log from another machine
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080a13
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008de3b1960
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 19:09:23 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 23:59:21 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 02:15:55 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d9ea5960
Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008d9ea5960
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 03:08:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8ded39960
Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008ded39960
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:45:13 irora30 rhsmd: In order for Subscription Manager to provide your system with updates, enter your Red Hat login to ensure your system is up-t
Jun 28 04:44:25 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 09:34:22 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 10:02:30 irora30 ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=df -hl /var | awk 'NR>1 && int($5)>80' removes=None creates=None chdir=None
Jun 28 14:23:49 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 19:09:25 irora30 auditd[5571]: Audit daemon rotating log files
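The CE lines in this log carry a page number and an offset, and the two combine into the address printed on the ERROR_ADDRESS line. The following is a minimal Python sketch of that arithmetic, not part of the original procedure. It assumes the usual spaced kernel log format with the word `offset`, and 4 KiB pages; the names `CE_LINE`, `PAGE_SHIFT` and `parse_ce_line` are made up for illustration.

```python
import re

# Matches the older EDAC summary line format seen in this log, for example:
# "EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, ..."
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), "
    r"offset 0x(?P<offset>[0-9a-f]+), .*?"
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ce_line(line):
    """Return mc, row, channel and physical address for a CE line, else None."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        # page number shifted up by the page size, plus the offset in the page
        "address": (page << PAGE_SHIFT) | offset,
    }


sample = (
    "Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, "
    'grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac'
)
event = parse_ce_line(sample)
print(event["mc"], event["row"], event["channel"], hex(event["address"]))
```

For the first event in the log this yields 0x8de3b1960, the same value as the ERROR_ADDRESS and MC4_ADDR lines, and the mc/row/channel triple (2, 5, 0) is what to look for under /sys/devices/system/edac/mc.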
Confirming the fault and locating the faulty memory slot
[root@irora30 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow5/ch0_ce_count:0
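[root@irora30 ~]#
count: any line whose value is not 0 means memory errors have been recorded.
mc: the memory controller, here one per CPU node (MC2 is node 2 in the log above).
csrow: the chip-select row on that controller (row 5 in the log above).
ch*: the channel within that row (channel 0 in the log above).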
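The same check can be scripted instead of read by eye. The sketch below is an illustration only: it assumes the legacy csrow-style sysfs layout used by the grep command above, and the helper names `read_ce_counts` and `nonzero` are hypothetical.

```python
import glob
import os
import re

# Extracts the mc, csrow and channel indices from a counter path.
COUNT_PATH = re.compile(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$")


def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {(mc, csrow, ch): count} for every ch*_ce_count file under root."""
    counts = {}
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in glob.glob(pattern):
        m = COUNT_PATH.search(path)
        if m is None:
            continue
        with open(path) as fh:
            key = tuple(int(g) for g in m.groups())
            counts[key] = int(fh.read().strip())
    return counts


def nonzero(counts):
    """Keep only the locations that have recorded corrected errors."""
    return {k: v for k, v in sorted(counts.items()) if v > 0}


if __name__ == "__main__":
    for (mc, csrow, ch), n in nonzero(read_ce_counts()).items():
        print(f"mc{mc} csrow{csrow} ch{ch}: {n} corrected errors")
```

On the machine above, this would be expected to report only mc2 csrow5 ch0, the one location with a nonzero counter.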
Installed memory
Memory Component Status

Proc 1 DIMM 1A     16384 MB         1333 MHz
Proc 1 DIMM 2I     Not installed    Not installed
Proc 1 DIMM 3E     Not installed    Not installed
Proc 1 DIMM 4C     Not installed    Not installed
Proc 1 DIMM 5K     Not installed    Not installed
Proc 1 DIMM 6G     Not installed    Not installed
Proc 1 DIMM 7B     16384 MB         1333 MHz
Proc 1 DIMM 8J     Not installed    Not installed
Proc 1 DIMM 9F     Not installed    Not installed
Proc 1 DIMM 10D    Not installed    Not installed
Proc 1 DIMM 11L    Not installed    Not installed
Proc 1 DIMM 12H    Not installed    Not installed
Proc 2 DIMM 1A     16384 MB         1333 MHz
Proc 2 DIMM 2I     Not installed    Not installed
Proc 2 DIMM 3E     Not installed    Not installed
Proc 2 DIMM 4C     Not installed    Not installed
Proc 2 DIMM 5K     Not installed    Not installed
Proc 2 DIMM 6G     Not installed    Not installed
Proc 2 DIMM 7B     16384 MB         1333 MHz
Proc 2 DIMM 8J     Not installed    Not installed
Proc 2 DIMM 9F     Not installed    Not installed
Proc 2 DIMM 10D    Not installed    Not installed
Proc 2 DIMM 11L    Not installed    Not installed
Proc 2 DIMM 12H    Not installed    Not installed
Proc 3 DIMM 1A     16384 MB         1333 MHz
Proc 3 DIMM 2I     Not installed    Not installed
Proc 3 DIMM 3E     Not installed    Not installed
Proc 3 DIMM 4C     Not installed    Not installed
Proc 3 DIMM 5K     Not installed    Not installed
Proc 3 DIMM 6G     Not installed    Not installed
Proc 3 DIMM 7B     16384 MB         1333 MHz
Proc 3 DIMM 8J     Not installed    Not installed
Proc 3 DIMM 9F     Not installed    Not installed
Proc 3 DIMM 10D    Not installed    Not installed
Proc 3 DIMM 11L    Not installed    Not installed
Proc 3 DIMM 12H    Not installed    Not installed
Proc 4 DIMM 1A     16384 MB         1333 MHz
Proc 4 DIMM 2I     Not installed    Not installed
Proc 4 DIMM 3E     Not installed    Not installed
Proc 4 DIMM 4C     Not installed    Not installed
Proc 4 DIMM 5K     Not installed    Not installed
Proc 4 DIMM 6G     Not installed    Not installed
Proc 4 DIMM 7B     16384 MB         1333 MHz
Proc 4 DIMM 8J     Not installed    Not installed
Proc 4 DIMM 9F     Not installed    Not installed
Proc 4 DIMM 10D    Not installed    Not installed
Proc 4 DIMM 11L    Not installed    Not installed
Proc 4 DIMM 12H    Not installed    Not installed
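Using the edac tools to detect server memory faults
With the spread of virtualization and in-memory stores such as Redis and BDB, more and more servers are configured with large amounts of memory. The Dell R620, for example, has 24 memory slots in a dual-CPU configuration and supports up to 960 GB of memory. Diagnosing faults in error-correcting memory such as ECC and registered (REG) modules is a headache: a faulty module may keep running for months or even years, but with bad luck the machine can go down at any moment. Fortunately Linux provides edac-utils, a diagnostic tool for memory error correction that can be used to check a server for latent memory faults.
The following uses CentOS as an example to introduce the edac-utils tool.
Before using edac-utils, you need to understand the server's hardware architecture. Take the Dell R620 as an example (other models such as the HP DL360p G8 and IBM x3650 M4 also use E5-2600 series CPUs and the C600 series chipset, and are broadly similar). The relationship between its CPU memory controllers, channels and memory slots is as follows.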
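Processor 0 (one memory controller)
Channel 0: memory slots A1, A5 and A9
Channel 1: memory slots A2, A6 and A10
Channel 2: memory slots A3, A7 and A11
Channel 3: memory slots A4, A8 and A12
Processor 1 (one memory controller)
Channel 0: memory slots B1, B5 and B9
Channel 1: memory slots B2, B6 and B10
Channel 2: memory slots B3, B7 and B11
Channel 3: memory slots B4, B8 and B12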
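The slot numbering above follows a regular pattern: within each processor, slot numbers run across the four channels first and then move on to the next DIMM position in each channel. The following Python sketch captures that pattern. It assumes exactly the layout listed above (4 channels per processor, 3 DIMMs per channel, letter A for processor 0 and B for processor 1); the function name `slot_label` is hypothetical, and other server models may number their slots differently.

```python
def slot_label(cpu, channel, dimm, channels_per_cpu=4, dimms_per_channel=3):
    """Map zero-based (cpu, channel, dimm) indices to a slot label like 'A5'.

    Assumes the layout listed above: slots are numbered across the channels
    first, then by DIMM position within the channel.
    """
    if not 0 <= channel < channels_per_cpu:
        raise ValueError(f"channel out of range: {channel}")
    if not 0 <= dimm < dimms_per_channel:
        raise ValueError(f"dimm out of range: {dimm}")
    letter = chr(ord("A") + cpu)  # processor 0 -> 'A', processor 1 -> 'B'
    number = dimm * channels_per_cpu + channel + 1
    return f"{letter}{number}"


# Channel 0 of processor 0 holds A1, A5 and A9, as in the list above.
print([slot_label(0, 0, d) for d in range(3)])
# Channel 3 of processor 1 holds B4, B8 and B12.
print([slot_label(1, 3, d) for d in range(3)])
```

The indices passed in correspond to the CPU_SrcID, Chan and DIMM numbers that appear in the EDAC DIMM labels.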
1. Install the edac-utils tool
yum install -y libsysfs edac-utils
2. Run the check command; the error-correction report looks like this
edac-util -v
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#2: B9
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#2: B10
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#2: B11
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#2: B12
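Where:
mc0 is memory controller 0;
CPU_SrcID#0 is source CPU 0;
Chan#0 is channel 0;
DIMM#0 is memory slot 0;
Corrected Errors is the number of errors corrected so far.
Using the CPU channel and memory slot mapping listed earlier, each line returned by edac-utils can be labeled with its slot.
In this case that gives 6312 corrections on slot A1, 6459 on slot B1 and 535 on slot B3: three DIMMs with latent faults. The next step is to contact the vendor for replacements.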
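Picking out the DIMMs with a nonzero correction count can also be automated. The sketch below is an illustration, not the original author's method. It assumes the per-DIMM line shape `mcN: csrowM: LABEL: K Corrected Errors` as printed by `edac-util -v`, uses sample lines taken from a listing later in this article, and the names `DIMM_LINE` and `corrected_errors` are hypothetical.

```python
import re

# Per-DIMM report line, for example:
# "mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors"
DIMM_LINE = re.compile(
    r"^mc(?P<mc>\d+): csrow(?P<csrow>\d+): (?P<label>\S+): "
    r"(?P<count>\d+) Corrected Errors"
)


def corrected_errors(report):
    """Return {(mc, csrow, label): count} for DIMM lines with count > 0."""
    found = {}
    for line in report.splitlines():
        m = DIMM_LINE.match(line.strip())
        if m is None:
            continue  # summary lines without a DIMM label are skipped
        count = int(m.group("count"))
        if count > 0:
            key = (int(m.group("mc")), int(m.group("csrow")), m.group("label"))
            found[key] = count
    return found


sample = """\
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors
"""
for (mc, csrow, label), count in corrected_errors(sample).items():
    print(f"mc{mc} csrow{csrow} {label}: {count}")
```

Each label that is reported can then be translated to a physical slot using the mapping for the server model in question.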
Mapping for 12 DIMMs
mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6
Mapping for 20 DIMMs
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors  A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors  B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors  C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors  D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors  A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors  B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors  C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors  D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors  A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors  B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors  C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors  D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

4x16 mapping
mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors  8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors  5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors  2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors  7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors  4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors  1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors  6G
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors  3h
Published: 2023-03-09 13:11:44