Memory Faults

Updated: 2023-03-09 13:11:45

Published March 9, 2023 (author: 客服话术大全)

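Diagnosing Linux Memory Errors

Some concepts first

DRAM (Dynamic Random Access Memory) is dynamic random-access memory, the most common type of system memory. ECC stands for "Error Checking and Correcting". ECC memory is a memory module that implements error checking and correcting (ECC) technology. EDAC stands for Error Detection And Correction.

Memory errors come in two types, CE and UE. CE is short for Correctable Error and UE for Uncorrectable Error. A CE is a recoverable error that does not affect normal system operation for the time being; the module can be swapped out at a convenient maintenance window. A UE is an unrecoverable memory error and usually brings the machine down.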
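The CE/UE distinction can be checked against the kernel's EDAC counters. The following is a minimal sketch, assuming the per-controller sysfs files `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mcN/`; the function name `edac_status` and the OK/WARNING/CRITICAL labels are illustrative choices, not part of any tool.

```shell
#!/bin/sh
# edac_status [ROOT]: print one line per memory controller with its CE/UE totals.
# ROOT defaults to the kernel's EDAC sysfs directory; another directory can be
# passed in for testing. The state labels are hypothetical, for illustration only.
edac_status() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc*; do
        # Skip when the glob did not match or the counters are missing.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count")
        if [ "$ue" -gt 0 ]; then
            state="CRITICAL"   # uncorrectable errors were recorded
        elif [ "$ce" -gt 0 ]; then
            state="WARNING"    # only correctable errors so far
        else
            state="OK"
        fi
        echo "${mc##*/} ce=$ce ue=$ue $state"
    done
}
```

Calling `edac_status` with no argument reads the live counters; on a machine without an EDAC driver loaded it simply prints nothing.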

System messages log

[root@my-host mg4a]# grep kernel /var/log/messages
Jan 14 19:01:11 my-host kernel: mce: [Hardware Error]: Machine check events logged
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socke

[root@my-host mg4a]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:1
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch5_ce_count:0

[root@my-host mg4a]# dmidecode -t 1
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0044, DMI type 1, 27 bytes
System Information
    Manufacturer: LENOVO
    Product Name: Lenovo System x3750 M4 -[8753IH5]-
    Version: 03
    Serial Number: 06FF367
    UUID: C4EF8080-7926-11E5-8B14-6C0B849B418E
    Wake-up Type: Other
    SKU Number: XxXxXxX
    Family: System X

Messages log from another machine

Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080a13
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008de3b1960
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 19:09:23 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 23:59:21 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 02:15:55 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d9ea5960
Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008d9ea5960
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 03:08:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8ded39960
Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008ded39960
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:45:13 irora30 rhsmd: In order for Subscription Manager to provide your system with updates, enter your Red Hat login to ensure your system is up-t
Jun 28 04:44:25 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 09:34:22 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 10:02:30 irora30 ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=df -hl /var | awk 'NR>1 && int($5)>80' removes=None creates=None chdir=None
Jun 28 14:23:49 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 19:09:25 irora30 auditd[5571]: Audit daemon rotating log files
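A log like this can be summarized by counting the corrected-error events per memory controller. The sketch below assumes the usual syslog layout with spaces, in which each event produces a line containing `EDAC MCn: ` followed by ` CE `; the function name `count_ce_events` is made up for this example, and it counts log lines rather than summing the per-event error counts.

```shell
#!/bin/sh
# count_ce_events [FILE...]: count "EDAC MCn: ... CE ..." lines per controller.
# Reads the named files, or standard input when no file is given.
count_ce_events() {
    awk '
        match($0, /EDAC MC[0-9]+: /) && index($0, " CE ") {
            # The match looks like "EDAC MC2: "; keep only the "MC2" part.
            mc = substr($0, RSTART + 5, RLENGTH - 7)
            events[mc]++
        }
        END { for (mc in events) print mc, events[mc] }
    ' "$@" | sort
}
```

A typical call would be `count_ce_events /var/log/messages`. Lines such as `EDAC amd64 MC2: CE ERROR_ADDRESS=` are deliberately not matched, so each event is counted once.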

Confirming the fault and locating the faulty memory slot

[root@irora30 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow5/ch0_ce_count:0
[root@irora30 ~]#

count: a line whose value is not 0 means memory errors have been recorded there.
mc: the memory controller index; here each controller corresponds to one CPU (node).
csrow: the chip-select row under that controller, read here as the memory channel.
ch*: the channel index within the csrow, read here as the position of the DIMM within the channel.

In this output the errors sit on mc2/csrow5/ch0, with 294 corrected errors, which matches the "MC2 ... row 5, channel 0" entries in the log.
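With many controllers the zero lines drown out the interesting one. As a rough sketch, the same sysfs files can be scanned and only the non-zero counters printed, split into their mc, csrow and channel components; the function name `nonzero_ce` and the output layout are assumptions made for this example.

```shell
#!/bin/sh
# nonzero_ce [ROOT]: list every chN_ce_count file whose value is greater than 0.
# ROOT defaults to the EDAC sysfs directory used in the grep command.
nonzero_ce() {
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue          # glob did not match anything
        n=$(cat "$f")
        [ "$n" -gt 0 ] || continue       # keep only counters that moved
        # Split the path into its mc / csrow / channel components.
        ch=${f##*/}; ch=${ch%_ce_count}
        dir=${f%/*}; csrow=${dir##*/}
        dir=${dir%/*}; mc=${dir##*/}
        echo "$mc $csrow $ch $n"
    done
}
```

Run without arguments on the affected host, this would be expected to print a single line for the mc2/csrow5/ch0 counter.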

Installed memory

Memory Component Status

Proc 1 DIMM 1A     16384 MB        1333 MHz
Proc 1 DIMM 2I     Not installed   Not installed
Proc 1 DIMM 3E     Not installed   Not installed
Proc 1 DIMM 4C     Not installed   Not installed
Proc 1 DIMM 5K     Not installed   Not installed
Proc 1 DIMM 6G     Not installed   Not installed
Proc 1 DIMM 7B     16384 MB        1333 MHz
Proc 1 DIMM 8J     Not installed   Not installed
Proc 1 DIMM 9F     Not installed   Not installed
Proc 1 DIMM 10D    Not installed   Not installed
Proc 1 DIMM 11L    Not installed   Not installed
Proc 1 DIMM 12H    Not installed   Not installed
Proc 2 DIMM 1A     16384 MB        1333 MHz
Proc 2 DIMM 2I     Not installed   Not installed
Proc 2 DIMM 3E     Not installed   Not installed
Proc 2 DIMM 4C     Not installed   Not installed
Proc 2 DIMM 5K     Not installed   Not installed
Proc 2 DIMM 6G     Not installed   Not installed
Proc 2 DIMM 7B     16384 MB        1333 MHz
Proc 2 DIMM 8J     Not installed   Not installed
Proc 2 DIMM 9F     Not installed   Not installed
Proc 2 DIMM 10D    Not installed   Not installed
Proc 2 DIMM 11L    Not installed   Not installed
Proc 2 DIMM 12H    Not installed   Not installed
Proc 3 DIMM 1A     16384 MB        1333 MHz
Proc 3 DIMM 2I     Not installed   Not installed
Proc 3 DIMM 3E     Not installed   Not installed
Proc 3 DIMM 4C     Not installed   Not installed
Proc 3 DIMM 5K     Not installed   Not installed
Proc 3 DIMM 6G     Not installed   Not installed
Proc 3 DIMM 7B     16384 MB        1333 MHz
Proc 3 DIMM 8J     Not installed   Not installed
Proc 3 DIMM 9F     Not installed   Not installed
Proc 3 DIMM 10D    Not installed   Not installed
Proc 3 DIMM 11L    Not installed   Not installed
Proc 3 DIMM 12H    Not installed   Not installed
Proc 4 DIMM 1A     16384 MB        1333 MHz
Proc 4 DIMM 2I     Not installed   Not installed
Proc 4 DIMM 3E     Not installed   Not installed
Proc 4 DIMM 4C     Not installed   Not installed
Proc 4 DIMM 5K     Not installed   Not installed
Proc 4 DIMM 6G     Not installed   Not installed
Proc 4 DIMM 7B     16384 MB        1333 MHz
Proc 4 DIMM 8J     Not installed   Not installed
Proc 4 DIMM 9F     Not installed   Not installed
Proc 4 DIMM 10D    Not installed   Not installed
Proc 4 DIMM 11L    Not installed   Not installed
Proc 4 DIMM 12H    Not installed   Not installed

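Using the edac tools to detect server memory faults

With the spread of virtualization and of in-memory databases such as Redis and BDB, more and more servers are fitted with large amounts of memory. Take the DELL R620: in a dual-CPU configuration its 24 memory slots support up to 960GB. Detecting faults in memory with error correction, such as ECC and REG (registered) modules, is a real headache: a failing module can keep running for months or even years, but with bad luck it can die at any moment. Fortunately Linux provides edac-utils, a diagnostic tool for memory error correction that can be used to check a server for latent memory faults.

The following uses CentOS as an example to introduce the edac-utils tool.

Before using edac-utils you need to understand the server's hardware architecture. Taking the DELL R620 as an example (other models such as the HP DL360p G8 and IBM X3650 M4 also use E5-2600 series CPUs and the C600 series chipset, so they are broadly the same), the mapping between each CPU memory controller's channels and the memory slots is as follows.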

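Processor 0 (corresponds to one memory controller)
Channel 0: memory slots A1, A5 and A9
Channel 1: memory slots A2, A6 and A10
Channel 2: memory slots A3, A7 and A11
Channel 3: memory slots A4, A8 and A12

Processor 1 (corresponds to one memory controller)
Channel 0: memory slots B1, B5 and B9
Channel 1: memory slots B2, B6 and B10
Channel 2: memory slots B3, B7 and B11
Channel 3: memory slots B4, B8 and B12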
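The slot table follows a simple rule: with four channels per processor, the slot number is the channel index plus one, plus four for each further position along the channel. A small sketch of that rule, valid only for the R620 layout listed here; the function name `slot_label` and its argument order (processor, channel, position in channel, all counted from 0) are assumptions for illustration.

```shell
#!/bin/sh
# slot_label CPU CHANNEL POSITION: print the R620 slot name, for example A5.
# All three arguments are zero-based indices.
slot_label() {
    cpu=$1
    chan=$2
    pos=$3
    # Processor 0 owns bank A, processor 1 owns bank B.
    case "$cpu" in
        0) bank=A ;;
        1) bank=B ;;
        *) echo "unknown processor: $cpu" >&2; return 1 ;;
    esac
    # Four channels per processor: slot = channel + 1 + 4 * position.
    echo "${bank}$(( $chan + 1 + 4 * $pos ))"
}
```

For instance, `slot_label 0 0 1` gives the second slot on channel 0 of processor 0, and `slot_label 1 3 2` gives the last slot on channel 3 of processor 1.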

1. Install the edac-utils tool

yum install -y libsysfs edac-utils

2. Run the check command; it reports the correction counts as follows

edac-util -v

mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B9
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B10
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B11
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B12

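Where:

mc0 denotes memory controller 0;
CPU_SrcID#0 denotes source CPU 0;
Chan#0 (Channel#0 in some driver versions) denotes channel 0;
DIMM#0 denotes memory slot 0;
Corrected Errors is the number of errors that have already been corrected.

Using the CPU channel and memory slot mapping listed earlier, each line returned by edac-utils can be labeled with its slot.

This gives 6312 corrected errors on slot A1, 6459 on slot B1 and 535 on slot B3. Three memory modules show latent faults; the next step is to contact the vendor to have them replaced.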
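The labeling step can be automated. The sketch below is a hedged example that assumes `edac-util -v` prints lines of the form `mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 6312 Corrected Errors` and that the R620 channel-to-slot table applies; the function name `label_errors` is hypothetical, and other boards need a different mapping.

```shell
#!/bin/sh
# label_errors [FILE...]: read edac-util -v style output and print
# "SLOT COUNT" for every DIMM line with a non-zero corrected-error count.
# Slot names follow the R620 table: bank A/B per processor, four channels each.
label_errors() {
    awk '
        match($0, /[0-9]+ Corrected Errors/) {
            count = substr($0, RSTART, RLENGTH) + 0
            if (count == 0) next
            # Lines without a DIMM label cannot be mapped to a slot.
            if (!match($0, /SrcID#[0-9]+_Ha#[0-9]+_Chan#[0-9]+_DIMM#[0-9]+/)) next
            # part[] becomes: SrcID, cpu, Ha, ha, Chan, chan, DIMM, dimm
            split(substr($0, RSTART, RLENGTH), part, /[#_]/)
            cpu = part[2]; chan = part[6]; dimm = part[8]
            bank = substr("AB", cpu + 1, 1)   # empty for processors beyond 1
            print bank (chan + 1 + 4 * dimm), count
        }
    ' "$@"
}
```

It is meant to be used as `edac-util -v | label_errors`, which turns the raw report into a short list of slots worth replacing.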

Mapping for 12 DIMMs

mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6

Mapping for 20 DIMMs

mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors    A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors    B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors    C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors    D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors    A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors    B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors    C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors    D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors    A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors   B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors    C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors    D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

4x16 mapping

mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors    8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors    5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors    2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors    7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors    4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors    1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors    6G
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors    3h
