A Summary of Troubleshooting a Linux Kernel Crash


1. Overview
One day the Linux kernels on several of a project's online distributed file system servers crashed, seriously affecting the project's ability to serve external traffic and causing quite a stir inside the company. By investigating the online problem we basically determined that the failures were caused by Linux kernel panics, and over two phases of troubleshooting we essentially pinned down the cause of those panics. The main investigative methods were searching for material online, analyzing the kernel error logs, and constructing conditions to reproduce the problem. This document is my summary of the whole troubleshooting process.
2. Phase One
When the problem first appeared everything was urgent: everyone worked overtime late into the night, and we drew up many plans for reproducing the problem and locating the cause. In phase one I spent two solid weeks analyzing the cause. Since almost everything I did in that phase was written up in detailed analysis documents, below I mainly summarize what measures I took in phase one and what results they produced.
I broke phase one into several steps. Naturally the first idea was to reproduce what had happened online, so I started by reading the kernel error logs from the production servers. According to the logs, the kernel panics were mainly triggered by the qmgr and master processes (at least that is what the log messages suggested). Combined with the symptoms at the time, the load was quite high and the servers could not serve external requests. So my first thought was to write a program that kept sending mail (qmgr and master are both processes related to mail delivery). When the program ran I got a little excited: the load climbed and the server's responsiveness degraded (ssh logins slowed down), which was close to what we had seen online. But the kernel never panicked no matter how high the sending rate was, and there were no kernel error messages either, so this cause was gradually ruled out.
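For reference, below is a minimal sketch of the kind of mail-flood generator described above. This is my own reconstruction rather than the original test program, and the host (127.0.0.1), port (25) and addresses are assumptions; it just speaks plain SMTP to a local Postfix instance in a loop so that the master and qmgr processes stay busy.

#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* send one SMTP command (or nothing) and discard the server's reply line */
static void chat(int fd, const char *cmd)
{
        char buf[512];
        ssize_t n;

        if (cmd)
                n = write(fd, cmd, strlen(cmd));
        n = read(fd, buf, sizeof(buf));
        (void)n;
}

int main(void)
{
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port   = htons(25),   /* assumed local Postfix on port 25 */
        };

        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        for (;;) {
                int fd = socket(AF_INET, SOCK_STREAM, 0);

                if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                        return 1;

                chat(fd, NULL);                                  /* 220 greeting           */
                chat(fd, "HELO loadgen\r\n");
                chat(fd, "MAIL FROM:<loadgen@localhost>\r\n");   /* hypothetical sender    */
                chat(fd, "RCPT TO:<root@localhost>\r\n");        /* hypothetical recipient */
                chat(fd, "DATA\r\n");
                chat(fd, "Subject: load test\r\n\r\nhello\r\n.\r\n");
                chat(fd, "QUIT\r\n");
                close(fd);
        }
        return 0;
}

Running something like this in a tight loop drives the load up quickly, which matches what was observed here: load rose and ssh slowed down, but no panic followed.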
Since all the failing servers had the distributed file system installed, everyone suspected that the distributed file system had caused the kernel panics. However, the business monitoring data showed nothing unusual from the distributed file system during that period, and the data volume was not particularly large either. Even so, I set up the distributed file system on a few virtual machines, wrote a Java program that continuously wrote files into the cluster through the distributed file system client, and started the mail-sending program at the same time to mimic the production environment as closely as possible. After many long runs the online symptoms never appeared, and at that point I had no better way to reproduce them.
Since reproduction had failed, the only option left was to analyze the cause directly from the kernel error messages. The analysis steps were simple: first locate the code where the error was raised, then analyze the surrounding code. The detailed analysis process is also recorded in last year's documents.
Based on the code analysis and similar bugs reported online, the problem was basically pinpointed to an overflow in the CPU scheduler's time calculation, which made the watchdog throw a panic and hang the kernel. Following that conclusion, I then tried to construct the time-overflow condition by modifying the kernel: I used a kernel module to change the counter value used by the system clock. The modification itself succeeded, but unfortunately the kernel died on the spot, so reproducing the problem by directly modifying the kernel failed as well.
Later I also consulted a number of kernel experts outside the company. Based on the information we provided they each gave their own analysis, but none of them had a good way to reproduce the problem or a definitive root cause, and the conclusions from different people varied considerably.
So after tracking this problem continuously for two to three weeks, phase one ended without a definite conclusion.
3. Phase Two
The new year began, and on the very first day I started preparing to track this problem again. I set myself a simple plan: spend 5 to 8 o'clock every day analyzing and locating the kernel problem, and pick up kernel knowledge along the way.
This time I changed my angle from the start. Last year I had analyzed the kernel error logs of a single server, and that approach had run out of road, so I decided to analyze the logs of all the failing servers together (fortunately I had asked operations for the kernel logs of every failing server and saved them, otherwise we would never have known what hit us) and look for what they had in common. Their first common point was the warning printed by the trace subsystem, "Delta way too big!…", but according to the related information this alone does not hang a Linux system, and indeed not all of our production servers had become unreachable over ssh. I nevertheless found a similar bug on the Red Hat website (url: https:///knowledge/solutions/70051) along with a resolution. The bug information and resolution are as follows:
why kernel is throwing "Delta way too big" out with
WARNING: at kernel/trace/ring_buffer.c:1988 rb_reserve_next_event+0x2ce/0x370 messages
Issue
kernel is throwing "Delta way too big" out with kernel oops on server
Environment
•Red Hat Enterprise Linux 6 service pack 1
Resolution
The "Delta way too big" warning might appear on a system with an unstable sched clock right after the system is resumed and tracing was enabled during the suspend.
Since it's not really a bug, and the unstable sched clock is working fast and reliably otherwise, we suggested to keep using the sched clock in any case and simply to disable tracing by # echo 0 > /sys/kernel/debug/tracing/tracing_on
Root Cause
In this case ftrace was involved due to the call to ftrace_raw_event_power_end (debugfs is mounted and ftrace loaded in this case); the messages have to do with problems calculating a time stamp for a ring buffer entry.
Message comes from here and appears to be indicating problems with time stability.
1966 static int
1967 rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
1968                   u64 *ts, u64 *delta)
1969 {
1970         struct ring_buffer_event *event;
1971         static int once;
1972         int ret;
1973
1974         if (unlikely(*delta > (1ULL << 59) && !once++)) {
1975                 int local_clock_stable = 1;
1976 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
1977                 local_clock_stable = sched_clock_stable;
1978 #endif
1979                 printk(KERN_WARNING "Delta way too big! %llu"
1980                        " ts=%llu write stamp = %llu\n%s",
1981                        (unsigned long long)*delta,
1982                        (unsigned long long)*ts,
1983                        (unsigned long long)cpu_buffer->write_stamp,
1984                        local_clock_stable ? "" :
1985                        "If you just came from a suspend/resume,\n"
1986                        "please switch to the trace global clock:\n"
1987                        "echo global > /sys/kernel/debug/tracing/trace_clock\n");
1988                 WARN_ON(1);
This is called from rb_reserve_next_event() here.
2122         /*
2123          * Only the first commit can update the timestamp.
2124          * Yes there is a race here. If an interrupt comes in
2125          * just after the conditional and it traces too, then it
2126          * will also check the deltas. More than one timestamp may
2127          * also be made. But only the entry that did the actual
2128          * commit will be something other than zero.
2129          */
2130         if (likely(cpu_buffer->tail_page == cpu_buffer->commit_page &&
2131                    rb_page_write(cpu_buffer->tail_page) ==
2132                    rb_commit_index(cpu_buffer))) {
2133                 u64 diff;
2134
2135                 diff = ts - cpu_buffer->write_stamp;
2136
2137                 /* make sure this diff is calculated here */
2138                 barrier();
2139
2140                 /* Did the write stamp get updated already? */
2141                 if (unlikely(ts < cpu_buffer->write_stamp))
2142                         goto get_event;
2143
2144                 delta = diff;
2145                 if (unlikely(test_time_stamp(delta))) {
2146
2147                         commit = rb_add_time_stamp(cpu_buffer, &ts, &delta); <---- HERE
This has to do with time stamping for ring buffer entries.
From the information above you can see that this is exactly the same code and the same approach I had analyzed last year; I just could not be certain about the cause, since I was unable to reproduce it and could only speculate.
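As a small aside, a user-space sketch (my own illustration, not kernel code) shows where the astronomically large delta comes from: all of these values are unsigned 64-bit, so if the timestamp read for an event ends up even slightly behind the buffer's write_stamp, the subtraction wraps around to a number close to 2^64, far above the 1ULL << 59 sanity limit checked in rb_add_time_stamp().

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* hypothetical timestamps in nanoseconds */
        uint64_t write_stamp = 1000000000ULL;  /* last event committed at t = 1s        */
        uint64_t ts          =  999995000ULL;  /* this read of the clock is 5us behind  */

        uint64_t delta = ts - write_stamp;     /* unsigned subtraction wraps to ~2^64   */

        printf("delta = %llu\n", (unsigned long long)delta);
        printf("limit = %llu (1ULL << 59)\n", (unsigned long long)(1ULL << 59));
        printf("Delta way too big? %s\n", delta > (1ULL << 59) ? "yes" : "no");
        return 0;
}

The quoted kernel code does check ts < write_stamp before using the delta, so the real-world trigger needs the interrupt race described in the comment block above, or an unstable sched clock across a suspend/resume as the Red Hat article says; the sketch only illustrates why the resulting delta is so huge.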
I then moved on to the other error that appeared in the kernel logs: "BUG: soft lockup - CPU#N stuck for 4278190091s! [qmgr/master:PID]". I have slightly edited this message: the N after CPU# is a concrete CPU number that differs from server to server, and the process name and PID in the brackets also differ, although it is always qmgr or master. The statistics are shown in the table below:
IP     Time       Error log type and process
107    13:01:20   1 qmgr
108    14:03:34   1 master; 2 qmgr
109    14:05:44   1 qmgr;   2 master
110    14:22:44   1 qmgr
111    14:19:58   1 qmgr;   2 master
112    14:17:12   1 qmgr;   2 master
113    14:22:49   1 qmgr;   2 master
114    14:19:58   1 qmgr;   2 master
Error type 1 is the error mentioned above that does not hang the kernel; type 2 is the error analyzed here, which does lead to a Linux kernel panic. As the table shows, only 107 and 110 had not hung at the time.
Does Red Hat Enterprise Linux 6 or 5 have a reboot problem which is caused by sched_clock() overflow around 208.5 days?
Issue
•Linux Kernel panics when sched_clock() overflows after an uptime of around 208.5 days.
•Red Hat Enterprise Linux 6.1 system reboots with sched_clock() overflow after an uptime of around 208.5 days
•This symptom may happen on the systems using the CPU which has TSC.
•A process showing BUG: soft lockup - CPU#N stuck for 4278190091s!
Environment
•Red Hat Enterprise Linux 6
◦Red Hat Enterprise Linux 6.0, 6.1 and 6.2 are affected
◦several kernels affected, see below
◦TSC clock source - **see root cause
•Red Hat Enterprise Linux 5
◦Red Hat Enterprise Linux 5.3, 5.6, 5.8: please refer to the resolution section for affected kernels
◦Red Hat Enterprise Linux 5.0, 5.1, 5.2, 5.4, 5.5, 5.7: all kernels affected
◦Red Hat Enterprise Linux 5.9 and later are not affected
◦TSC clock source - **see root cause
•An approximate uptime of around 208.5 days.
Resolution
•Red Hat Enterprise Linux 6
◦Red Hat Enterprise Linux 6.x: update to kernel-2.6.32-279.el6 (from RHSA-2012-0862) or later. This kernel is already part of RHEL6.3GA. This fix was implemented with (private) bz765720.
◦Red Hat Enterprise Linux 6.2: update to kernel-2.6.32-220.4.2.el6 (from RHBA-2012-0124) or later. This fix was implemented with (private) bz781974.
◦Red Hat Enterprise Linux 6.1 Extended Update Support: update to kernel-2.6.32-131.26.1.el6 (from RHBA-2012-0424) or later. This fix was implemented with (private) bz795817.
•Red Hat Enterprise Linux 5
◦architecture x86_64/64bit
■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.
■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.11.1.el5 (from RHSA-2012-1061) or later.
■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.
■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.
◦architecture x86/32bit
■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.
■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.13.1.el5 (from RHSA-2012-1174) or later.
■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.
■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.
Root Cause
•An insufficiently designed calculation in the CPU accelerator in the previous kernel caused an arithmetic overflow in the sched_clock() function. This overflow led to a kernel panic or any other unpredictable trouble on the systems using the Time Stamp Counter (TSC) clock source.
•This problem will occur only when system uptime becomes 208.5 days or exceeds 208.5 days.
•This update corrects the aforementioned calculation so that this arithmetic overflow and kernel panic can no longer occur under the circumstances.
•On Red Hat Enterprise Linux 5, this problem is a timing issue and very very rare to happen.
•**Switching to another clocksource is usually not a workaround for most customers as the TSC is a fast access clock whereas the HPET and PMTimer are both slow access clocks. Using notsc would be a significant performance hit.
Diagnostic Steps
Note:
This issue could likely happen in numerous locations that deal with time in the kernel. For example, a user running a non-Red Hat kernel had the kernel panic with a soft lockup in __ticket_spin_lock.
From the information above we can now fully confirm that this is a bug in the Linux kernel, and its cause is described in detail above: on kernels for the x86_64 architecture, an uptime of more than 208.5 days leads to the overflow.
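To make the 208.5-day figure concrete, here is a small user-space sketch of the arithmetic. It is my reconstruction of the old x86 cycles-to-nanoseconds conversion (roughly ns = (cycles * cyc2ns_scale) >> 10 in the 2.6.32-era code); the 2.4 GHz TSC frequency is an assumption and the exact constants can differ between builds. The point is that the 64-bit product wraps once uptime reaches 2^64 / (1000 * 2^10) nanoseconds, about 208.5 days, independent of the CPU frequency.

#include <stdio.h>
#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10   /* as in the 2.6.32-era x86 timer code */

/* the 64-bit multiply wraps silently, just like the old in-kernel conversion */
static uint64_t cyc2ns(uint64_t cycles, uint64_t scale)
{
        return (cycles * scale) >> CYC2NS_SCALE_FACTOR;
}

int main(void)
{
        uint64_t cpu_khz = 2400000ULL;   /* assumed 2.4 GHz TSC */
        uint64_t scale   = (1000000ULL << CYC2NS_SCALE_FACTOR) / cpu_khz;

        /* the largest cycle count whose product with 'scale' still fits in 64 bits */
        uint64_t limit   = UINT64_MAX / scale;

        uint64_t before  = cyc2ns(limit, scale);
        uint64_t after   = cyc2ns(limit + 1000, scale);

        printf("uptime at the overflow point: %.1f days\n",
               (double)before / 1e9 / 86400.0);
        printf("sched_clock just before the wrap: %llu ns\n",
               (unsigned long long)before);
        printf("sched_clock just after the wrap:  %llu ns (jumped backwards)\n",
               (unsigned long long)after);
        return 0;
}

Once the product wraps, sched_clock() suddenly jumps backwards to a tiny value, which is the kind of inconsistency behind the absurd "stuck for 4278190091s" soft-lockup message listed in the Red Hat symptoms above.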
Although the information above confirmed the cause of the kernel panic, I wanted to know whether Taobao's kernel engineers had run into the same problem, so I contacted a Taobao kernel engineer I had chatted with before on QQ to confirm it. The result: they had hit the same error, could not reproduce it either, and their solution was likewise to upgrade the kernel.
4. Summary
At this point the problem can basically be considered confirmed. The troubleshooting process was hard, but when you finally get the answer you were looking for, the excitement is beyond words. It has nothing to do with promotions or raises; this is simply the joy of technology.
