DMS (Dead Man Switch)

Updated: 2023-07-02 14:15:38

1. Introduction to DMS:

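DMS (deadman switch) refers to a kernel extension that can halt the system before it crashes and produce a dump file for later analysis.
DMS exists to protect shared external disks and their data. When the system hangs for longer than a set time limit, DMS automatically halts it so that the HACMP standby node can take over. This protects the data, keeps the business running, and avoids potential problems, particularly with external disk arrays.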

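2. Causes of DMS:

DMS is mainly triggered for the following reasons:
a. An application runs at a higher priority than the clstrmgr daemon, so clstrmgr cannot reset the DMS counter in time.
b. Heavy I/O activity on the system leaves the CPU no time to respond to the clstrmgr daemon.
c. Memory leak or overflow problems.
d. Heavy system error log activity (for example, token-ring beaconing problems).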

3. How to check whether a DMS timeout has occurred

This can be determined by analyzing the dump file, for example:

# crash /dev/lv00
Using /unix as the default namelist file.
> cpu
Selected cpu number : 0
> stat
------sysname: AIX
------nodename: sp13
------release: 3
------version: 4
------machine: 00091968A400
------time of crash: Sat Aug 31 04:36:52 EDT 2002
------age of system: 5 day, 21 hr., 6 min.
------xmalloc debug: disabled
------abend code: 700
------csa: 0x438eb0
------exception struct:
------0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
------panic: HACMP for AIX dms timeout - ha
.
> status
CPU TID TSLOT --PID PSLOT STOPPED PROC_NAME
0 --205 ----2 --204 ----2 ----yes wait
1 --307 ----3 --306 ----3 ----yes wait
2 --409 ----4 --408 ----4 ----yes wait
3 --50b ----5 --50a ----5 ----yes wait

4 --60d ----6 --60c ----6 ----yes wait
5 -1867 ---24 -125a ---18 ----yes errdemon
6 --811 ----8 --810 ----8 ----yes wait
7 --913 ----9 --912 ----9 ----yes wait
> t -mk
Skipping first MST
.
MST STACK TRACE:
0x00438eb0 (excpt=00000000:00000000:00000000:00000000:00000000)
(intpri=5)
IAR: -----.panic_trap+0 (00012678): tweq r1,r1
LR: ------.[dms:dead_man_sw_handler]+18 (0171335c)
00438d40: .[dms:timeout_end]+4c (01713b98)
00438d80: .clock+134 (0002e9a8)
00438de0: .i_softmod+2a8 (0001c3b0)

00438e70: flih_603_patch+cc (00028b74)
.
0x2ff3b400 (excpt=00000000:00000000:00000000:00000000:00000000)
(intpri=11)
IAR: -----.waitproc_find_run_queue+c0 (000255e0): addic r3,r0,-4
LR: ----- .waitproc+a0 (00025aa4)
2ff3b328: .waitproc+a0 (00025aa4)
2ff3b388: .procentry+14 (00098288)
2ff3b3c8: .low+0 (00000000)
.
> symptom
PIDS/5765C3403 LVLS/430 PCSS/SPI1 MS/700 FLDS/panic_tra VALU/7c810808
FLDS/[dms:dead VALU/18

Alternatively, check the error log with errpt, for example:

errpt -a
-------
LABEL: ----- ----- KERNEL_PANIC
IDENTIFIER: ----- 225E3B63

Date/Time: -------Thu Apr 25 21:26:16
Sequence Number: 609
Machine Id: ----- 0040613A4C00
Node Id: ---------localhost
Class: ------------S
Type: -------------TEMP
Resource Name: ---PANIC

Description

SOFTWARE PROGRAM ABNORMALLY TERMINATED

---Recommended Actions
---PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ASSERT STRING

PANIC STRING
HACMP for AIX dms timeout - halting hung node
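The check on the error log can be automated. The following is a minimal sketch, not an official tool: it assumes the text of `errpt -a` has already been captured to a string, that each detailed entry begins with a `LABEL:` line, and that fields are padded with spaces as on a real terminal. The function names `split_entries` and `find_dms_panics` are hypothetical.

```python
# Sketch: find DMS timeouts in captured `errpt -a` output.
# Assumption: entries start with "LABEL:" and the panic text is the
# line that follows "PANIC STRING", as in the listing above.

DMS_PANIC = "HACMP for AIX dms timeout"


def split_entries(errpt_text):
    """Split captured errpt -a text into lists of lines, one list per entry."""
    entries, current = [], []
    for line in errpt_text.splitlines():
        if line.startswith("LABEL:") and current:
            entries.append(current)
            current = []
        current.append(line)
    if current:
        entries.append(current)
    return entries


def find_dms_panics(errpt_text):
    """Return (date_time, panic_string) for every entry that records a DMS timeout."""
    hits = []
    for entry in split_entries(errpt_text):
        date_time = None
        panic = None
        for i, line in enumerate(entry):
            if line.startswith("Date/Time:"):
                date_time = line.split(":", 1)[1].strip()
            elif line.strip() == "PANIC STRING" and i + 1 < len(entry):
                panic = entry[i + 1].strip()
        if panic and DMS_PANIC in panic:
            hits.append((date_time, panic))
    return hits
```

Calling `find_dms_panics(text)` on a captured log returns one tuple per DMS panic, which makes it easy to see how often a node was halted and when.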

4. Ways to avoid DMS:

a. Tune the system's I/O pacing

For example, run # smitty chsys and adjust the high and low water marks as follows:

Maximum number of PROCESSES allowed per user -----[128]
Maximum number of pages in block I/O BUFFER CACHE [20]
Maximum Kbytes of real memory allowed for MBUFS --[0]
Automatically REBOOT system after a crash --------false
Continuously maintain DISK I/O history -----------false
HIGH water mark for pending write I/Os per file --[33]
LOW water mark for pending write I/Os per file ---[24]
Amount of usable physical memory in Kbytes -------262144
State of system keylock at boot time -------------normal
Enable full CORE dump ----------------------------false
Use pre-430 style CORE dump ----------------------false
Enable CPU Guard ---------------------------------disable
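b. Increase the syncd flush frequency (the system default is every 60 seconds)

If the customer has HACMP 4.4.0 or later installed, this can be set directly from the HACMP menus.
Changing it to 10 seconds is recommended.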

smitty cm_tuning_parms_chsyncd
Type or select values in entry fields.
Press Enter AFTER making all desired changes.

syncd frequency (in seconds) ----[60]

Esc+1=Help -Esc+2=Refresh Esc+3=Cancel Esc+4=List
Esc+5=Reset Esc+6=Command Esc+7=Edit --Esc+8=Image
Esc+9=Shell Esc+0=Exit ---Enter=Do

If the HACMP version is older, change the syncd interval in the /sbin/rc.boot file instead.

For example:
echo "Starting the sync daemon" | alog -t boot
nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &

c. Slow down the heartbeat failure detection rate:

smitty cm_config_networks.chg_pre.select

Change a Cluster Network Module using Predefined Values

Type or select values in entry fields.
Press Enter AFTER making all desired changes.

---- ---- ---- ---- ---- ---- -[Entry Fields]
* Network Module Name ---- - IP
New Network Module Name ----[]
Description ---- ---- ---- ---[Generic IP]
Failure Detection Rate ---- -Normal >> slow

d. Tune the network parameters:
# no -a
extendednetstats = 0
thewall = 6048
sockthresh = 85

sb_max = 1048576
somaxconn = 1024
clean_partial_conns = 0
net_malloc_police = 1
net_malloc_frag_mask = 0
rto_low = 1

# no -o thewall=131052

# no -a

extendednetstats = 0
thewall = 131052
sockthresh = 85
sb_max = 1048576

somaxconn = 1024
clean_partial_conns = 0
net_malloc_police = 1
net_malloc_frag_mask = 0
rto_low = 1
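To confirm that a change made with `no -o` actually took effect, the two `no -a` listings can be compared programmatically. This is a minimal sketch under the assumption that both listings have been saved as text and that every tunable appears as a `name = value` line, as shown above. The helper names `parse_no_output` and `changed_params` are hypothetical.

```python
# Sketch: compare two captured `no -a` listings and report what changed.
# Assumption: each tunable is printed as "name = value" on its own line.


def parse_no_output(text):
    """Parse 'name = value' lines into a dict of strings, ignoring other lines."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            params[name.strip()] = value.strip()
    return params


def changed_params(before, after):
    """Return {name: (old, new)} for tunables whose value differs between listings."""
    old = parse_no_output(before)
    new = parse_no_output(after)
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}
```

For the two listings in this section, the only difference reported would be `thewall`, changing from 6048 to 131052.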

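e. If the customer has HACMP installed and a DMS timeout has occurred, check whether the machine is running power management software. If it is, disable power management. For example: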
smitty pm

-------------------------------Power Management

Move cursor to desired item and press Enter.

Enable / Disable Power Management State Transition

Configure / Unconfigure Power Management
System State Transition from Enable State
Change / Show Characteristics of Power Management
Power Management Timer
Display Power Management
Power Management Characteristics of Each Device
Battery

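To handle node failure correctly, a cluster must determine whether a node is dead. The deadman switch makes this determination using the parameters configured for failure detection.
Problems with I/O, memory, and similar resources can prevent the cluster manager from handling node communication properly, so that a cluster node is wrongly declared dead.
Some parameters therefore need to be tuned:
1. I/O pacing
2. syncd
3. Increase the amount of memory available to the communications subsystem
4. Change the failure detection rate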

HACMP DEADMAN SWITCH TIMEOUT CONDITIONS

ITEM: RTA000065193

An SE has some problems at a customer site here in Japan with HACMP V2.1
installed.
The system consists of two RS/6000 590s with two 9333-501s as shared
disks, and the HACMP configuration is idle-standby.
The problem was that the system crashed on a server node and takeover
occurred once or twice a week as CPU utilization increased.

They investigated the system dump and concluded that the HACMP deadman
switch caused this crash.
They opened the PMR (326X6611760) and, on the advice of the PMR, changed
the "cycles_to_fail" parameter of clstrmgr from 4 (the default value) to
12. Since then, no crash has occurred.

I also suggested that they tune the I/O pacing parameter in addition to
the "cycles_to_fail" parameter.
The customer, however, is worried that the crash problem might occur
again as CPU load increases further toward the end of the year.

Could you clarify the following points to mitigate their concerns?

Q1. Could you explain what the "kernel lock" means?
In the README file of PTF U432018, the chapter "AIX Kernel Lock when
Using HACMP/6000" lists several causes that make the deadman switch
halt the system. In that chapter, the second paragraph says that "This
is due to the 3.2.x AIX kernel being built in a way that causes events
to be threaded through a single kernel lock. During I/O intensive
operations (df, find, etc.), the cluster manager process may be
required to wait too long for a chance at the kernel lock."
Is there a lock function used by clstrmgr to obtain AIX kernel
services? Or does it just mean that processes are serialized when
accessing the I/O queue?

Q2. I understand that tuning I/O pacing and "cycles_to_fail" depends on
trial-and-error testing.
But are there any performance parameters or data that can be monitored
to anticipate a trend toward a deadman switch timeout?
The customer wants to know how CPU utilization or I/O utilization is
related to the probability of the timeout

