[Summary] 9i RAC LMON: terminating instance due to error 29702
---------- Resolving Bug 4390716
Compiled by: 王琦
Date: 2008-02-18
Basic configuration:
Linux AS3.0
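Kernel version: 2.4.21-37.ELsmp
Oracle 9.2.0.4 upgraded to Oracle 9.2.0.7, RAC, two nodes. The clusterware was installed from the 9.2.0.4 software.
Later investigation showed that when an Oracle 9.2.0.4 RAC system is upgraded to Oracle 9.2.0.7, the Oracle RDBMS software can be upgraded to 9.2.0.7, but the 9.2.0.7 patchset does not include an upgraded version of the ORACM cluster manager. This is an Oracle bug (9.2.0.7.0 Bug 4163445). The 9.2.0.4 clusterware software ORACM can only be upgraded from the Oracle 9.2.0.6 patchset. (The Oracle CM log shows that the ORACM installed with Oracle 9.2.0.4 is oracm 9.2.0.2.0; the 9.2.0.7 patch does not upgrade ORACM, and the 9.2.0.6 patch upgrades it to oracm 9.2.0.6.0.52.) Note that every upgrade step must strictly follow the README, although Oracle's README does not necessarily cover everything, as this problem shows.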
www.itpub/viewthread.php?tid=922265&extra=&highlight=%2Btolywang&page=3
(9.2.0.7.0 Bug 4163445)
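The version check described above can be sketched in Python. This is only an illustration: the exact layout of the CM log line is not shown in this post, so the sketch assumes the version appears as the word "oracm" followed by dotted numbers (as in "oracm 9.2.0.2.0"), and the function names are hypothetical.

```python
import re

# Assumption: the CM log prints the version as "oracm <dotted numbers>",
# e.g. "oracm 9.2.0.2.0" or "oracm 9.2.0.6.0.52".
ORACM_VERSION = re.compile(r"\boracm\s+(\d+(?:\.\d+)+)", re.IGNORECASE)


def oracm_versions(log_text):
    """Return the distinct ORACM version strings found in log_text, in order."""
    seen = []
    for match in ORACM_VERSION.finditer(log_text):
        version = match.group(1)
        if version not in seen:
            seen.append(version)
    return seen


def needs_oracm_upgrade(version, minimum="9.2.0.6"):
    """True if version is lower than minimum, compared number by number."""
    def as_tuple(text):
        return tuple(int(part) for part in text.split("."))
    return as_tuple(version) < as_tuple(minimum)
```

With these helpers, "9.2.0.2.0" is flagged as needing the 9.2.0.6 patchset, while "9.2.0.6.0.52" is not.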
Problem description:
The symptoms were as follows (the instances on node 1 and node 2 crashed alternately, roughly once every 5 to 8 days):
alert_orcl1.log
-------------------------------------------------------------------------------------------------
Sat Jan 5 18:44:19 2008
ARC1: Evaluating archive log 1 thread 1 sequence 122
ARC1: Beginning to archive log 1 thread 1 sequence 122
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_122.dbf'
ARC1: Completed archiving log 1 thread 1 sequence 122
Sat Jan 5 19:36:06 2008
Thread 1 advanced to log sequence 124
Current log# 4 seq# 124 mem# 0: /ocfs_ctrl_redo/orcl/redo04.log
Current log# 4 seq# 124 mem# 1: /ocfs_data/orcl/redo04b.log
Sat Jan 5 19:36:06 2008
ARC1: Evaluating archive log 3 thread 1 sequence 123
ARC1: Beginning to archive log 3 thread 1 sequence 123
Creating archive destination LOG_ARCHIVE_DEST_1: '/ocfs_arch1/orcl/1_123.dbf'
ARC1: Completed archiving log 3 thread 1 sequence 123
Sat Jan 5 19:45:15 2008
Errors in file /u01/product/admin/orcl/bdump/orcl1_:
ORA-29702: error occurred in Cluster Group Service operation
Sat Jan 5 19:45:15 2008
LMON: terminating instance due to error 29702
Sat Jan 5 19:45:16 2008
System state dump is made for local instance
Sat Jan 5 19:45:20 2008
Instance terminated by LMON, pid = 14214
Sat Jan 5 19:54:53 2008
Starting ORACLE instance (normal)
Sat Jan 5 19:54:53 2008
Global Enqueue Service Resources = 26694, pool = 4
Sat Jan 5 19:54:53 2008
Global Enqueue Service Enqueues = 39350
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 2
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.7.0.
System parameters with non-default values:
processes = 1000
timed_statistics = FALSE
resource_limit = TRUE
shared_pool_size = 419430400
large_pool_size = 33554432
java_pool_size = 33554432
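To confirm the "once every 5 to 8 days" pattern from the alert logs themselves, the LMON terminations can be pulled out with a short script. This is a minimal sketch, assuming the alert log uses the timestamp layout shown above ("Sat Jan 5 19:45:15 2008") on a line of its own before each message; the function names are hypothetical.

```python
from datetime import datetime

# Timestamp layout as it appears in the alert log excerpt above.
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_stamp(line):
    """Return a datetime if the line is an alert log timestamp, else None."""
    try:
        return datetime.strptime(line.strip(), STAMP_FORMAT)
    except ValueError:
        return None


def lmon_terminations(alert_text, error_code=29702):
    """Return the timestamps that precede each LMON termination message."""
    needle = "LMON: terminating instance due to error %d" % error_code
    last_stamp = None
    hits = []
    for line in alert_text.splitlines():
        stamp = parse_stamp(line)
        if stamp is not None:
            last_stamp = stamp
        elif needle in line and last_stamp is not None:
            hits.append(last_stamp)
    return hits


def gaps_in_days(stamps):
    """Return the interval in days between consecutive timestamps."""
    return [(b - a).total_seconds() / 86400 for a, b in zip(stamps, stamps[1:])]
```

Running lmon_terminations over the full alert log of each node and passing the result to gaps_in_days gives the crash intervals per node.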
$ vi /u01/product/admin/orcl/bdump/orcl1_
=============
/u01/product/admin/orcl/bdump/orcl1_
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: DELL-RAC01
Release: 2.4.21-37.ELsmp
Version: #1 SMP Wed Sep 7 13:28:55 EDT 2005
Machine: i686
Instance name: orcl1
Redo thread mounted by this instance: 0 <none>
Oracle process number: 4
Unix process pid: 14214, image: oracle@DELL-RAC01 (LMON)
*** SESSION ID:(3.1) 2007-12-31 12:07:45.591
GES IPC: Receivers 3 Senders 3
GES IPC: Buffers Receive 1000 Send (i:2230 b:2230) Reserve 1000
GES IPC: Msg Size Regular 396 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 48, ack 53
Batching factor: cache replay 34 size per lock 56
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 1 Oracle 9i Reference CM)
CMCLI WARNING: CMInitContext: init ctx(0xb6d93f8)
*** 2007-12-31 12:07:49.396
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2007-12-31 12:07:49.396
Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 122787
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 1 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 1 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 1 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 1 5.
Name Service normal
Name Service recovery done
*** 2007-12-31 12:07:49.611
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 1 6.
*** 2007-12-31 12:07:49.832
*** 2007-12-31 12:07:49.832
Reconfiguration started (old inc 0, new inc 1)
Synchronization timeout interval: 600 sec
List of nodes:
Global Resource Directory frozen
node 0
release 9 2 0 7
* kjshashcfg: I'm the only node in the cluster (node 0)
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 2007-12-31 12:07:50.121
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 0 GCS resources
0 PIs marked suspect, 0 flush PI msgs
ORACM log entries at that time: ERROR: WriteEventPort: write failed with error 32
------------------------------------------------------------
Debug Hang :ClientProcListener (PID=14257) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:688145 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
>ERROR: WriteEventPort: write failed with error 32., tid = ClientProcListener:688145 file = unixinc.c, line = 915 {Sat Jan 5 19:45:16 2008 }
Debug Hang :ClientProcListener (PID=14261) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:622615 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
Debug Hang :ClientProcListener (PID=14255) UnRegistered with watchdog daemon. {Sat Jan 5 19:45:16 2008 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = ClientProcListener:557077 file = unixinc.c, line = 767 {Sat Jan 5 19:45:16 2008 }
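The "error 32" in the ORACM log can be looked up as an operating system error number. This is an assumption: the post does not say that ORACM reports a raw errno here, but on Linux errno 32 is EPIPE (broken pipe), which would be consistent with the "socket closed by peer" warning logged just before it. A minimal Python lookup, with a hypothetical helper name:

```python
import errno
import os


def describe_errno(code):
    """Return the symbolic name and OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s (%s)" % (name, os.strerror(code))


# Assumption: the 32 in "write failed with error 32" is an OS errno.
print(describe_errno(32))
```

Run this on the same platform as the cluster nodes, since errno numbering is platform specific.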
DIAG trace log:
/u01/product/admin/orcl/bdump/orcl2_
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/product/oracle
System name: Linux
Node name: DELL-RAC02
Release: 2.4.21-37.ELsmp
Version: #1 SMP Wed Sep 7 13:28:55 EDT 2005
Machine: i686
Instance name: orcl2
Redo thread mounted by this instance: 0 <none>
Oracle process number: 3
Unix process pid: 14211, image: oracle@DELL-RAC02 (DIAG)
*** SESSION ID:(2.1) 2008-01-16 12:16:14.524
CMCLI WARNING: CMInitContext: init ctx(0xb9115f4)
kjzcprt:rcv port created
Node id: 1
List of nodes: 0, 1,
*** 2008-01-16 12:16:14.526
Reconfiguration starts [incarn=0]
I'm the voting node
Send my bitmap to master 0
Rcfg confirmation is received from master 0
I agree with the rcfg confirmation
*** 2008-01-16 12:16:25.233
Reconfiguration completes [incarn=2]
*** 2008-01-19 04:50:21.933
Instance is terminating by process 14215 [ospid=oracle@DELL-RAC02 (LMON)]
Performing diagnostic data dump for this instance
CMCLI WARNING: CommonContextCleanup: closing comm port
DIAG detachs from CM
error 29723 detected in background process
OPIRIP: Uncaught error 447. Error stack:
ORA-00447: fatal error in background process
ORA-29723: Failed to attach to the global enqueue service (status=32)
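Judging from the error description on MetaLink, the cause appears to be that libskgxn9.so is inconsistent between the two instances of the RAC environment.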
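One way to check whether libskgxn9.so really differs between the nodes is to compare a checksum of the file taken on each node. The following Python sketch is only an illustration: the location of the library (commonly under the lib directory of ORACLE_HOME) is an assumption and should be verified on your own system, and the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def libraries_match(digests_by_node):
    """True if every node reported the same digest.

    digests_by_node maps a node name (for example "DELL-RAC01") to the
    digest computed on that node with sha256_of().
    """
    return len(set(digests_by_node.values())) == 1
```

Run sha256_of on each node against that node's copy of the library, collect the results into one dictionary, and pass it to libraries_match.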
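Resolution:
1. Because this was an upgrade from Oracle 9.2.0.4 to Oracle 9.2.0.7, and 9.2.0.7 ships no upgraded ORACM software, only the RDBMS software, ORACM must additionally be upgraded from oracm 9.2.0.2 to oracm 9.2.0.6.0.52 using the 9.2.0.6 patchset. Be sure to follow the README strictly.
2. After upgrading Oracle RDBMS and ORACM 9.2.0.6, you still need to run scripts such as catproc.sql to update the data dictionary; these are all listed in the README.
3. Some bugs are not published: they cannot be found through Google or Baidu and can only be seen on MetaLink. And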