Oracle Database ASM Storage Management: Storage I/O Failure with Disks Not Offlined in Time — Analysis and Fault-Analysis Template
01 Background
1. Oracle 12.2 RAC + ASM Normal Redundancy: database storage uses a dual-array redundant architecture, so that a single-array failure cannot cause a service interruption or data loss;
2. Each ASM DiskGroup is designed with 2 Failgroups (FG): all disks of one FG reside on storage array #1, and all disks of the other FG on storage array #2;
3. The expectation is that any array failure or power loss leaves the database instances unaffected with no data loss, and that data resynchronizes automatically once the failed array comes back online.
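The dual-failgroup layout described above can be sketched roughly as follows; the diskgroup name matches the one seen later in the logs, but the disk paths and failgroup names here are hypothetical:

```sql
-- Sketch (hypothetical /dev/rdisk paths): a NORMAL redundancy diskgroup with
-- one failgroup per storage array, so each array holds a complete mirror
-- copy of every extent.
CREATE DISKGROUP dg_data_fab NORMAL REDUNDANCY
  FAILGROUP fg_storage1 DISK
    '/dev/rdisk/stor1_data01',
    '/dev/rdisk/stor1_data02'
  FAILGROUP fg_storage2 DISK
    '/dev/rdisk/stor2_data01',
    '/dev/rdisk/stor2_data02'
  ATTRIBUTE 'compatible.asm' = '12.2', 'compatible.rdbms' = '12.2';
```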
During the actual high-availability test, after pulling the cables of one storage array, we observed the following:
1. The CRS cluster was unaffected; OCR/votedisk failed over automatically;
2. The DB hit I/O errors on the controlfile/redo logs, and core processes such as LGWR/CKPT were blocked for a long time, after which Oracle proactively restarted the DB instance (one or both instances); the database then returned to normal;
3. The data was intact, and after the failed array came back online it resynchronized automatically.
02 Test Process
1) Test Case 1
1. Storage cables pulled: 16:56:05
2. Instances terminated: 16:57:37-16:57:39
ASM alert log:
2018-08-01T16:57:41.712885+08:00
NOTE: ASM client node11:node1:node1-rac disconnected unexpectedly
DB:
2018-08-01T16:57:45.214182+08:00
Instance terminated by USER, pid = 10158
2018-08-01T16:57:36.704927+08:00
Errors in file /oracle/diag/rdbms/node1/node11/trace/node11_:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
ORA-15064: communication failure with ASM instance
2018-08-01T16:57:36.705340+08:00
Errors in file /oracle/diag/rdbms/node1/node11/trace/node11_:
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '+DG_DATA_FAB/NODE1/CONTROLFILE/current.265.981318275'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
ORA-15064: communication failure with ASM instance
The Oracle CKPT process was blocked by controlfile I/O errors, which caused the instance to be proactively restarted; in every test run, "Terminate instance" began after a ~70-second timeout. We suspected that the ASM instance was too slow to offline the failed disks, and hoped to raise the CKPT blocking-time threshold to work around the problem, but could not find a corresponding parameter.
Since the problem shows up on the controlfile, could the long offline-detection time be caused by the DATA diskgroup containing many disks?
We tried moving the controlfile to the REDO DG, which has fewer disks, but the error still occurred on the controlfile:
System state dump file:
----- Beginning of Customized Incident Dump(s) -----
Process CKPT (ospid: 4693) is waiting for event 'control file sequential read'.
Process O009 (ospid: 5080) is the blocker of the wait chain.
===[ Wait Chain ]===
CKPT (ospid: 4693) waits for event 'control file sequential read'.
LGWR (ospid: 4691) waits for event 'KSV master wait'.
O009 (ospid: 5080) waits for event 'ASM file metadata operation'.
node1_
----- END DDE Actions Dump (total 0 csec) -----
ORA-15080: synchronous I/O operation failed to write block 1031 of disk 4 in disk group DG_REDO_MOD
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: 4294967295
Additional information: 1024
NOTE: process _lgwr_node1 (4691) initiating offline of disk 4.4042263303 (DG_REDO_MOD_0004) with mask 0x7e in group 3 (DG_REDO_MOD) with client assisting
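To see how far ASM has gotten with the offline during such a test, a diagnostic query along these lines can be run in the ASM instance (a sketch; columns follow the V$ASM_DISK and V$ASM_ATTRIBUTE views):

```sql
-- Disk state per failgroup: MODE_STATUS moves to OFFLINE once ASM takes the
-- disk out of service, and REPAIR_TIMER counts down toward a forced drop.
SELECT g.name AS diskgroup, d.name AS disk, d.failgroup,
       d.mode_status, d.mount_status, d.repair_timer
  FROM v$asm_disk d
  JOIN v$asm_diskgroup g ON d.group_number = g.group_number;

-- disk_repair_time controls how long an offlined disk is kept before drop.
SELECT group_number, name, value
  FROM v$asm_attribute
 WHERE name = 'disk_repair_time';
```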
2) Test Case 2
We tried multiplexing the controlfile:
1. One 10 GB LUN from each storage array was assigned to the servers;
2. One DG was created on each LUN, and the controlfile was multiplexed across these 2 DGs.
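The multiplexing attempted here would look roughly like this (DG names are hypothetical; the instance must be restarted for the new CONTROL_FILES setting to take effect, with the controlfile copies restored into both diskgroups, e.g. via RMAN):

```sql
-- Sketch: point CONTROL_FILES at one copy in each single-storage diskgroup.
ALTER SYSTEM SET control_files =
    '+DG_CTL1/NODE1/CONTROLFILE/current.ctl',
    '+DG_CTL2/NODE1/CONTROLFILE/current.ctl'
  SCOPE=SPFILE SID='*';
```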
We re-ran the single-array failure test and found that the controlfile still could not be read or written, and the instance still restarted!
The Oracle documentation shows that only ASM failgroups can provide controlfile HA, because every controlfile copy must stay online; otherwise the instance is aborted directly:
Multiplex Control Files on Different Disks
Every Oracle Database should have at least two control files, each stored on a different physical disk. If a control file is damaged due to a disk failure, the associated instance must be shut down. Once the disk drive is repaired, the damaged control file can be restored using the intact copy of the control file from the other disk and the instance can be restarted. In this case, no media recovery is required.
The behavior of multiplexed control files is this:
The database writes to all filenames listed for the initialization parameter CONTROL_FILES in the database initialization parameter file.
The database reads only the first file listed in the CONTROL_FILES parameter during database operation.
If any of the control files become unavailable during database operation, the instance becomes inoperable and should be aborted.
Note:
Oracle strongly recommends that your database has a minimum of two control files and that they are located on separate physical disks.
So this multiplexing approach does not make the controlfile highly available!
3) Test Case 3
We placed the controlfile on a single RPT storage array, avoiding any blocking caused by controlfile mirroring.
Sometimes the test succeeded, but sometimes REDO LOG read/write errors still caused a DB restart!
4) Test Case 4
We created 2 independent DGs, each pointing to a different storage array, and multiplexed the 2 members of each REDO GROUP across the 2 DGs.
The failover test succeeded: the ASM instance dismounted the failed DG, and the database was completely unaffected!
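The redo layout that finally survived the array failure can be sketched as follows (DG names and sizes are hypothetical):

```sql
-- Sketch: each redo group gets one member in each single-storage diskgroup,
-- so losing either array still leaves a usable member of every group.
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 11 ('+DG_REDO1','+DG_REDO2') SIZE 1G;
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 12 ('+DG_REDO1','+DG_REDO2') SIZE 1G;
-- ...then drop the original single-DG redo groups once they are inactive.
```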
From the tests above, we observed the following:
1. ASM failgroups handle database datafiles without any problem and achieve failover;
2. When a Normal-redundancy DG offlines disks holding the controlfile/redo logfiles, the offline blocks abnormally long and the DB instance is proactively restarted; after the restart everything runs normally and data integrity is unaffected!
After many repeated tests, the problem appeared randomly, so we strongly suspected an Oracle bug. On MOS we found a similar one: "Bug 23179662 - ASM B-slave Process Blocking Fatal background Process like LGWR producing ORA-29771 (Doc ID 23179662.8)", but MOS states that this bug was already fixed in the 20180417 PSU, and the workaround is simply to restart the instance.
After a full week without a solution, we adopted the following temporary workaround:
(1) migrate the controlfile to a third storage array;
(2) store the 2 members of each redo group on different arrays using Oracle's multiplexing feature.
But this reintroduces a single-point-of-failure risk for the controlfile. Is there really no way to solve this problem?
Since theory and practice diverge, there must be a reason. I started a new round of exploration and analysis, leaving no suspicious detail unexamined:
03 Re-examination
We migrated the controlfile and redo logs back into the Normal-redundancy diskgroup. During testing, the database instances behaved inconsistently: sometimes normal, sometimes a one-node restart, sometimes a two-node restart; the failure pattern was irregular!
I repeated the tests and carefully mapped out the timeline of the key events; an example follows:
ALERT LOG:
--------------
Filename=alert_p4moddb1.log
2018-08-16T14:56:00.272280+08:00
WARNING: Read Failed. group:2 disk:4 AU:1053 offt:2605056 size:16384
path:/dev/rdisk/MES1_p4_moddb_redo02
incarnation:0xf7e12348 synchronous result:'I/O error'
subsys:System krq:0x9ffffffffd1c0608 bufp:0x9ffffffffd007000 osderr1:0x69c0 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec
WARNING: failed to read mirror side 1 of virtual extent 7 logical extent 0 of file 260 in group [2.3551108175] from disk MES1_REDO02 allocation unit 1053 reason error; if possible, will try another mirror side
NOTE: successfully read mirror side 2 of virtual extent 7 logical extent 1 of file 260 in group [2.3551108175] from disk RPT_REDO01 allocation unit 1052  --> I/O error detected, but the mirror copy was read successfully
……
2018-08-16T14:56:13.489201+08:00  --> numerous I/O operation errors
Errors in file /oracle/diag/rdbms/p4moddb/p4moddb1/trace/p4moddb1_:
ORA-15080: synchronous I/O operation failed to write block 1383 of disk 4 in disk group DG_REDO_MOD
ORA-27063: number of bytes read/written is incorrect
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: 4294967295
Additional information: 1024
WARNING: failed to write mirror side 1 of virtual extent 0 logical extent 0 of file 257 in group 2 on disk 4 allocation unit 277
2018-08-16T14:56:31.050369+08:00
……
ERROR: cannot read disk header of disk MES1_REDO02 (4:4158726984)
2018-08-16T14:56:34.418045+08:00
NOTE: ospid 13682 initiating cluster wide offline of disk 5 in group 2
2018-08-16T14:56:34.418576+08:00
NOTE: process _rms0_p4moddb1 (13666) initiating offline of disk 4.4158726984 (MES1_REDO02) with mask 0x7e in group 2 (DG_REDO_MOD) with client assisting