实战课堂：为什么更换存储之后一切正常但RAC集群启动不了？（上）

2018年6月14日

数据和云

1726

这是一次来自生产实践的真实案例，某客户核心生产库由于进行新老存储替换变更操作后，Oracle RAC 两个节点均无法打开，数据库遭遇严重故障。

先来看看告警日志中记录的错误信息，我们注意到数据库能够正常Mount，但是在Open阶段遇到错误，提示某个数据文件不能被锁定：

ALTER DATABASE MOUNT
This instance was first to mount
Fri Dec 16 03:20:34 2016
Successful mount of redo thread 2, with mount id 3052566649
Fri Dec 16 03:20:40 2016
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT
ALTER DATABASE OPEN
This instance was first to open
Fri Dec 16 03:00:53 2016
Errors in file /diag/ojtdb/ojtdb2/trace/ojtdb2_ora_1569.trc:
ORA-01157: cannot identify/lock data file 1311 - see DBWR trace file
ORA-01110: data file 1311: '/dev/vx/rdsk/ora_ojt10/ora_dev1311'
ORA-1157 signalled during: ALTER DATABASE OPEN

通常遇到这个错误，最常见的原因是存储未正常挂载，或者数据文件损坏丢失。

通过dd检测发现I/O读正常，dbv检测数据文件也正常，这说明存储和数据文件的基本表现都属正常，而且不存在权限问题：

oracle$ time dd if=/dev/vx/rdsk/ora_ojt10/ora_dev1311 of=/dev/null bs=1024k count=1000
1000+0 records in
1000+0 records out

real 0m1.654s
user 0m0.006s
sys 0m0.316s

oracle$ dbv file=/dev/vx/rdsk/ora_ojt10/ora_dev1311 blocksize=8192

那么显然，这个问题跳出了传统的故障原因。

我们再回顾一下错误信息，数据库启动失败源自DBWR报错，可以通过进程跟踪来分析进程的工作状态。

在此我们通过v$session，分析DBW0进程当前的等待事件，排查dbwr进程迟迟不向其他进程发送相关ipc message的原因：

SQL> select event from v$session where paddr in
(select addr from v$process where pid=<13>);
EVENT
------------------------
Disk file operations I/O

DBWR 始终在执行磁盘 I/O 操作，Disk file operations I/O等待事件的官方解释如下：

This event is used to wait for disk file operations (for example, open, close, seek, and resize). It is also used for miscellaneous I/O operations such as block dumps and password file accesses.

此事件用于等待磁盘文件操作（例如，打开，关闭，搜索和调整大小）。它也用于一些其他I / O操作，例如块转储和密码文件访问。

这个等待事件的三个参数p1，p2，p3 含义如下：

持续观察DBWR该事件的参数值，发现 p1=2，意思是 file open，在进行文件打开操作；p3=2，意思是操作数据文件。

而 p2 一直在递增变化，其含义是文件号，不断顺序的打开不同的数据文件。

说明DBW0正在逐个检查并打开数据文件，查询得知该库数据文件有1560个，在逐个检查进行到900秒后，后台alert日志中出现了如下错误，这一次出现问题的是另外一个文件：

ORA-01157: cannot identify/lock data file 1401 - see DBWR trace file
ORA-01110: data file 1401: '/dev/vx/rdsk/ora_ojt10/ora_dev1401'
ORA-1157 signalled during: ALTER DATABASE OPEN

在启动过程中，我们发现从mount到open阶段，报出 ORA-01157 和 ORA-01110 花费的时间很长，需要十几分钟，然后报错，使得启动过程强制终止；

最终是后台dbw进程在输出日志，为了进一步分析该过程中具体流程，于是开启10046事件跟踪dbw0进程：

*** 2016-12-16 12:03:30.798--这里时间是12点03，开始执行open操作
WAIT #0: nam='SQL*Net message from client' ela= 7344875 driver id=1650815232 #bytes=1 p3=0 obj#=-1 tim=8068132685065
XCTEND rlbk=0, rd_only=1, tim=8068132685665
=====================
PARSING IN CURSOR #18446744071440810672 len=19 dep=0 uid=0 oct=35 lid=0 tim=8068132686043 hv=1907384048 ad='176ffabd68' sqlid='a01hp0psv0rrh'
alter database open

*** 2016-12-16 12:18:49.122
WAIT #18446744071440810672: nam='rdbms ipc reply' ela= 2010056 from_process=23 timeout=2 p3=0 obj#=-1 tim=8069051099141
*** 2016-12-16 12:18:51.038--之前都是空闲等待，从这个时间点开始，当前进程等待超时，发送超时消息给dbw0进程。而这个时间点18分。
WAIT #18446744071440810672: nam='ksdxexeotherwait' ela= 1915544 p1=0 p2=0 p3=0 obj#=-1 tim=8069053014980
WAIT #18446744071440810672: nam='control file sequential read' ela= 818 file#=0 block#=1 blocks=1 obj#=-1 tim=8069053016115
WAIT #18446744071440810672: nam='control file sequential read' ela= 496 file#=0 block#=39 blocks=1 obj#=-1 tim=8069053016665
WAIT #18446744071440810672: nam='control file sequential read' ela= 393 file#=0 block#=41 blocks=1 obj#=-1 tim=8069053017100
WAIT #18446744071440810672: nam='control file sequential read' ela= 8212 file#=0 block#=2300 blocks=1 obj#=-1 tim=8069053025356
WAIT #18446744071440810672: nam='CSS initialization' ela= 20622 p1=0 p2=0 p3=0 obj#=-1 tim=8069053046642
WAIT #18446744071440810672: nam='CSS operation: action' ela= 2656 function_id=65 p2=0 p3=0 obj#=-1 tim=8069053049357

从dbw0进程的跟踪日志来看，发现进程等待超时，然后发送超时消息给dbw0进程。从12:03分发起open操作，到12:18分出现超时消息，（18-3）*60ms = 900ms，即900秒后，后台发送超时消息给dbw0进程，而这个时间点正是后台报错而启动终止的时间。

这个信息给出了很好的分析线索，随即继续分析跟踪文件，详细分析每一次I/O的状态：

waited for 'Disk file operations I/O', seq_num: 617
p1: 'FileOperation'=0x2
p2: 'fileno'=0x1a9
p3: 'filetype'=0x2
time_waited: 0.696732 sec (sample interval: 0 sec)
[1 sample, 07:57:37]
waited for 'Disk file operations I/O', seq_num: 614
p1: 'FileOperation'=0x2
p2: 'fileno'=0x1a7
p3: 'filetype'=0x2
time_waited: 0.699569 sec (sample interval: 0 sec)
[1 sample, 07:57:36]
waited for 'Disk file operations I/O', seq_num: 613
p1: 'FileOperation'=0x2
p2: 'fileno'=0x1a6
p3: 'filetype'=0x2
time_waited: 0.682626 sec (sample interval: 0 sec)
[1 sample, 07:57:35]
waited for 'Disk file operations I/O', seq_num: 611
p1: 'FileOperation'=0x2
p2: 'fileno'=0x1a4
p3: 'filetype'=0x2
time_waited: 0.695486 sec (sample interval: 0 sec)
[1 sample, 07:57:34]
waited for 'Disk file operations I/O', seq_num: 610
p1: 'FileOperation'=0x2
p2: 'fileno'=0x1a3
p3: 'filetype'=0x2
time_waited: 0.700244 sec (sample interval: 0 sec)
[1 sample, 07:57:33]
waited for 'Disk file operations I/O', seq_num: 608
p1: 'FileOperation'=0x2
p2: 'fileno'=0x1a1
p3: 'filetype'=0x2
time_waited: 0.682131 sec (sample interval: 0 sec)
---------------------------------------------------
Sampled Session History Summary:
longest_non_idle_wait: 'Disk file operations I/O'
[1 sample, 07:57:51 ]
time_waited: 0.910736 sec (sample interval: 0 sec)

注意以上日志，可以发现平均打开一个数据文件的Disk file operations I/O操作大概需要0.7s左右，该库共有 1560个数据文件，照此计算，打开所有数据文件大约需要1092秒，这个时间超出了之前得出的从发起open到报错终止的900秒的。

那么我们接下来要想办法，让数据库在报错之前等待更久。