
(Early Download) 19c New Features: Voting Disk Management

March 25, 2019
李敏

Download link: 《云和恩墨技术通讯》 (March issue)


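As is well known, RAC is the only way for Oracle to achieve true active-active within a single data center. But if the entire data center fails (a power outage where nobody remembered to refuel the diesel generator, an earthquake, a fire, a flood, and so on), the active-active setup is gone as well. To address this, Oracle extended the scope of RAC. The idea is to spread the RAC nodes across data centers in different physical locations, so that when one data center goes offline entirely, the nodes in the other data centers take over the work the RAC was responsible for. The vision is attractive, and the principle is sound.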


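Achieving this, however, hinges on two key points. The one that needs the most consideration is the link quality between the data centers (the storage links and the stretched Layer 2 network), which is beyond the scope of this article. The second is third-site voting. In an active-active deployment, besides the two business data centers, a third arbiter that belongs to neither of them is needed so that the surviving nodes (or site) are elected correctly in extreme situations.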


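In active-active projects there are broadly two approaches. I once made a slide deck about this, but I can no longer find it. One approach is implemented by the storage, the other at the host layer. At the storage layer there are many options, such as EMC VPLEX Metro and HDS GAD. These are very friendly to DBAs, because the active-active setup and third-site voting are not planned or controlled by the DBA. The second approach is host-layer active-active (based on ASM). Here, apart from the links, which the DBA cannot control, everything else, such as failure groups and the third voting disk, requires the DBA's involvement.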


This article covers only one part of that, the third voting disk, and explains how third-site voting is managed in a 19c extended RAC.

Environment: a 4-node RAC.


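For the third-site voting disk, a well-funded shop can put a dedicated storage array in the third data center. A typical shop puts an NFS server in the third data center, mounts the NFS export on all RAC nodes, and creates a file with dd to serve as the third-site quorum disk. Besides NFS, you can also use an iSCSI server at the third site and map a disk from it to all RAC nodes. Personally I prefer iSCSI: with NFS the quorum disk is just a file, and given the risks around NFS mounts, accidental deletion, and so on, iSCSI is the better choice. NFS is currently the more widely used of the two.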


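Following the Oracle white paper "Oracle Clusterware 11g Release 2 (11.2) – Using standard NFS to support a third voting file for extended cluster configurations", configure the NFS server and mount the file system on every RAC node. Then create a file with dd to serve as the third-site voting disk. Note that ASM uses the file created by dd here as a member disk. Keep in mind the difference between a voting disk and a voting file in this context.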


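In the initial installation of my environment I did not configure NFS. Using Normal redundancy, I picked 3 disks from Site 1 and Site 2 and created a disk group named OCRC1 to hold the OCR and the voting files. The result looks like this: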



[grid@extended01 ~]$ ocrcheck

Status of Oracle Cluster Registry is as follows :

         Version                  :          4

         Total space (kbytes)     :     491684

         Used space (kbytes)      :      84756

         Available space (kbytes) :     406928

         ID                       : 1234425375

         Device/File Name         :     +OCRC1

                                    Device/File integrity check succeeded

 

                                    Device/File not configured

 

                                    Device/File not configured

 

                                    Device/File not configured

 

                                    Device/File not configured

 

         Cluster registry integrity check succeeded

 

         Logical corruption check bypassed due to non-privileged user

 

[grid@extended01 ~]$


[grid@extended01 vote3nd]$ crsctl query css votedisk

##  STATE    File Universal Id                File Name Disk group

--  -----    -----------------                --------- ---------

1. ONLINE   0f6da8ca07774f84bf4f3b118a754d52 (/dev/mapper/s1ocr01) [OCRC1]

2. ONLINE   3af408a91eb74fcbbf83a4a38b77d064 (/dev/mapper/s2ocr07) [OCRC1]

3. ONLINE   3921fae437694f05bf27f01e5adbce99 (/dev/mapper/s1ocr04) [OCRC1]

 


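Note that all three voting disks are LUNs on the storage arrays. s1 and s2 are the two sites, and two of the disks come from s1. This defeats the purpose of the active-active design: if both voting disks at s1 are lost, the entire RAC cluster fails. So the voting disks must be placed one each at s1, s2, and the third site. Let's do that next.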
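To make the imbalance concrete, the check can be scripted. The sketch below is only an illustration: it parses text in the format printed by crsctl query css votedisk above and counts voting disks per site. It assumes, as in this environment, that the site name is the s1/s2 prefix of the device name under /dev/mapper. The function names and the "other" label are hypothetical.

```python
import re
from collections import Counter

SAMPLE = """\
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
1. ONLINE   0f6da8ca07774f84bf4f3b118a754d52 (/dev/mapper/s1ocr01) [OCRC1]
2. ONLINE   3af408a91eb74fcbbf83a4a38b77d064 (/dev/mapper/s2ocr07) [OCRC1]
3. ONLINE   3921fae437694f05bf27f01e5adbce99 (/dev/mapper/s1ocr04) [OCRC1]
"""

# One data row: index, state, file universal id, (path), [disk group]
ROW = re.compile(r"^\s*\d+\.\s+(\w+)\s+([0-9a-f]{32})\s+\((\S+)\)\s+\[(\w*)\]")

def parse_votedisks(text):
    """Return a list of (state, fuid, path, diskgroup) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append(m.groups())
    return rows

def site_of(path):
    """Assumption: device names start with the site name, e.g. s1ocr01 -> s1."""
    m = re.match(r"(s\d+)", path.rsplit("/", 1)[-1])
    return m.group(1) if m else "other"

def disks_per_site(text):
    return Counter(site_of(path) for _, _, path, _ in parse_votedisks(text))

counts = disks_per_site(SAMPLE)
print(dict(counts))

# A site that holds a majority of the voting disks takes the cluster down with it.
majority = len(parse_votedisks(SAMPLE)) // 2 + 1
risky = [site for site, n in counts.items() if n >= majority]
print("sites holding a majority:", risky)
```

For the layout above, s1 holds two of the three disks and is therefore reported as holding a majority.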


First, on the NFS mount, use dd to create a 500 MB file to serve as the voting disk:


dd if=/dev/zero of=/vote3nd/vote_nfs_3 bs=1M count=500

Then I tried adding it to the cluster. I expected this to succeed, but it did not.


[root@extended01 ~]# /u01/app/19.2.0.1/grid/bin/crsctl add css votedisk /vote3nd/vote_nfs_3

 

CRS-4258: Addition and deletion of voting files are not allowed because the voting files are on ASM


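The message says that when the voting files are on ASM, they cannot be added with add. According to Oracle's documentation, the redundancy is now managed by ASM. Deletion is not allowed either.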



[grid@extended01 ~]$ crsctl delete css votedisk 0f6da8ca07774f84bf4f3b118a754d52

CRS-4258: Addition and deletion of voting files are not allowed because the voting files are on ASM


Next, I tried replace to move all the voting files from ASM to the shared file system. Evidently, that did not work either.



[grid@extended01 vote3nd]$ /u01/app/19.2.0.1/grid/bin/crsctl replace votedisk  /vote3nd/vote_nfs_3

 

Now formatting voting disk: /vote3nd/vote_nfs_3.

CRS-4601: Failed to initialize voting file /vote3nd/vote_nfs_3.

CRS-4628: Change to configuration failed, but was successfully rolled back.

CRS-4000: Command Replace failed, or completed with errors.


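So, to use a file created with dd as a disk, the file has to be added to an ASM disk group. Note that the ASM disk string must be modified first, otherwise the add fails.

I had a misconception here: I assumed that, as when placing the OCR on a CFS, the file had to be created in advance with dd or touch. In fact, placing a voting file on a CFS does not require creating the file beforehand.

Below is the error reported before the disk string was modified. After the change, the file can be added to the ASM disk group.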



SQL> alter diskgroup OCRC1 add quorum failgroup FGQ DISK '/vote3nd/vote_nfs_3';

alter diskgroup OCRC1 add quorum failgroup FGQ DISK '/vote3nd/vote_nfs_3'

*

ERROR at line 1:

ORA-15032: not all alterations performed

ORA-15031: disk specification '/vote3nd/vote_nfs_3' matches no disks

ORA-15014: path '/vote3nd/vote_nfs_3' is not in the discovery set
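ORA-15014 means the path is not covered by the ASM discovery string (the asm_diskstring parameter), which is a comma-separated list of path patterns. The following sketch illustrates the idea only. It approximates the matching with Python's fnmatch, and both discovery strings in it are hypothetical values, not the ones used in this environment. ASM's real matching rules depend on the operating system.

```python
from fnmatch import fnmatch

def in_discovery_set(path, diskstring):
    """Approximate check: does `path` match any pattern in a comma-separated
    discovery string? Real ASM matching is OS dependent; fnmatch is a stand-in."""
    patterns = [p.strip().strip("'") for p in diskstring.split(",") if p.strip()]
    return any(fnmatch(path, p) for p in patterns)

# Hypothetical discovery strings, for illustration only.
before = "/dev/mapper/*"
after = "/dev/mapper/*,/vote3nd/vote_nfs_*"

nfs_file = "/vote3nd/vote_nfs_3"
print(in_discovery_set(nfs_file, before))  # the ORA-15014 situation
print(in_discovery_set(nfs_file, after))   # the file is now discoverable
```

With the first string the NFS file is outside the discovery set; once a pattern covering /vote3nd is appended, the same path matches.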


After adding this disk and a bit of tidying up, the new voting disk layout is as follows:



[grid@extended01 ~]$ crsctl query css votedisk

##  STATE    File Universal Id                File Name Disk group

--  -----    -----------------                --------- ---------

1. ONLINE   0f6da8ca07774f84bf4f3b118a754d52 (/dev/mapper/s1ocr01) [OCRC1]

2. ONLINE   ded2df4f62df4f91bf9a5028ecaa4c5b (/vote3nd/vote_nfs_3) [OCRC1]

3. ONLINE   8917739853ce4fa3bf30d120235d0131 (/dev/mapper/s2ocr05) [OCRC1]

Located 3 voting disk(s).


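This meets the basic three-site requirement.

By design, a RAC with 3 voting disks tolerates the loss of one of them without affecting cluster stability. Here I shut down the NFS server to observe how GI behaves in this situation.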



[root@target ~]# service nfs stop

Shutting down NFS daemon:                                  [  OK  ]

Shutting down NFS mountd:                                  [  OK  ]

Shutting down NFS quotas:                                  [  OK  ]

Shutting down NFS services:                                [  OK  ]

[root@target ~]#


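Another major reason I dislike NFS is that, with NFS stopped, commands such as df -h hang on the other hosts. umount /vote3nd hangs as well (the mount point has to be force-unmounted before the host will respond to a reboot command; otherwise it stays hung).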



Figure: the command line hangs.

Querying the voting disks at this point gives the following: one voting disk is offline, and the cluster status is normal.



[grid@extended01 ~]$ crsctl query css votedisk  

##  STATE    File Universal Id                File Name Disk group

--  -----    -----------------                --------- ---------

1. ONLINE   0f6da8ca07774f84bf4f3b118a754d52 (/dev/mapper/s1ocr01) [OCRC1]

2. OFFLINE  ded2df4f62df4f91bf9a5028ecaa4c5b (/vote3nd/vote_nfs_3) [OCRC1]

3. ONLINE   8917739853ce4fa3bf30d120235d0131 (/dev/mapper/s2ocr05) [OCRC1]

Located 3 voting disk(s).


The cluster log contains the following messages:



 

2019-03-18 12:02:37.032 [OCSSD(6654)]CRS-1615: No I/O has completed after 50% of the maximum interval. If this persists, voting file /vote3nd/vote_nfs_3 will be considered not functional in 99940 milliseconds.

 

2019-03-18 12:03:27.038 [OCSSD(6654)]CRS-1614: No I/O has completed after 75% of the maximum interval. If this persists, voting file /vote3nd/vote_nfs_3 will be considered not functional in 49930 milliseconds.

 

2019-03-18 12:03:57.051 [OCSSD(6654)]CRS-1613: No I/O has completed after 90% of the maximum interval. If this persists, voting file /vote3nd/vote_nfs_3 will be considered not functional in 19920 milliseconds.

 

2019-03-18 12:04:17.079 [OCSSD(6654)]CRS-1604: CSSD voting file is offline: /vote3nd/vote_nfs_3; details at (:CSSNM00058:) in /u01/app/grid/diag/crs/extended01/crs/trace/ocssd.trc.

2019-03-18 12:04:17.080 [OCSSD(6654)]CRS-1672: The number of voting files currently available 2 has fallen to the minimum number of voting files required 2. Further reduction in voting files will result in eviction and loss of functionality
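The countdown in the CRS-1615, CRS-1614 and CRS-1613 messages can be checked with simple arithmetic. The sketch below assumes a maximum interval of 200 seconds. That value is inferred here from the log itself (about 100 seconds remain at the 50% mark); it is not read from the cluster configuration.

```python
from datetime import datetime

MAX_INTERVAL_MS = 200_000  # assumption inferred from the log: 50% elapsed, ~100 s left

def remaining_ms(percent_elapsed, max_interval_ms=MAX_INTERVAL_MS):
    """Time left before the voting file is declared not functional."""
    return max_interval_ms * (100 - percent_elapsed) // 100

# percent elapsed -> remaining milliseconds reported in the log above
logged = {50: 99940, 75: 49930, 90: 19920}
for pct, reported in logged.items():
    print(pct, remaining_ms(pct), reported)

# Timestamps of the 50% warning (CRS-1615) and of the offline message (CRS-1604).
fmt = "%Y-%m-%d %H:%M:%S.%f"
warn_50 = datetime.strptime("2019-03-18 12:02:37.032", fmt)
offline = datetime.strptime("2019-03-18 12:04:17.079", fmt)
print((offline - warn_50).total_seconds())
```

The computed values (100000, 50000 and 20000 ms) are within 100 ms of the logged ones, and the offline message arrives about 100 seconds after the 50% warning, which is consistent with the assumed interval.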


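The log also shows that only 2 voting disks are available now, and that 2 is the minimum needed to keep the cluster running. If one more voting disk is lost at this point, the whole cluster goes down.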
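The "minimum number of voting files required 2" in CRS-1672 matches a strict-majority rule: a node must be able to access more than half of the configured voting files. A minimal sketch of that arithmetic follows (the function names are illustrative):

```python
def min_required(total_voting_files):
    """Smallest count that is still a strict majority of the configured files."""
    return total_voting_files // 2 + 1

def tolerated_failures(total_voting_files):
    """How many voting files can be lost while a majority remains."""
    return total_voting_files - min_required(total_voting_files)

for n in (1, 3, 5):
    print(n, min_required(n), tolerated_failures(n))
```

With 3 voting files, 2 are required and 1 failure is tolerated, which is exactly the state the log reports after the NFS file went offline.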

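After the NFS server is started again, the third voting disk sometimes does not come back online by itself. In that case you have to run umount -l on the nodes and then mount the NFS directory again before the third voting disk comes back online.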



2019-03-18 12:13:00.052 [OCSSD(6654)]CRS-1605: CSSD voting file is online: /vote3nd/vote_nfs_3; details in /u01/app/grid/diag/crs/extended01/crs/trace/ocssd.trc.


[grid@extended04 ~]$  crsctl query css votedisk

##  STATE    File Universal Id                File Name Disk group

--  -----    -----------------                --------- ---------

1. ONLINE   0f6da8ca07774f84bf4f3b118a754d52 (/dev/mapper/s1ocr01) [OCRC1]

2. ONLINE   ded2df4f62df4f91bf9a5028ecaa4c5b (/vote3nd/vote_nfs_3) [OCRC1]

3. ONLINE   8917739853ce4fa3bf30d120235d0131 (/dev/mapper/s2ocr05) [OCRC1]

Located 3 voting disk(s).


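That covers most of the routine operations. But what if you need to move all the voting disks from ASM to a cluster file system, or from a cluster file system to ASM? As the earlier test showed, once voting disks are on ASM, you cannot use add to put a voting file on a file system.

As mentioned earlier, I had a misconception here: I thought the file had to be prepared in advance with dd. It does not. Just specify a name and run replace. CSS formats and creates the file itself during the replace.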



[grid@extended04 vote3nd]$ ls -lrt

total 512020

drwx------ 2 grid oinstall     16384 Mar 16 12:33 lost+found

-rw-rw---- 1 grid oinstall 524288000 Mar 18 13:14 vote_nfs_3

[grid@extended04 vote3nd]$ crsctl query css votedisk

##  STATE    File Universal Id                File Name Disk group

--  -----    -----------------                --------- ---------

1. ONLINE   719a3e0d02cc4f9ebf15b3fb93eefae6 (/dev/mapper/s1ocr01) [OCRC1]

2. ONLINE   6eb9cd75dd3f4fb0bf10bc5b1a0622bc (/dev/mapper/s2ocr05) [OCRC1]

3. ONLINE   d313f9f412954f0dbf10b867f99004a9 (/vote3nd/vote_nfs_3) [OCRC1]

Located 3 voting disk(s).

[grid@extended04 vote3nd]$ crsctl replace votedisk /vote3nd/vote_nfs_1

Now formatting voting disk: /vote3nd/vote_nfs_1.

CRS-4256: Updating the profile

Successful addition of voting disk 43d23d3dccd44f96bfc9bf1af3e11491.

Successful deletion of voting disk 719a3e0d02cc4f9ebf15b3fb93eefae6.

Successful deletion of voting disk 6eb9cd75dd3f4fb0bf10bc5b1a0622bc.

Successful deletion of voting disk d313f9f412954f0dbf10b867f99004a9.

CRS-4256: Updating the profile

CRS-4266: Voting file(s) successfully replaced

[grid@extended04 vote3nd]$ ls -lrt

total 532500

drwx------ 2 grid oinstall     16384 Mar 16 12:33 lost+found

-rw-rw---- 1 grid oinstall 524288000 Mar 18 13:15 vote_nfs_3

-rw-r----- 1 grid oinstall  20972032 Mar 18 13:15 vote_nfs_1

[grid@extended04 vote3nd]$ crsctl query css votedisk

##  STATE    File Universal Id                File Name Disk group

--  -----    -----------------                --------- ---------

1. ONLINE   43d23d3dccd44f96bfc9bf1af3e11491 (/vote3nd/vote_nfs_1) []

Located 1 voting disk(s).

[grid@extended04 vote3nd]$ crsctl replace votedisk +OCRC1

CRS-4256: Updating the profile

Successful addition of voting disk 82cf4c6cbf354fd5bf07327a6744ec04.

Successful addition of voting disk a50370fa387d4fcabf6e89e2b1e652e4.

Successful addition of voting disk f9ee438278394f7bbf7e504943990bd5.

Successful deletion of voting disk 43d23d3dccd44f96bfc9bf1af3e11491.

Successfully replaced voting disk group with +OCRC1.

CRS-4256: Updating the profile

CRS-4266: Voting file(s) successfully replaced


Here vote_nfs_3 is an ASM disk, that is, a voting disk, while vote_nfs_1 is a voting file. kfed shows the difference.

vote_nfs_3:


[grid@extended04 vote3nd]$ kfed read /vote3nd/vote_nfs_3

kfbh.endian:                          1 ; 0x000: 0x01

kfbh.hard:                          130 ; 0x001: 0x82

kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD

kfbh.datfmt:                          1 ; 0x003: 0x01

kfbh.block.blk:                       0 ; 0x004: blk=0

kfbh.block.obj:              2147483657 ; 0x008: disk=9

kfbh.check:                  1521144041 ; 0x00c: 0x5aaad0e9

kfbh.fcn.base:                    20771 ; 0x010: 0x00005123

kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000

kfbh.spare1:                          0 ; 0x018: 0x00000000

kfbh.spare2:                          0 ; 0x01c: 0x00000000

kfdhdb.driver.provstr:         ORCLDISK ; 0x000: length=8

kfdhdb.driver.reserved[0]:            0 ; 0x008: 0x00000000

kfdhdb.driver.reserved[1]:            0 ; 0x00c: 0x00000000

kfdhdb.driver.reserved[2]:            0 ; 0x010: 0x00000000

kfdhdb.driver.reserved[3]:            0 ; 0x014: 0x00000000

kfdhdb.driver.reserved[4]:            0 ; 0x018: 0x00000000

kfdhdb.driver.reserved[5]:            0 ; 0x01c: 0x00000000

kfdhdb.compat:                318767104 ; 0x020: 0x13000000

kfdhdb.dsknum:                        9 ; 0x024: 0x0009

kfdhdb.grptyp:                        2 ; 0x026: KFDGTP_NORMAL

kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER

vote_nfs_1:


[grid@extended04 vote3nd]$ kfed read /vote3nd/vote_nfs_1

kfbh.endian:                          0 ; 0x000: 0x00

kfbh.hard:                           34 ; 0x001: 0x22

kfbh.type:                            0 ; 0x002: KFBTYP_INVALID

kfbh.datfmt:                          0 ; 0x003: 0x00

kfbh.block.blk:              4290772992 ; 0x004: blk=2143289344 (indirect)

kfbh.block.obj:                       0 ; 0x008: file=0

kfbh.check:                           0 ; 0x00c: 0x00000000

kfbh.fcn.base:                        0 ; 0x010: 0x00000000

kfbh.fcn.wrap:                      512 ; 0x014: 0x00000200

kfbh.spare1:                      40960 ; 0x018: 0x0000a000

kfbh.spare2:                 2054913149 ; 0x01c: 0x7a7b7c7d
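The difference can also be checked programmatically. The following sketch parses text in the kfed read format shown above and reports whether the header looks like an ASM member disk. It relies only on the kfbh.type and kfdhdb.hdrsts fields visible in the two dumps, and the function names are hypothetical.

```python
import re

# Line layout: field: value ; offset: decoded
FIELD = re.compile(r"^(\S+):\s+(\S+)\s*;\s*0x[0-9a-f]+:\s*(.*)$")

def parse_kfed(text):
    """Map field name -> decoded text (the part after the offset)."""
    fields = {}
    for line in text.splitlines():
        m = FIELD.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(3).strip()
    return fields

def is_asm_member(text):
    """True when the block is an ASM disk header whose status is MEMBER."""
    f = parse_kfed(text)
    return (f.get("kfbh.type") == "KFBTYP_DISKHEAD"
            and f.get("kfdhdb.hdrsts") == "KFDHDR_MEMBER")

# Excerpts copied from the two dumps above.
VOTE_NFS_3 = """\
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfdhdb.grptyp:                        2 ; 0x026: KFDGTP_NORMAL
kfdhdb.hdrsts:                        3 ; 0x027: KFDHDR_MEMBER
"""
VOTE_NFS_1 = """\
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.block.obj:                       0 ; 0x008: file=0
"""

print(is_asm_member(VOTE_NFS_3))
print(is_asm_member(VOTE_NFS_1))
```

The excerpt from vote_nfs_3 is classified as an ASM member disk, while the one from vote_nfs_1, which kfed cannot interpret as an ASM header, is not.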