GaussDB T distributed cluster fault recovery case: CN isolation recovery
Original article on modb.pro: https://www.modb.pro/db/22373
Background:
In a virtual machine environment, a 4-node GaussDB T 1.0.1 distributed cluster was being prepared for an upgrade to 1.0.2. While configuring python3 for the upgrade, the /usr/bin/ directory on one host was deleted by mistake, leaving that entire host in an abnormal state.
After the /usr/bin directory was restored, the cluster components on that host were still abnormal: CM, ETCD, and DN showed OFFLINE, and the CN showed DELETED.
How to recover the deleted /usr/bin/ directory is not covered in detail here. The rough flow: create a new virtual machine (taking care that its VG name differs from the old host's), detach the faulty host's disk and attach it to the new VM, copy a good /usr/bin/ onto the disk, then detach it and attach it back to the original faulty host.
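The disk-swap flow above can be sketched as a runbook. All device and mount-point names here are hypothetical examples (not taken from this cluster), and the script only prints each command (dry run) rather than executing it:

```shell
#!/bin/sh
# Dry-run sketch of the /usr/bin rescue flow described above; device and
# mount-point names are hypothetical examples, not taken from this cluster.
RUN="echo"            # dry run: print each command; set RUN="" to execute

FAULTY_DISK=/dev/sdb  # the faulty host's disk, attached to the rescue VM
MNT=/mnt/rescue       # temporary mount point on the rescue VM

# 1. On a freshly built rescue VM (whose VG name differs from the old host's),
#    mount the faulty host's disk.
$RUN mount "${FAULTY_DISK}1" "$MNT"
# 2. Copy a known-good /usr/bin across, preserving ownership and permissions.
$RUN cp -a /usr/bin/. "$MNT/usr/bin/"
# 3. Unmount, detach the disk, and attach it back to the original host.
$RUN umount "$MNT"
```

Keeping the VG name different matters because attaching a disk whose volume group has the same name as the rescue VM's own VG would cause an LVM naming conflict.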
Notes on CN isolation recovery:
● A faulty CN that has been isolated and then replaced due to physical damage cannot be recovered this way.
● DDL operations are not allowed while the faulty CN is being recovered.
● If the cluster uses a load-balancing component, remove the faulty CN from its traffic distribution first, and add the CN back to the load balancer after recovery completes.
● If a CN node in the cluster stays in the route_conflict state, and the SYS_DATA_NODES system table on some CN other than the faulty one, or on a primary DN, does not contain the faulty CN, the manual recovery procedure also applies.
The recovery procedure is as follows.
After the /usr/bin directory was restored, check the cluster status: CM, ETCD, and DN on the faulty host are OFFLINE, and the CN is DELETED.
[omm@gsdb11 ~]$ gs_om -t status
Set output to terminal.
2020-03-09 11:22:23.772 [error] instance (AZ1/gsdb11/ETCD1): get error(etcdserver: key is not provided) when get status, offline
2020-03-09 11:22:23.780 [error] instance (AZ1/gsdb11/CM1): get error(etcdserver: key is not provided) when get status, offline
------------------------------ Cluster Status ------------------------------
az_state : single_az
cluster_state : Degraded
balanced : false
--------------------------------- AZ Status --------------------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
-------------------------------- Host Status -------------------------------
HOST:gsdb11 AZ:AZ1 STATUS:ONLINE IP:192.168.179.126
HOST:gsdb12 AZ:AZ1 STATUS:ONLINE IP:192.168.179.127
HOST:gsdb13 AZ:AZ1 STATUS:ONLINE IP:192.168.179.128
HOST:gsdb14 AZ:AZ1 STATUS:ONLINE IP:192.168.179.129
-------------------------- Cluster Manager Status --------------------------
INSTANCE:CM1 ROLE:slave STATUS:OFFLINE HOST:gsdb11 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:gsdb12 ID:602
INSTANCE:CM3 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:603
INSTANCE:CM4 ROLE:slave STATUS:ONLINE HOST:gsdb14 ID:604
-------------------------------- ETCD Status -------------------------------
INSTANCE:ETCD1 ROLE:backup STATUS:OFFLINE HOST:gsdb11 ID:701 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:ETCD2 ROLE:follower STATUS:ONLINE HOST:gsdb12 ID:702 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:ETCD3 ROLE:leader STATUS:ONLINE HOST:gsdb13 ID:703 PORT:2379 DataDir:/u01/gaussdb/data/etcd
--------------------------------- CN Status --------------------------------
INSTANCE:cn_401 ROLE:no role STATUS:DELETED HOST:gsdb11 ID:401 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:gsdb12 ID:402 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_403 ROLE:no role STATUS:ONLINE HOST:gsdb13 ID:403 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_404 ROLE:no role STATUS:ONLINE HOST:gsdb14 ID:404 PORT:8000 DataDir:/u01/gaussdb/data/cn
---------------------- Instances Status in Group (group_1) -----------------
INSTANCE:DB1_1 ROLE:standby STATUS:OFFLINE HOST:gsdb11 ID:1 PORT:40000 DataDir:/u01/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:2 PORT:40021 DataDir:/u01/gaussdb/data/dn1
---------------------- Instances Status in Group (group_2) -----------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:3 PORT:40000 DataDir:/u01/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:gsdb13 ID:4 PORT:40021 DataDir:/u01/gaussdb/data/dn2
---------------------- Instances Status in Group (group_3) -----------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:5 PORT:40000 DataDir:/u01/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:gsdb14 ID:6 PORT:40021 DataDir:/u01/gaussdb/data/dn3
---------------------- Instances Status in Group (group_4) -----------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:gsdb11 ID:8 PORT:40021 DataDir:/u01/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:gsdb14 ID:7 PORT:40000 DataDir:/u01/gaussdb/data/dn4
--------------------------------- Manage IP --------------------------------
HOST:gsdb11 IP:192.168.179.126
HOST:gsdb12 IP:192.168.179.127
HOST:gsdb13 IP:192.168.179.128
HOST:gsdb14 IP:192.168.179.129
------------------------------ Query Action Info ---------------------------
HOSTNAME: gsdb11 TIME: 2020-03-09 11:22:23.862783
--------------------------------- Float Ip ---------------------------------
HOST:gsdb14 DB4_7:192.168.179.129 IP:
HOST:gsdb13 DB3_5:192.168.179.128 IP:
HOST:gsdb12 DB2_3:192.168.179.127 IP:
HOST:gsdb12 DB1_2:192.168.179.127 IP:
[omm@gsdb11 ~]$
[omm@gsdb11 ~]$
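Output this wide is hard to scan, so a small filter that keeps only instances that are not ONLINE can help. A sketch; the INSTANCE:/STATUS: field layout is copied from the gs_om output above, and in real use you would pipe the live `gs_om -t status` output into the filter instead of the embedded sample:

```shell
#!/bin/sh
# Show only instances that are not ONLINE in gs_om -t status output.
# The sample below reproduces a few lines of the output above; in practice:
#   gs_om -t status | grep 'INSTANCE:' | grep -v 'STATUS:ONLINE'
sample='INSTANCE:CM1 ROLE:slave STATUS:OFFLINE HOST:gsdb11 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:gsdb12 ID:602
INSTANCE:ETCD1 ROLE:backup STATUS:OFFLINE HOST:gsdb11 ID:701 PORT:2379
INSTANCE:cn_401 ROLE:no role STATUS:DELETED HOST:gsdb11 ID:401 PORT:8000'

printf '%s\n' "$sample" | grep 'INSTANCE:' | grep -v 'STATUS:ONLINE'
```

Against the status output above this leaves exactly the faulty host's instances: CM1 and ETCD1 (OFFLINE), cn_401 (DELETED), and DB1_1 (OFFLINE).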
CN isolation recovery
Note: the recovery command must not be run from inside the data directory of the CN being recovered.
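That restriction can be enforced with a small guard before invoking the command. The CN data directory below matches the DataDir shown in the status output; the guard itself is only a sketch, and the final echo stands in for the real gs_om call:

```shell
#!/bin/sh
# Refuse to proceed if the current directory is the CN data directory
# (or anywhere under it); gs_om -t recoverycn must not run from there.
CN_DATADIR=/u01/gaussdb/data/cn

case "$(pwd)/" in
  "$CN_DATADIR"/*)
    echo "refusing to run: $(pwd) is inside $CN_DATADIR" >&2
    exit 1
    ;;
esac
echo "OK to run: gs_om -t recoverycn"  # replace this echo with the real command
```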
[omm@gsdb11 ~]$
[omm@gsdb11 ~]$ gs_om -t recoverycn
Start to recovery cn.
Get deleted cn.
Check deleted cn datadir and backup dir.
Successfully check deleted cn datadir and backup dir.
Close cm update node route.
Close the CM heartbeat about CN.
Add node info of deleted CNs on other instances.
Successfully add node info of deleted CNs on other instances.
Restore deleted CNs.
Successfully restore deleted CNs.
Handle the pending dist trans of deleted CNs.
..................18s
Completed to handle the pending dist trans of deleted CNs.
Backup the deleted CNs.
..........158s
Successfully backup the deleted CNs.
Reconstruct deleted CNs.
.........96s
Successfully reconstruct deleted CNs.
Export metadata from original instance.
Export metadata from original instance cn_404.
.Successfully export metadata from original instance.
Import metadata into deleted CNs.
Successfully import metadata into deleted CNs.
Start deleted CNs.
Successfully start deleted CNs.
Open the CM heartbeat about CN.
Close cm update node route.
Successfully recovery cn.
[omm@gsdb11 ~]$
[omm@gsdb11 ~]$
After recovery, the cluster status is normal.
[omm@gsdb11 ~]$
[omm@gsdb11 ~]$ gs_om -t status
Set output to terminal.
------------------------------ Cluster Status ------------------------------
az_state : single_az
cluster_state : Normal
balanced : false
--------------------------------- AZ Status --------------------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
-------------------------------- Host Status -------------------------------
HOST:gsdb11 AZ:AZ1 STATUS:ONLINE IP:192.168.179.126
HOST:gsdb12 AZ:AZ1 STATUS:ONLINE IP:192.168.179.127
HOST:gsdb13 AZ:AZ1 STATUS:ONLINE IP:192.168.179.128
HOST:gsdb14 AZ:AZ1 STATUS:ONLINE IP:192.168.179.129
-------------------------- Cluster Manager Status --------------------------
INSTANCE:CM1 ROLE:slave STATUS:ONLINE HOST:gsdb11 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:gsdb12 ID:602
INSTANCE:CM3 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:603
INSTANCE:CM4 ROLE:slave STATUS:ONLINE HOST:gsdb14 ID:604
-------------------------------- ETCD Status -------------------------------
INSTANCE:ETCD1 ROLE:follower STATUS:ONLINE HOST:gsdb11 ID:701 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:ETCD2 ROLE:follower STATUS:ONLINE HOST:gsdb12 ID:702 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:ETCD3 ROLE:leader STATUS:ONLINE HOST:gsdb13 ID:703 PORT:2379 DataDir:/u01/gaussdb/data/etcd
--------------------------------- CN Status --------------------------------
INSTANCE:cn_401 ROLE:no role STATUS:ONLINE HOST:gsdb11 ID:401 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:gsdb12 ID:402 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_403 ROLE:no role STATUS:ONLINE HOST:gsdb13 ID:403 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_404 ROLE:no role STATUS:ONLINE HOST:gsdb14 ID:404 PORT:8000 DataDir:/u01/gaussdb/data/cn
---------------------- Instances Status in Group (group_1) -----------------
INSTANCE:DB1_1 ROLE:standby STATUS:ONLINE HOST:gsdb11 ID:1 PORT:40000 DataDir:/u01/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:2 PORT:40021 DataDir:/u01/gaussdb/data/dn1
---------------------- Instances Status in Group (group_2) -----------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:3 PORT:40000 DataDir:/u01/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:gsdb13 ID:4 PORT:40021 DataDir:/u01/gaussdb/data/dn2
---------------------- Instances Status in Group (group_3) -----------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:5 PORT:40000 DataDir:/u01/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:gsdb14 ID:6 PORT:40021 DataDir:/u01/gaussdb/data/dn3
---------------------- Instances Status in Group (group_4) -----------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:gsdb11 ID:8 PORT:40021 DataDir:/u01/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:gsdb14 ID:7 PORT:40000 DataDir:/u01/gaussdb/data/dn4
--------------------------------- Manage IP --------------------------------
HOST:gsdb11 IP:192.168.179.126
HOST:gsdb12 IP:192.168.179.127
HOST:gsdb13 IP:192.168.179.128
HOST:gsdb14 IP:192.168.179.129
------------------------------ Query Action Info ---------------------------
HOSTNAME: gsdb11 TIME: 2020-03-09 11:36:15.686140
--------------------------------- Float Ip ---------------------------------
HOST:gsdb14 DB4_7:192.168.179.129 IP:
HOST:gsdb13 DB3_5:192.168.179.128 IP:
HOST:gsdb12 DB2_3:192.168.179.127 IP:
HOST:gsdb12 DB1_2:192.168.179.127 IP:
[omm@gsdb11 ~]$
An open question after recovery
After the CN isolation recovery, the omm user's database password on that CN turned out to have been reset to the default, gaussdb_123. Is this related to the CN isolation recovery? That had to be confirmed with Huawei.
As shown below, cn_401 was recovered from metadata exported from cn_404, so its omm password should in principle match cn_404's; instead it was reset to the default.
[omm@gsdb14 ~]$ zsql omm/yhadmin_123@192.168.179.129:8000 -q
connected.
SQL> select instance_name ,status from v$instance;
INSTANCE_NAME        STATUS
-------------------- --------------------
cn_404               OPEN

1 rows fetched.
SQL> exit
[omm@gsdb14 ~]$
After cn_401 was recovered, its omm password had reverted to the default.
[omm@gsdb11 ~]$ zsql omm/yhadmin_123@192.168.179.126:8000 -q
GS-00329, Incorrect user or password
[omm@gsdb11 ~]$
[omm@gsdb11 ~]$
[omm@gsdb11 ~]$ zsql omm/gaussdb_123@192.168.179.126:8000 -q
connected.
SQL> select instance_name,status from v$instance;
INSTANCE_NAME        STATUS
-------------------- --------------------
cn_401               OPEN

1 rows fetched.
SQL> exit
[omm@gsdb11 ~]$
Huawei's reply: this is a bug, and it has been fixed in version 1.0.2.