takashi kono's blog

コーヒーとキーボードと共に何かを記録していくブログ

CentOS 8 + MariaDB 10.5 の 双方向レプリケーション + 1 read only replica で 1 台 Main が落ちた時のシュミレーション その2

目次

Con1 の MAIN の mariadb プロセスが落ちた時の対応をやってみる

前回の記事

takashikono.hatenadiary.jp

やってみる

手順

一旦こんな感じでやってみる

  1. radius0 で mariadb の main pid を確認する
  2. radius0 で mariadbkill -9 する
  3. Connection Con1 の REPLICA を STOP する
  4. radius1 で INSERT して、データを更新する
  5. radius0 で mariadb の余計な子プロセスがあれば kill する
  6. 念の為 radius1 でデータのバックアップを行う
  7. radius0 で mariadb を起動する
  8. radius0 で Connection Con2 の REPLICA として復活させる
  9. エラーがでたら トラブルシュートする
  10. Connection Con1 の REPLICA を復活させる
  11. エラーがでたらトラブルシュートする

radius0 で mariadb の main pid を調べる

sudo systemctl status mariadb.service | grep -i "main pid"

log

[ope@radius0 ~]$ sudo systemctl status mariadb.service | grep -i "main pid"
 Main PID: 38262 (mariadbd)
[ope@radius0 ~]$

radius0 で mariadbkill -9 する

sudo kill -9 <PID>

sudo kill -9 38262

log

[ope@radius0 ~]$ sudo kill -9 38262
[ope@radius0 ~]$
[ope@radius0 ~]$ sudo systemctl status mariadb.service | grep -i "main pid"
 Main PID: 38262 (code=killed, signal=KILL); Control PID: 38816 ((sh))
[ope@radius0 ~]$
[ope@radius0 ~]$ sudo systemctl status mariadb.service
● mariadb.service - MariaDB 10.5.9 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor pre>
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: active (running) since Tue 2021-08-17 13:50:57 JST; 4s ago
     Docs: man:mariadbd(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 38867 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_ST>
  Process: 38816 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && V>
  Process: 38812 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_STA>
 Main PID: 38850 (mariadbd)
   Status: "Taking your SQL requests now..."
    Tasks: 18 (limit: 12415)
   Memory: 70.2M
   CGroup: /system.slice/mariadb.service
           └─38850 /usr/sbin/mariadbd

 817 13:50:57 radius0 mariadbd[38850]: 2021-08-17 13:50:57 0 [Note] Reading >
 817 13:50:57 radius0 mariadbd[38850]: 2021-08-17 13:50:57 0 [Note] Added ne>
 817 13:50:57 radius0 mariadbd[38850]: 2021-08-17 13:50:57 0 [Note] /usr/sbi>
 817 13:50:57 radius0 mariadbd[38850]: Version: '10.5.9-MariaDB-log'  socket>
 817 13:50:57 radius0 mariadbd[38850]: 2021-08-17 13:50:57 4 [Note] Master '>
 8月 17 13:50:57 radius0 systemd[1]: Started MariaDB 10.5.9 database server.
 8月 17 13:51:00 radius0 mariadbd[38850]: 2021-08-17 13:51:00 6 [Warning] IP ad>
 8月 17 13:51:00 radius0 mariadbd[38850]: 2021-08-17 13:51:00 7 [Warning] IP ad>
 8月 17 13:51:00 radius0 mariadbd[38850]: 2021-08-17 13:51:00 7 [Note] Start bi>
 8月 17 13:51:00 radius0 mariadbd[38850]: 2021-08-17 13:51:00 6 [Note] Start bi>
[ope@radius0 ~]$

なんだと?生き返った
自動起動設定のせいかも?

[ope@radius0 ~]$ sudo systemctl is-
is-active          is-enabled         is-failed          is-system-running
[ope@radius0 ~]$ sudo systemctl is-enabled mariadb.service
enabled
[ope@radius0 ~]$ sudo systemctl disable mariadb.service
Removed /etc/systemd/system/multi-user.target.wants/mariadb.service.
[ope@radius0 ~]$ sudo systemctl is-enabled mariadb.service
disabled
[ope@radius0 ~]$
[ope@radius0 ~]$ sudo systemctl status mariadb.service | grep -i "main pid"
 Main PID: 38850 (mariadbd)
[ope@radius0 ~]$
[ope@radius0 ~]$ sudo kill 38850
[ope@radius0 ~]$
[ope@radius0 ~]$
[ope@radius0 ~]$ sudo systemctl status mariadb.service
● mariadb.service - MariaDB 10.5.9 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor pr>
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: inactive (dead)
     Docs: man:mariadbd(8)
           https://mariadb.com/kb/en/library/systemd/

 817 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 4 [Note] Master '>
 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 4 [Note] Master '>
 817 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: >
 817 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: >
 817 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: >
 817 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: >
 817 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: >
 817 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: >
 817 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] /usr/sbi>
 817 13:57:58 radius0 systemd[1]: mariadb.service: Succeeded.
[ope@radius0 ~]$

止まった

Connection Con1 の REPLICA を STOP する

SHOW REPLICA 'Con1' STATUS \G

STOP REPLICA 'Con1' ;

SHOW REPLICA 'Con1' STATUS \G

radius1

root@radius1 [(none)] (13:59:29)
> SHOW REPLICA 'Con1' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State: Reconnecting after a failed master event read
                   Master_Host: 192.168.100.102
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius0-bin.000003
           Read_Master_Log_Pos: 344
                Relay_Log_File: radius1-relay-bin-con1.000003
                 Relay_Log_Pos: 645
         Relay_Master_Log_File: radius0-bin.000003
              Slave_IO_Running: Connecting
             Slave_SQL_Running: Yes
...
                 Last_IO_Errno: 2003
                 Last_IO_Error: error reconnecting to master 'replication_user@192.168.100.102:3306' - retry-time: 10  maximum-retries: 86400  message: Can't connect to server on '192.168.100.102' (111 "Connection refused")
...
root@radius1 [(none)] (13:59:38)
> STOP REPLICA 'Con1' ;
Query OK, 0 rows affected (0.132 sec)

root@radius1 [(none)] (14:00:29)
> SHOW REPLICA 'Con1' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: 192.168.100.102
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius0-bin.000003
           Read_Master_Log_Pos: 344
                Relay_Log_File: radius1-relay-bin-con1.000003
                 Relay_Log_Pos: 645
         Relay_Master_Log_File: radius0-bin.000003
              Slave_IO_Running: No
             Slave_SQL_Running: No
...
                 Last_IO_Errno: 2003
                 Last_IO_Error: error reconnecting to master 'replication_user@192.168.100.102:3306' - retry-time: 10  maximum-retries: 86400  message: Can't connect to server on '192.168.100.102' (111 "Connection refused")
...

slave-db0

root@slave-db0 [(none)] (14:01:30)
> SHOW REPLICA 'Con1' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State: Reconnecting after a failed master event read
                   Master_Host: 192.168.100.102
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius0-bin.000003
           Read_Master_Log_Pos: 344
                Relay_Log_File: slave-db0-relay-bin-con1.000003
                 Relay_Log_Pos: 645
         Relay_Master_Log_File: radius0-bin.000003
              Slave_IO_Running: Connecting
             Slave_SQL_Running: Yes
...
                 Last_IO_Errno: 2003
                 Last_IO_Error: error reconnecting to master 'replication_user@192.168.100.102:3306' - retry-time: 10  maximum-retries: 86400  message: Can't connect to server on '192.168.100.102' (111 "Connection refused")
...
root@slave-db0 [(none)] (14:02:43)
> STOP REPLICA 'Con1' ;
Query OK, 0 rows affected (0.279 sec)

root@slave-db0 [(none)] (14:02:44)
> SHOW REPLICA 'Con1' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: 192.168.100.102
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius0-bin.000003
           Read_Master_Log_Pos: 344
                Relay_Log_File: slave-db0-relay-bin-con1.000003
                 Relay_Log_Pos: 645
         Relay_Master_Log_File: radius0-bin.000003
              Slave_IO_Running: No
             Slave_SQL_Running: No
...
                 Last_IO_Errno: 2003
                 Last_IO_Error: error reconnecting to master 'replication_user@192.168.100.102:3306' - retry-time: 10  maximum-retries: 86400  message: Can't connect to server on '192.168.100.102' (111 "Connection refused")
...

とめた

radius1 で INSERT して、データを更新する

USE radius;
SELECT * FROM radcheck ;

INSERT INTO radcheck 
(username, attribute, op, value)
VALUES
('foo5', 'Cleartext-Password', ':=', 'foo5pass')
;

SELECT * FROM radcheck ;

log

root@radius1 [(none)] (14:03:25)
> USE radius;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
root@radius1 [radius] (14:03:28)
> SELECT * FROM radcheck ;
+----+----------+--------------------+----+----------+
| id | username | attribute          | op | value    |
+----+----------+--------------------+----+----------+
|  1 | foo1     | Cleartext-Password | := | foo1pass |
|  2 | foo2     | Cleartext-Password | := | foo2pass |
|  3 | foo3     | Cleartext-Password | := | foo3pass |
|  4 | foo4     | Cleartext-Password | := | foo4pass |
+----+----------+--------------------+----+----------+
4 rows in set (0.002 sec)

root@radius1 [radius] (14:03:32)
> INSERT INTO radcheck
    -> (username, attribute, op, value)
    -> VALUES
    -> ('foo5', 'Cleartext-Password', ':=', 'foo5pass')
    -> ;
Query OK, 1 row affected (0.016 sec)

root@radius1 [radius] (14:03:39)
> SELECT * FROM radcheck ;
+----+----------+--------------------+----+----------+
| id | username | attribute          | op | value    |
+----+----------+--------------------+----+----------+
|  1 | foo1     | Cleartext-Password | := | foo1pass |
|  2 | foo2     | Cleartext-Password | := | foo2pass |
|  3 | foo3     | Cleartext-Password | := | foo3pass |
|  4 | foo4     | Cleartext-Password | := | foo4pass |
|  6 | foo5     | Cleartext-Password | := | foo5pass |
+----+----------+--------------------+----+----------+
5 rows in set (0.001 sec)

root@radius1 [radius] (14:03:42)
>

radius0 で mariadb の余計な子プロセスがあれば kill する

ps aux | grep -i -e "mariadb" -e "mysql"

systemctl status mariadb.service | grep -i pid

kill <PID>

log

[ope@radius0 ~]$ ps aux | grep -i -e "mariadb" -e "mysql"
ope        38989  0.0  0.0  10284  1040 pts/0    S+   14:04   0:00 grep --color=auto -i -e mariadb -e mysql
[ope@radius0 ~]$
[ope@radius0 ~]$ systemctl status mariadb.service | grep -i pid
[ope@radius0 ~]$

念の為 radius1 でデータのバックアップを行う

ls -tlr 

mysqldump --defaults-extra-file=mariadb_root.cnf --all-databases \
--flush-logs --single-transaction --master-data=2 > \
all_databases_on_master_$(date +%F_%H-%M-%S).sql

log

[ope@radius1 ~]$ ls -tlr
合計 2744
-rw-rw-r--. 1 ope ope      70  411 17:41 mariadb_radius.cnf
-rw-rw-r--. 1 ope ope 1397837  811 14:37 all_databases_on_master_2021-08-11_14-36-18.sql
-rw-rw-r--. 1 ope ope       0  816 16:52 all_databases_on_master_2021-08-16_16-52-47.sql
-rw-rw-r--. 1 ope ope 1398151  816 16:54 all_databases_on_master_2021-08-16_16-53-59.sql
-rw-rw-r--. 1 ope ope     106  816 17:19 mariadb_root.cnf
[ope@radius1 ~]$
[ope@radius1 ~]$ mysqldump --defaults-extra-file=mariadb_root.cnf --all-databases \
> --flush-logs --single-transaction --master-data=2 > \
> all_databases_on_master_$(date +%F_%H-%M-%S).sql
[ope@radius1 ~]$
[ope@radius1 ~]$ ls -tlr
合計 4112
-rw-rw-r--. 1 ope ope      70  411 17:41 mariadb_radius.cnf
-rw-rw-r--. 1 ope ope 1397837  811 14:37 all_databases_on_master_2021-08-11_14-36-18.sql
-rw-rw-r--. 1 ope ope       0  816 16:52 all_databases_on_master_2021-08-16_16-52-47.sql
-rw-rw-r--. 1 ope ope 1398151  816 16:54 all_databases_on_master_2021-08-16_16-53-59.sql
-rw-rw-r--. 1 ope ope       0  817 14:05 all_databases_on_master_2021-08-17_14-05-04.sql
-rw-rw-r--. 1 ope ope     108  817 14:05 mariadb_root.cnf
-rw-rw-r--. 1 ope ope 1398199  817 14:05 all_databases_on_master_2021-08-17_14-05-20.sql
[ope@radius1 ~]$

all_databases_on_master_2021-08-17_14-05-20.sql backup done

radius0 で mariadb を起動する

sudo systemctl enable mariadb.service
sudo systemctl is-enabled mariadb.service

sudo systemctl restart mariadb.service

systemctl status mariadb.service

log

[ope@radius0 ~]$ sudo systemctl enable mariadb.service
Created symlink /etc/systemd/system/multi-user.target.wants/mariadb.service → /usr/lib/systemd/system/mariadb.service.
[ope@radius0 ~]$ sudo systemctl is-enabled mariadb.service
enabled
[ope@radius0 ~]$ sudo systemctl restart mariadb.service
[ope@radius0 ~]$ systemctl status mariadb.service
● mariadb.service - MariaDB 10.5.9 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor pre>
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: active (running) since Tue 2021-08-17 14:07:42 JST; 2s ago
     Docs: man:mariadbd(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 39077 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_ST>
  Process: 39026 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && V>
  Process: 39024 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_STA>
 Main PID: 39060 (mariadbd)
   Status: "Taking your SQL requests now..."
    Tasks: 16 (limit: 12415)
   Memory: 70.9M
   CGroup: /system.slice/mariadb.service
           └─39060 /usr/sbin/mariadbd
[ope@radius0 ~]$

サービスが起動した

radius0 で Connection Con2 の REPLICA として復活させる

まず状態確認

SHOW REPLICA 'Con2' STATUS \G

log

root@radius0 [(none)] (14:08:32)
> SHOW REPLICA 'Con2' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.100.103
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius1-bin.000003
           Read_Master_Log_Pos: 389
                Relay_Log_File: radius0-relay-bin-con2.000004
                 Relay_Log_Pos: 690
         Relay_Master_Log_File: radius1-bin.000003
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...
root@radius0 [(none)] (14:11:36)
> SELECT * FROM radius.radcheck;
+----+----------+--------------------+----+----------+
| id | username | attribute          | op | value    |
+----+----------+--------------------+----+----------+
|  1 | foo1     | Cleartext-Password | := | foo1pass |
|  2 | foo2     | Cleartext-Password | := | foo2pass |
|  3 | foo3     | Cleartext-Password | := | foo3pass |
|  4 | foo4     | Cleartext-Password | := | foo4pass |
|  6 | foo5     | Cleartext-Password | := | foo5pass |
+----+----------+--------------------+----+----------+
5 rows in set (0.001 sec)

root@radius0 [(none)] (14:13:13)
>

同期できた
kill で落としても同じだった
書き込み回数とかが物を言うのだろうか?

Connection Con1 の REPLICA を復活させる

SHOW REPLICA 'Con1' STATUS \G

START REPLICA 'Con1' ;

SHOW REPLICA 'Con1' STATUS \G

radius1 log

root@radius1 [(none)] (14:10:58)
> SHOW REPLICA 'Con1' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: 192.168.100.102
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius0-bin.000003
           Read_Master_Log_Pos: 344
                Relay_Log_File: radius1-relay-bin-con1.000003
                 Relay_Log_Pos: 645
         Relay_Master_Log_File: radius0-bin.000003
              Slave_IO_Running: No
             Slave_SQL_Running: No
...
root@radius1 [(none)] (14:11:49)
> START REPLICA 'Con1' ;
Query OK, 0 rows affected (0.060 sec)

root@radius1 [(none)] (14:11:54)
> SHOW REPLICA 'Con1' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.100.102
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius0-bin.000004
           Read_Master_Log_Pos: 344
                Relay_Log_File: radius1-relay-bin-con1.000002
                 Relay_Log_Pos: 645
         Relay_Master_Log_File: radius0-bin.000004
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
...

slave-db0 log

root@slave-db0 [(none)] (14:12:18)
> SHOW REPLICA 'Con1' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: 192.168.100.102
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius0-bin.000003
           Read_Master_Log_Pos: 344
                Relay_Log_File: slave-db0-relay-bin-con1.000003
                 Relay_Log_Pos: 645
         Relay_Master_Log_File: radius0-bin.000003
              Slave_IO_Running: No
             Slave_SQL_Running: No
...
root@slave-db0 [(none)] (14:12:23)
> START REPLICA 'Con1' ;
Query OK, 0 rows affected (0.075 sec)

root@slave-db0 [(none)] (14:12:39)
> SHOW REPLICA 'Con1' STATUS \G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 192.168.100.102
                   Master_User: replication_user
                   Master_Port: 3306
                 Connect_Retry: 10
               Master_Log_File: radius0-bin.000004
           Read_Master_Log_Pos: 344
                Relay_Log_File: slave-db0-relay-bin-con1.000002
                 Relay_Log_Pos: 645
         Relay_Master_Log_File: radius0-bin.000004
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
...

まとめ

kill で強制的に落としてみたけど、プロセスを立ち上げたときに Connection Con2 の REPLICA として同期が取れ、そのままうまくいってしまった
本番でもここまで簡単なら良いのだが、実際はサーバの調達に時間かかったりして、 binlog がなくなっているかもしれない。そうなったら backup とってリストアして、 RESET してとかいろいろしなきゃならなそう
いったんここまで