目次
Con1 の MAIN の mariadb プロセスが落ちた時の対応をやってみる
前回の記事
やってみる
手順
一旦こんな感じでやってみる
- radius0 で mariadb の main pid を確認する
- radius0 で mariadb を
kill -9
する - Connection
Con1
の REPLICA を STOP する - radius1 で INSERT して、データを更新する
- radius0 で mariadb の余計な子プロセスがあれば
kill
する - 念の為 radius1 でデータのバックアップを行う
- radius0 で mariadb を起動する
- radius0 で Connection
Con2
の REPLICA として復活させる - エラーがでたら トラブルシュートする
- Connection
Con1
の REPLICA を復活させる - エラーがでたらトラブルシュートする
radius0 で mariadb の main pid を調べる
sudo systemctl status mariadb.service | grep -i "main pid"
log
[ope@radius0 ~]$ sudo systemctl status mariadb.service | grep -i "main pid" Main PID: 38262 (mariadbd) [ope@radius0 ~]$
radius0 で mariadb を kill -9
する
sudo kill -9 <PID> sudo kill -9 38262
log
[ope@radius0 ~]$ sudo kill -9 38262 [ope@radius0 ~]$ [ope@radius0 ~]$ sudo systemctl status mariadb.service | grep -i "main pid" Main PID: 38262 (code=killed, signal=KILL); Control PID: 38816 ((sh)) [ope@radius0 ~]$ [ope@radius0 ~]$ sudo systemctl status mariadb.service ● mariadb.service - MariaDB 10.5.9 database server Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor pre> Drop-In: /etc/systemd/system/mariadb.service.d └─migrated-from-my.cnf-settings.conf Active: active (running) since Tue 2021-08-17 13:50:57 JST; 4s ago Docs: man:mariadbd(8) https://mariadb.com/kb/en/library/systemd/ Process: 38867 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_ST> Process: 38816 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && V> Process: 38812 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_STA> Main PID: 38850 (mariadbd) Status: "Taking your SQL requests now..." Tasks: 18 (limit: 12415) Memory: 70.2M CGroup: /system.slice/mariadb.service └─38850 /usr/sbin/mariadbd 8月 17 13:50:57 radius0 mariadbd[38850]: 2021-08-17 13:50:57 0 [Note] Reading > 8月 17 13:50:57 radius0 mariadbd[38850]: 2021-08-17 13:50:57 0 [Note] Added ne> 8月 17 13:50:57 radius0 mariadbd[38850]: 2021-08-17 13:50:57 0 [Note] /usr/sbi> 8月 17 13:50:57 radius0 mariadbd[38850]: Version: '10.5.9-MariaDB-log' socket> 8月 17 13:50:57 radius0 mariadbd[38850]: 2021-08-17 13:50:57 4 [Note] Master '> 8月 17 13:50:57 radius0 systemd[1]: Started MariaDB 10.5.9 database server. 8月 17 13:51:00 radius0 mariadbd[38850]: 2021-08-17 13:51:00 6 [Warning] IP ad> 8月 17 13:51:00 radius0 mariadbd[38850]: 2021-08-17 13:51:00 7 [Warning] IP ad> 8月 17 13:51:00 radius0 mariadbd[38850]: 2021-08-17 13:51:00 7 [Note] Start bi> 8月 17 13:51:00 radius0 mariadbd[38850]: 2021-08-17 13:51:00 6 [Note] Start bi> [ope@radius0 ~]$
なんだと?生き返った
自動起動設定のせいかも?
[ope@radius0 ~]$ sudo systemctl is- is-active is-enabled is-failed is-system-running [ope@radius0 ~]$ sudo systemctl is-enabled mariadb.service enabled [ope@radius0 ~]$ sudo systemctl disable mariadb.service Removed /etc/systemd/system/multi-user.target.wants/mariadb.service. [ope@radius0 ~]$ sudo systemctl is-enabled mariadb.service disabled [ope@radius0 ~]$ [ope@radius0 ~]$ sudo systemctl status mariadb.service | grep -i "main pid" Main PID: 38850 (mariadbd) [ope@radius0 ~]$ [ope@radius0 ~]$ sudo kill 38850 [ope@radius0 ~]$ [ope@radius0 ~]$ [ope@radius0 ~]$ sudo systemctl status mariadb.service ● mariadb.service - MariaDB 10.5.9 database server Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor pr> Drop-In: /etc/systemd/system/mariadb.service.d └─migrated-from-my.cnf-settings.conf Active: inactive (dead) Docs: man:mariadbd(8) https://mariadb.com/kb/en/library/systemd/ 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 4 [Note] Master '> 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 4 [Note] Master '> 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: > 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: > 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: > 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: > 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: > 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] InnoDB: > 8月 17 13:57:58 radius0 mariadbd[38850]: 2021-08-17 13:57:58 0 [Note] /usr/sbi> 8月 17 13:57:58 radius0 systemd[1]: mariadb.service: Succeeded. [ope@radius0 ~]$
止まった
Connection Con1
の REPLICA を STOP する
SHOW REPLICA 'Con1' STATUS \G STOP REPLICA 'Con1' ; SHOW REPLICA 'Con1' STATUS \G
radius1
root@radius1 [(none)] (13:59:29) > SHOW REPLICA 'Con1' STATUS \G *************************** 1. row *************************** Slave_IO_State: Reconnecting after a failed master event read Master_Host: 192.168.100.102 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius0-bin.000003 Read_Master_Log_Pos: 344 Relay_Log_File: radius1-relay-bin-con1.000003 Relay_Log_Pos: 645 Relay_Master_Log_File: radius0-bin.000003 Slave_IO_Running: Connecting Slave_SQL_Running: Yes ... Last_IO_Errno: 2003 Last_IO_Error: error reconnecting to master 'replication_user@192.168.100.102:3306' - retry-time: 10 maximum-retries: 86400 message: Can't connect to server on '192.168.100.102' (111 "Connection refused") ... root@radius1 [(none)] (13:59:38) > STOP REPLICA 'Con1' ; Query OK, 0 rows affected (0.132 sec) root@radius1 [(none)] (14:00:29) > SHOW REPLICA 'Con1' STATUS \G *************************** 1. row *************************** Slave_IO_State: Master_Host: 192.168.100.102 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius0-bin.000003 Read_Master_Log_Pos: 344 Relay_Log_File: radius1-relay-bin-con1.000003 Relay_Log_Pos: 645 Relay_Master_Log_File: radius0-bin.000003 Slave_IO_Running: No Slave_SQL_Running: No ... Last_IO_Errno: 2003 Last_IO_Error: error reconnecting to master 'replication_user@192.168.100.102:3306' - retry-time: 10 maximum-retries: 86400 message: Can't connect to server on '192.168.100.102' (111 "Connection refused") ...
slave-db0
root@slave-db0 [(none)] (14:01:30) > SHOW REPLICA 'Con1' STATUS \G *************************** 1. row *************************** Slave_IO_State: Reconnecting after a failed master event read Master_Host: 192.168.100.102 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius0-bin.000003 Read_Master_Log_Pos: 344 Relay_Log_File: slave-db0-relay-bin-con1.000003 Relay_Log_Pos: 645 Relay_Master_Log_File: radius0-bin.000003 Slave_IO_Running: Connecting Slave_SQL_Running: Yes ... Last_IO_Errno: 2003 Last_IO_Error: error reconnecting to master 'replication_user@192.168.100.102:3306' - retry-time: 10 maximum-retries: 86400 message: Can't connect to server on '192.168.100.102' (111 "Connection refused") ... root@slave-db0 [(none)] (14:02:43) > STOP REPLICA 'Con1' ; Query OK, 0 rows affected (0.279 sec) root@slave-db0 [(none)] (14:02:44) > SHOW REPLICA 'Con1' STATUS \G *************************** 1. row *************************** Slave_IO_State: Master_Host: 192.168.100.102 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius0-bin.000003 Read_Master_Log_Pos: 344 Relay_Log_File: slave-db0-relay-bin-con1.000003 Relay_Log_Pos: 645 Relay_Master_Log_File: radius0-bin.000003 Slave_IO_Running: No Slave_SQL_Running: No ... Last_IO_Errno: 2003 Last_IO_Error: error reconnecting to master 'replication_user@192.168.100.102:3306' - retry-time: 10 maximum-retries: 86400 message: Can't connect to server on '192.168.100.102' (111 "Connection refused") ...
とめた
radius1 で INSERT して、データを更新する
USE radius; SELECT * FROM radcheck ; INSERT INTO radcheck (username, attribute, op, value) VALUES ('foo5', 'Cleartext-Password', ':=', 'foo5pass') ; SELECT * FROM radcheck ;
log
root@radius1 [(none)] (14:03:25) > USE radius; Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Database changed root@radius1 [radius] (14:03:28) > SELECT * FROM radcheck ; +----+----------+--------------------+----+----------+ | id | username | attribute | op | value | +----+----------+--------------------+----+----------+ | 1 | foo1 | Cleartext-Password | := | foo1pass | | 2 | foo2 | Cleartext-Password | := | foo2pass | | 3 | foo3 | Cleartext-Password | := | foo3pass | | 4 | foo4 | Cleartext-Password | := | foo4pass | +----+----------+--------------------+----+----------+ 4 rows in set (0.002 sec) root@radius1 [radius] (14:03:32) > INSERT INTO radcheck -> (username, attribute, op, value) -> VALUES -> ('foo5', 'Cleartext-Password', ':=', 'foo5pass') -> ; Query OK, 1 row affected (0.016 sec) root@radius1 [radius] (14:03:39) > SELECT * FROM radcheck ; +----+----------+--------------------+----+----------+ | id | username | attribute | op | value | +----+----------+--------------------+----+----------+ | 1 | foo1 | Cleartext-Password | := | foo1pass | | 2 | foo2 | Cleartext-Password | := | foo2pass | | 3 | foo3 | Cleartext-Password | := | foo3pass | | 4 | foo4 | Cleartext-Password | := | foo4pass | | 6 | foo5 | Cleartext-Password | := | foo5pass | +----+----------+--------------------+----+----------+ 5 rows in set (0.001 sec) root@radius1 [radius] (14:03:42) >
radius0 で mariadb の余計な子プロセスがあれば kill
する
ps aux | grep -i -e "mariadb" -e "mysql" systemctl status mariadb.service | grep -i pid kill <PID>
log
[ope@radius0 ~]$ ps aux | grep -i -e "mariadb" -e "mysql" ope 38989 0.0 0.0 10284 1040 pts/0 S+ 14:04 0:00 grep --color=auto -i -e mariadb -e mysql [ope@radius0 ~]$ [ope@radius0 ~]$ systemctl status mariadb.service | grep -i pid [ope@radius0 ~]$
念の為 radius1 でデータのバックアップを行う
ls -tlr mysqldump --defaults-extra-file=mariadb_root.cnf --all-databases \ --flush-logs --single-transaction --master-data=2 > \ all_databases_on_master_$(date +%F_%H-%M-%S).sql
log
[ope@radius1 ~]$ ls -tlr 合計 2744 -rw-rw-r--. 1 ope ope 70 4月 11 17:41 mariadb_radius.cnf -rw-rw-r--. 1 ope ope 1397837 8月 11 14:37 all_databases_on_master_2021-08-11_14-36-18.sql -rw-rw-r--. 1 ope ope 0 8月 16 16:52 all_databases_on_master_2021-08-16_16-52-47.sql -rw-rw-r--. 1 ope ope 1398151 8月 16 16:54 all_databases_on_master_2021-08-16_16-53-59.sql -rw-rw-r--. 1 ope ope 106 8月 16 17:19 mariadb_root.cnf [ope@radius1 ~]$ [ope@radius1 ~]$ mysqldump --defaults-extra-file=mariadb_root.cnf --all-databases \ > --flush-logs --single-transaction --master-data=2 > \ > all_databases_on_master_$(date +%F_%H-%M-%S).sql [ope@radius1 ~]$ [ope@radius1 ~]$ ls -tlr 合計 4112 -rw-rw-r--. 1 ope ope 70 4月 11 17:41 mariadb_radius.cnf -rw-rw-r--. 1 ope ope 1397837 8月 11 14:37 all_databases_on_master_2021-08-11_14-36-18.sql -rw-rw-r--. 1 ope ope 0 8月 16 16:52 all_databases_on_master_2021-08-16_16-52-47.sql -rw-rw-r--. 1 ope ope 1398151 8月 16 16:54 all_databases_on_master_2021-08-16_16-53-59.sql -rw-rw-r--. 1 ope ope 0 8月 17 14:05 all_databases_on_master_2021-08-17_14-05-04.sql -rw-rw-r--. 1 ope ope 108 8月 17 14:05 mariadb_root.cnf -rw-rw-r--. 1 ope ope 1398199 8月 17 14:05 all_databases_on_master_2021-08-17_14-05-20.sql [ope@radius1 ~]$
all_databases_on_master_2021-08-17_14-05-20.sql
backup done
radius0 で mariadb を起動する
sudo systemctl enable mariadb.service sudo systemctl is-enabled mariadb.service sudo systemctl restart mariadb.service systemctl status mariadb.service
log
[ope@radius0 ~]$ sudo systemctl enable mariadb.service Created symlink /etc/systemd/system/multi-user.target.wants/mariadb.service → /usr/lib/systemd/system/mariadb.service. [ope@radius0 ~]$ sudo systemctl is-enabled mariadb.service enabled [ope@radius0 ~]$ sudo systemctl restart mariadb.service [ope@radius0 ~]$ systemctl status mariadb.service ● mariadb.service - MariaDB 10.5.9 database server Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor pre> Drop-In: /etc/systemd/system/mariadb.service.d └─migrated-from-my.cnf-settings.conf Active: active (running) since Tue 2021-08-17 14:07:42 JST; 2s ago Docs: man:mariadbd(8) https://mariadb.com/kb/en/library/systemd/ Process: 39077 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_ST> Process: 39026 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && V> Process: 39024 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_STA> Main PID: 39060 (mariadbd) Status: "Taking your SQL requests now..." Tasks: 16 (limit: 12415) Memory: 70.9M CGroup: /system.slice/mariadb.service └─39060 /usr/sbin/mariadbd [ope@radius0 ~]$
サービスが起動した
radius0 で Connection Con2
の REPLICA として復活させる
まず状態確認
SHOW REPLICA 'Con2' STATUS \G
log
root@radius0 [(none)] (14:08:32) > SHOW REPLICA 'Con2' STATUS \G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 192.168.100.103 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius1-bin.000003 Read_Master_Log_Pos: 389 Relay_Log_File: radius0-relay-bin-con2.000004 Relay_Log_Pos: 690 Relay_Master_Log_File: radius1-bin.000003 Slave_IO_Running: Yes Slave_SQL_Running: Yes ... root@radius0 [(none)] (14:11:36) > SELECT * FROM radius.radcheck; +----+----------+--------------------+----+----------+ | id | username | attribute | op | value | +----+----------+--------------------+----+----------+ | 1 | foo1 | Cleartext-Password | := | foo1pass | | 2 | foo2 | Cleartext-Password | := | foo2pass | | 3 | foo3 | Cleartext-Password | := | foo3pass | | 4 | foo4 | Cleartext-Password | := | foo4pass | | 6 | foo5 | Cleartext-Password | := | foo5pass | +----+----------+--------------------+----+----------+ 5 rows in set (0.001 sec) root@radius0 [(none)] (14:13:13) >
同期できた
kill
で落としても同じだった
書き込み回数とかが物を言うのだろうか?
Connection Con1
の REPLICA を復活させる
SHOW REPLICA 'Con1' STATUS \G START REPLICA 'Con1' ; SHOW REPLICA 'Con1' STATUS \G
radius1 log
root@radius1 [(none)] (14:10:58) > SHOW REPLICA 'Con1' STATUS \G *************************** 1. row *************************** Slave_IO_State: Master_Host: 192.168.100.102 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius0-bin.000003 Read_Master_Log_Pos: 344 Relay_Log_File: radius1-relay-bin-con1.000003 Relay_Log_Pos: 645 Relay_Master_Log_File: radius0-bin.000003 Slave_IO_Running: No Slave_SQL_Running: No ... root@radius1 [(none)] (14:11:49) > START REPLICA 'Con1' ; Query OK, 0 rows affected (0.060 sec) root@radius1 [(none)] (14:11:54) > SHOW REPLICA 'Con1' STATUS \G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 192.168.100.102 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius0-bin.000004 Read_Master_Log_Pos: 344 Relay_Log_File: radius1-relay-bin-con1.000002 Relay_Log_Pos: 645 Relay_Master_Log_File: radius0-bin.000004 Slave_IO_Running: Yes Slave_SQL_Running: Yes ...
slave-db0 log
root@slave-db0 [(none)] (14:12:18) > SHOW REPLICA 'Con1' STATUS \G *************************** 1. row *************************** Slave_IO_State: Master_Host: 192.168.100.102 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius0-bin.000003 Read_Master_Log_Pos: 344 Relay_Log_File: slave-db0-relay-bin-con1.000003 Relay_Log_Pos: 645 Relay_Master_Log_File: radius0-bin.000003 Slave_IO_Running: No Slave_SQL_Running: No ... root@slave-db0 [(none)] (14:12:23) > START REPLICA 'Con1' ; Query OK, 0 rows affected (0.075 sec) root@slave-db0 [(none)] (14:12:39) > SHOW REPLICA 'Con1' STATUS \G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 192.168.100.102 Master_User: replication_user Master_Port: 3306 Connect_Retry: 10 Master_Log_File: radius0-bin.000004 Read_Master_Log_Pos: 344 Relay_Log_File: slave-db0-relay-bin-con1.000002 Relay_Log_Pos: 645 Relay_Master_Log_File: radius0-bin.000004 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: ...
まとめ
kill
で強制的に落としてみたけど、プロセスを立ち上げたときに Connection Con2
の REPLICA として同期が取れ、そのままうまくいってしまった
本番でもここまで簡単なら良いのだが、実際はサーバの調達に時間かかったりして、 binlog がなくなっているかもしれない。そうなったら backup とってリストアして、 RESET してとかいろいろしなきゃならなそう
いったんここまで