etcd failed on master node

From: daur exp <dauren(dot)naipov(at)gmail(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: etcd failed on master node
Date: 2025-02-07 09:38:06
Message-ID: CA+BrJtNYrWX-ark=O4BwmceZEAfQ8p=pGfnxvVazQjHesXA5FA@mail.gmail.com
Lists: pgsql-admin

Hello Community,
I have a 3-node etcd cluster. Two of the nodes also run PostgreSQL in a
Patroni cluster; the third node is only an etcd member. The three nodes are
located in three different data centers.
I cloned the VM of the first node, and afterwards etcd crashed on that node.
I think the disk was overloaded while the VM was being cloned. Patroni and
PostgreSQL are fine and still working.

I see this error from etcd:
Feb 06 15:44:52 prod-pgsql01-uv01 bash[1832370]: panic: tocommit(43265370) is out of range [lastIndex(43231911)]. Was the raft log corrupted, truncated, or lost?
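
To make the panic easier to read: it reports a commit index that is higher than the last entry in this node's local raft log. The small sketch below only parses the message quoted above and computes the difference between the two indexes; the variable names are my own and nothing here talks to etcd.

```shell
# Panic line copied from the etcd log above.
msg='panic: tocommit(43265370) is out of range [lastIndex(43231911)]. Was the raft log corrupted, truncated, or lost?'

# Pull the two raft indexes out of the message (plain sed, no etcd tooling needed).
tocommit=$(printf '%s\n' "$msg" | sed -n 's/.*tocommit(\([0-9][0-9]*\)).*/\1/p')
lastindex=$(printf '%s\n' "$msg" | sed -n 's/.*lastIndex(\([0-9][0-9]*\)).*/\1/p')

# Entries the cluster expects this node to have but its local log does not contain.
gap=$((tocommit - lastindex))
echo "local log ends at $lastindex, commit index is $tocommit, gap is $gap entries"
```

A non-zero gap like this suggests the local log on node1 is behind what the rest of the cluster believes it already stored, which would fit a disk problem during the clone.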

root@prod-pgsql01-uv01:/etc/etcd# systemctl status etcd
× etcd.service - Etcd Server
     Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Thu 2025-02-06 14:31:50 +05; 8min ago
   Duration: 6month 4w 1d 9h 17min 14.044s
    Process: 1823702 ExecStart=/bin/bash -c GOMAXPROCS=$(nproc) /usr/bin/etcd (code=exited, status=2)
   Main PID: 1823702 (code=exited, status=2)
        CPU: 1.668s

Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Scheduled restart job, restart counter is at 5.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: Stopped Etcd Server.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Consumed 1.668s CPU time.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Start request repeated too quickly.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: etcd.service: Failed with result 'exit-code'.
Feb 06 14:31:50 prod-pgsql01-uv01. systemd[1]: Failed to start Etcd Server.
root@prod-pgsql01-uv01:/etc/etcd# etcdctl member list
{"level":"warn","ts":"2025-02-06T14:40:17.959419+0500","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000d8700/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Error: context deadline exceeded

root@prod-pgsql01-uv01:/var/lib/etcd/postgresql/member# patronictl -c /etc/patroni/patroni.yml list
2025-02-07 14:25:57,387 - WARNING - Retrying (Retry(total=1, connect=None, read=None, redirect=0, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3610>: Failed to establish a new connection: [Errno 111] Connection refused')': /v2/machines
2025-02-07 14:25:57,387 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3910>: Failed to establish a new connection: [Errno 111] Connection refused')': /v2/machines
2025-02-07 14:25:57,387 - ERROR - Failed to get list of machines from http://10.0.100.29:2379/v2: MaxRetryError("HTTPConnectionPool(host='10.0.100.29', port=2379): Max retries exceeded with url: /v2/machines (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fe63fdf3a60>: Failed to establish a new connection: [Errno 111] Connection refused'))")
^C
===========================================================
But on the replica the output is fine and looks like this:

root@stb-pgsql02-uv02:/home/ansible# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: patroni-cluster (7389499945753108875) -------+----+-----------+
| Member            | Host        | Role    | State     | TL | Lag in MB |
+-------------------+-------------+---------+-----------+----+-----------+
| prod-pgsql01-uv01 | 10.0.100.29 | Leader  | running   |  4 |           |
| stb-pgsql02-uv02  | 10.0.100.31 | Replica | streaming |  4 |         0 |
+-------------------+-------------+---------+-----------+----+-----------+
root@stb-pgsql02-uv02:/home/ansible# etcdctl member list
164fa7c8b348f043, started, node1, http://10.0.100.29:2380, http://10.0.100.29:2379, false
5b78829cdf24f062, started, node2, http://10.0.100.31:2380, http://10.0.100.31:2379, false
b209a2d81cc1c996, started, node3, http://10.0.225.203:2380, http://10.0.225.203:2379, false

I am not sure whether a switchover that promotes the replica to master would
complete successfully. It might cause a split brain or crash the Patroni
cluster.
If I run this on node2:
patronictl -c /etc/patroni/patroni.yml switchover --candidate node2
then on node1:
systemctl stop etcd
rm -rf /var/lib/etcd/member/snap/*
rm -rf /var/lib/etcd/member/wal/*
systemctl start etcd
will that help?

I need your advice on how to restore the etcd cluster within the Patroni
cluster.
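
For comparison with the steps above: as far as I understand the etcd documentation, a member that has lost its data is normally removed from the cluster and added again, rather than only having its snap and wal directories wiped. The sketch below is a dry run that just prints such a plan and executes nothing. The member ID, name and URLs come from the `etcdctl member list` output above; `DATA_DIR` is an assumption (the paths in this mail are not consistent, so the real data directory has to be checked in the etcd configuration), and `plan_rejoin` is a made-up helper name.

```shell
#!/bin/sh
# Dry run only: prints the commands, does not run them.
# Values taken from the `etcdctl member list` output above.
MEMBER_ID=164fa7c8b348f043
MEMBER_NAME=node1
PEER_URL=http://10.0.100.29:2380
HEALTHY_ENDPOINT=http://10.0.100.31:2379
# ASSUMPTION: hypothetical data directory, verify it in the etcd config first.
DATA_DIR=/var/lib/etcd

plan_rejoin() {
  echo "# on a healthy member (node2 or node3):"
  echo "etcdctl --endpoints=$HEALTHY_ENDPOINT member remove $MEMBER_ID"
  echo "etcdctl --endpoints=$HEALTHY_ENDPOINT member add $MEMBER_NAME --peer-urls=$PEER_URL"
  echo "# on $MEMBER_NAME, with etcd stopped:"
  echo "mv $DATA_DIR/member $DATA_DIR/member.bak"
  echo "# set initial-cluster-state to 'existing' in the etcd config, then:"
  echo "systemctl start etcd"
}

plan=$(plan_rejoin)
printf '%s\n' "$plan"
```

The plan moves the old member directory aside instead of deleting it, so the broken state is still available if something goes wrong.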

--
Regards, Dauren
