psql: FATAL: the database system is starting up

From: Tom K <tomkcpr(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: psql: FATAL: the database system is starting up
Date: 2019-05-28 04:59:23
Message-ID: CAE3EmBDX+TRLot5Ar6fJwV_QazSrbQ8JmWb9ZhGtj9wgtu7c_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hey Guy's,

I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
RHEL 7.6 VM's. Every now and then the underlying storage crashes taking
out the cluster. On recovery, PostgreSQL tends to come up while other
databases just blow up. That is until now.

After the most recent crash 2-3 weeks ago, the cluster is now running into
this message but I'm not able to make heads or tails out of why it's
throwing this:

-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf2019-05-23 08:26:34.561 EDT
[10101] LOG: listening on IPv4 address "192.168.0.108", port
54322019-05-23 08:26:34.565 EDT [10101] LOG: listening on Unix socket
"./.s.PGSQL.5432"2019-05-23 08:26:34.591 EDT [10101] LOG: redirecting
log output to logging collector process2019-05-23 08:26:34.591 EDT
[10101] HINT: Future log output will appear in directory
"log".^C-bash-4.2/usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.108 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=1002019-05-23
08:26:45.707 EDT [10105] LOG: listening on IPv4 address
"192.168.0.108", port 54322019-05-23 08:26:45.715 EDT [10105] LOG:
listening on Unix socket "./.s.PGSQL.5432"2019-05-23 08:26:45.740 EDT
[10105] LOG: redirecting log output to logging collector
process2019-05-23 08:26:45.740 EDT [10105] HINT: Future log output
will appear in directory "log".^C-bash-4.2$ /usr/pgsql-10/bin/postgres
-D /data/patroni --config-file=/data/patroni/postgresql.conf
--hot_standby=on --listen_addresses=192.168.0.1
--max_worker_processes=8 --max_locks_per_transaction=64
--wal_level=replica --cluster_name=postgres --wal_log_hints=on
--max_wal_senders=10 --track_commit_timestamp=off
--max_prepared_transactions=0 --port=5432 --max_replication_slots=10
--max_connections=1002019-05-23 08:29:36.014 EDT [10188] LOG:
listening on IPv4 address "192.168.0.108", port 54322019-05-23
08:29:36.017 EDT [10188] LOG: listening on Unix socket
"./.s.PGSQL.5432"2019-05-23 08:29:36.040 EDT [10188] LOG: redirecting
log output to logging collector process2019-05-23 08:29:36.040 EDT
[10188] HINT: Future log output will appear in directory "log".

The database logs only print out this:

2019-05-23 08:25:44.476 EDT [10088] HINT: The database server will
regularly poll the pg_wal subdirectory to check for files placed
there.2019-05-23 08:25:44.477 EDT [10088] LOG: entering standby
mode2019-05-23 08:25:50.209 EDT [10086] LOG: received fast shutdown
request2019-05-23 08:25:50.215 EDT [10089] LOG: shutting
down2019-05-23 08:25:50.221 EDT [10086] LOG: database system is shut
down2019-05-23 08:26:11.560 EDT [10094] LOG: database system was shut
down in recovery at 2019-05-23 08:25:50 EDT2019-05-23 08:26:11.560 EDT
[10094] WARNING: recovery command file "recovery.conf" specified
neither primary_conninfo nor restore_command2019-05-23 08:26:11.560
EDT [10094] HINT: The database server will regularly poll the pg_wal
subdirectory to check for files placed there.2019-05-23 08:26:11.562
EDT [10094] LOG: entering standby mode2019-05-23 08:26:17.817 EDT
[10092] LOG: received fast shutdown request2019-05-23 08:26:17.824
EDT [10095] LOG: shutting down2019-05-23 08:26:17.831 EDT [10092]
LOG: database system is shut down2019-05-23 08:26:34.596 EDT [10103]
LOG: database system was shut down in recovery at 2019-05-23 08:26:17
EDT2019-05-23 08:26:34.596 EDT [10103] WARNING: recovery command file
"recovery.conf" specified neither primary_conninfo nor
restore_command2019-05-23 08:26:34.596 EDT [10103] HINT: The database
server will regularly poll the pg_wal subdirectory to check for files
placed there.2019-05-23 08:26:34.598 EDT [10103] LOG: entering
standby mode2019-05-23 08:26:43.025 EDT [10101] LOG: received fast
shutdown request2019-05-23 08:26:43.032 EDT [10104] LOG: shutting
down2019-05-23 08:26:43.038 EDT [10101] LOG: database system is shut
down2019-05-23 08:26:45.748 EDT [10107] LOG: database system was shut
down in recovery at 2019-05-23 08:26:43 EDT2019-05-23 08:26:45.748 EDT
[10107] WARNING: recovery command file "recovery.conf" specified
neither primary_conninfo nor restore_command2019-05-23 08:26:45.748
EDT [10107] HINT: The database server will regularly poll the pg_wal
subdirectory to check for files placed there.

and when connecting, I get this cryptic message:

[root(at)psql01 log]# psql -U postgres -h 192.168.0.108
psql: FATAL: the database system is starting up[root(at)psql01 log]#

There's no space issue here. The cluster stays in recovery. This state
doesn't change and the DB is inaccessible.

So I ran the above in debug mode ( -d 5 ) and got this:

DEBUG: mmap(148897792) with MAP_HUGETLB failed, huge pages disabled:
Cannot allocate memory

so I freed memory on the underlying physical this VM sits on but no luck.
Same memory error and same result as above.

When patroni is started on just this one node (troubleshooting purposes) I
get the message:

[root(at)psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root 2217 2188 0 00:38 pts/1 00:00:00 tail -f postgresql-Thu.log
postgres 2512 1 4 00:42 ? 00:00:01 /usr/bin/python2
/bin/patroni /etc/patroni.yml
postgres 2533 1 0 00:42 ? 00:00:00
/usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.108 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100
postgres 2535 2533 0 00:42 ? 00:00:00 postgres: postgres:
logger process
postgres 2536 2533 0 00:42 ? 00:00:00 postgres: postgres:
startup process waiting for 000000010000000000000008
root 2664 2039 0 00:42 pts/0 00:00:00 grep --color=auto -Ei
patroni|postgres[root(at)psql01 ~]#

The system prints "postgres: postgres: startup process waiting for
000000010000000000000008"

Seems it's trying to replay the history log but can't? Running an strace on
the file reveals that it's looking for the following:

read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource
temporarily unavailable)read(5, 0x7fff9cb4eb87, 1) = -1
EAGAIN (Resource temporarily
unavailable)open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No
such file or directory)
epoll_create1(EPOLL_CLOEXEC) = 3

But this file doesn't exist in the directory on this particular cluster
node (psql01). I copy the missing history files off another cluster node
to this one and it does move forward but just get's stuck on another
history file. But same error message as above on psql01.

So I give up and try to recover the secondary node (psql02). But this
yields this error:

PANIC: replication checkpoint has wrong magic 0 instead of 307747550

and I'm not able to find very much to move me forward on this one either.

So again I give up after a few hours of trying to recover psql02 and move
on to the third node. On the third cluster node (pgsql03) I get this error
which I'm also not able to find much info on:

FATAL: syntax error in history file: f2W

Hoping I'm able to recover the data from at least one of the nodes, make it
primary, and get it to replicate to the others.

Thx, TK

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2019-05-28 13:53:32 Re: psql: FATAL: the database system is starting up
Previous Message Julie Nishimura 2019-05-27 17:58:57 Re: POSTGRES_FSM_RELATIONS CRITICAL: DB control fsm relations used: 79569 of 80000 (99%)