Re: psql: FATAL: the database system is starting up

From: Tom K <tomkcpr(at)gmail(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: psql: FATAL: the database system is starting up
Date: 2019-05-29 01:59:33
Message-ID: CAE3EmBC5biD+xPEHfGWg0iuTojhWvJrREnSh-3AViQmOsOx9ZQ@mail.gmail.com
Lists: pgsql-general

On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
wrote:

> On 5/27/19 9:59 PM, Tom K wrote:
> > Hey Guys,
> >
> >
> > I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
> > RHEL 7.6 VMs. Every now and then the underlying storage crashes, taking
> > out the cluster. On recovery, PostgreSQL has tended to come back up while
> > other databases just blow up. That is, until now.
>
> The above is one master and two standby servers connected via streaming
> replication, correct?

> Is there WAL archiving set up?
>

Correct. Master election occurs through Patroni. WAL level is set to:

wal_level = 'replica'

So no archiving.
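
For completeness: there's nothing like the usual archive setup in postgresql.conf either. If there were, it would look roughly like this (the archive path below is purely illustrative, not something we have):

archive_mode = on
# %p = path of the WAL segment to archive, %f = its file name; target directory is illustrative
archive_command = 'cp %p /mnt/wal_archive/%f'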

>
>
> >
> > After the most recent crash 2-3 weeks ago, the cluster is now running
> > into this message, but I'm not able to make heads or tails of why
> > it's throwing it:
>
> So you have not been able to run the cluster the past 2-3 weeks or is
> that more recent?
>

Haven't been able to bring this PostgreSQL cluster up (i.e., run the cluster)
since 2-3 weeks ago. I've tried quite a few combinations of options to recover
it, with no luck. We've had storage failures before, even with corrupted OS
files, but this PostgreSQL cluster w/ Patroni was able to come up each time
without any recovery effort on my part.

>
> When you refer to history files below are you talking about WAL files or
> something else?
>
> Is this:
>
> "recovery command file "recovery.conf" specified neither
> primary_conninfo nor restore_command"
>
> true?
>

True. recovery.conf is controlled by Patroni. The contents of this file are
the same on all cluster nodes, with the exception of primary_slot_name:

[root(at)psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root(at)psql01 postgresql-patroni-etcd]#

[root(at)psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root(at)psql02 postgres-backup]#

[root(at)psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root(at)psql03 postgresql-patroni-backup]#
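
So the warning in the logs is literally true: neither primary_conninfo nor restore_command is there. For comparison, a recovery.conf that actually told the standby where to stream or restore WAL from would look roughly like this (the connection string and archive path are placeholders, not anything Patroni wrote for us):

standby_mode = 'on'
primary_slot_name = 'postgresql0'
recovery_target_timeline = 'latest'
# placeholder: point at whichever node is currently the primary
primary_conninfo = 'host=192.168.0.108 port=5432 user=replicator'
# placeholder: only meaningful if a WAL archive existed, which it does not here
restore_command = 'cp /mnt/wal_archive/%f %p'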

I've made a copy of the postgres data directory to another location so that,
while troubleshooting, I can always revert to the state the cluster was in
when it first failed.
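
The copy was made roughly like this; the destination path is approximate:

rsync -a /data/patroni/ /root/postgresql-patroni-backup/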

Thx,
TK

>
>
>
> >
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
> > 2019-05-23 08:26:34.561 EDT [10101] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:26:34.565 EDT [10101] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:26:34.591 EDT [10101] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:26:34.591 EDT [10101] HINT:  Future log output will appear in directory "log".
> > ^C
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > 2019-05-23 08:26:45.707 EDT [10105] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:26:45.715 EDT [10105] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:26:45.740 EDT [10105] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:26:45.740 EDT [10105] HINT:  Future log output will appear in directory "log".
> > ^C
> > -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > 2019-05-23 08:29:36.014 EDT [10188] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> > 2019-05-23 08:29:36.017 EDT [10188] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> > 2019-05-23 08:29:36.040 EDT [10188] LOG:  redirecting log output to logging collector process
> > 2019-05-23 08:29:36.040 EDT [10188] HINT:  Future log output will appear in directory "log".
> >
> > The database logs only print out this:
> >
> > 2019-05-23 08:25:44.476 EDT [10088] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> > 2019-05-23 08:25:44.477 EDT [10088] LOG:  entering standby mode
> > 2019-05-23 08:25:50.209 EDT [10086] LOG:  received fast shutdown request
> > 2019-05-23 08:25:50.215 EDT [10089] LOG:  shutting down
> > 2019-05-23 08:25:50.221 EDT [10086] LOG:  database system is shut down
> > 2019-05-23 08:26:11.560 EDT [10094] LOG:  database system was shut down in recovery at 2019-05-23 08:25:50 EDT
> > 2019-05-23 08:26:11.560 EDT [10094] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> > 2019-05-23 08:26:11.560 EDT [10094] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> > 2019-05-23 08:26:11.562 EDT [10094] LOG:  entering standby mode
> > 2019-05-23 08:26:17.817 EDT [10092] LOG:  received fast shutdown request
> > 2019-05-23 08:26:17.824 EDT [10095] LOG:  shutting down
> > 2019-05-23 08:26:17.831 EDT [10092] LOG:  database system is shut down
> > 2019-05-23 08:26:34.596 EDT [10103] LOG:  database system was shut down in recovery at 2019-05-23 08:26:17 EDT
> > 2019-05-23 08:26:34.596 EDT [10103] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> > 2019-05-23 08:26:34.596 EDT [10103] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> > 2019-05-23 08:26:34.598 EDT [10103] LOG:  entering standby mode
> > 2019-05-23 08:26:43.025 EDT [10101] LOG:  received fast shutdown request
> > 2019-05-23 08:26:43.032 EDT [10104] LOG:  shutting down
> > 2019-05-23 08:26:43.038 EDT [10101] LOG:  database system is shut down
> > 2019-05-23 08:26:45.748 EDT [10107] LOG:  database system was shut down in recovery at 2019-05-23 08:26:43 EDT
> > 2019-05-23 08:26:45.748 EDT [10107] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> > 2019-05-23 08:26:45.748 EDT [10107] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> >
> > and when connecting, I get this cryptic message:
> >
> > [root(at)psql01 log]# psql -U postgres -h 192.168.0.108
> > psql: FATAL:  the database system is starting up
> > [root(at)psql01 log]#
> >
> > There's no space issue here. The cluster stays in recovery. This state
> > doesn't change and the DB is inaccessible.
> >
> > So I ran the above in debug mode ( -d 5 ) and got this:
> >
> >
> > DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
> >
> > so I freed memory on the underlying physical host this VM sits on, but no
> > luck: same memory error and same result as above.
> >
> > When Patroni is started on just this one node (for troubleshooting
> > purposes), I see this:
> >
> > [root(at)psql01 ~]# ps -ef | grep -Ei "patroni|postgres"
> > root      2217  2188  0 00:38 pts/1    00:00:00 tail -f postgresql-Thu.log
> > postgres  2512     1  4 00:42 ?        00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
> > postgres  2533     1  0 00:42 ?        00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> > postgres  2535  2533  0 00:42 ?        00:00:00 postgres: postgres: logger process
> > postgres  2536  2533  0 00:42 ?        00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
> > root      2664  2039  0 00:42 pts/0    00:00:00 grep --color=auto -Ei patroni|postgres
> > [root(at)psql01 ~]#
> >
> > The system prints "postgres: postgres: startup process waiting for
> > 000000010000000000000008"
> >
> > Seems it's trying to replay the timeline history but can't? Running strace
> > against the startup process reveals that it's looking for the following:
> >
> > read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
> > read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
> > open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
> > epoll_create1(EPOLL_CLOEXEC) = 3
> >
> > But this file doesn't exist in the pg_wal directory on this particular
> > cluster node (psql01). If I copy the missing history files over from
> > another cluster node, it does move forward, but just gets stuck on
> > another history file, and I end up with the same error message as above
> > on psql01.
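
For completeness on the step quoted above: the history files were copied over with plain scp, roughly like this (the source node is just an example):

scp psql02:/data/patroni/pg_wal/00000098.history /data/patroni/pg_wal/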
> >
> > So I give up on that and try to recover the second node (psql02), but
> > this yields the following error:
> >
> > PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
> >
> > and I'm not able to find very much to move me forward on this one either.
> >
> > So again, after a few hours of trying to recover psql02, I give up and
> > move on to the third node. On the third cluster node (psql03) I get
> > this error, which I'm also not able to find much info on:
> >
> > FATAL:  syntax error in history file: f2W
> >
> > Hoping I'm able to recover the data from at least one of the nodes, make
> > it primary, and get it to replicate to the others.
> >
> > Thx, TK
> >
>
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com
>
