Re: psql: FATAL: the database system is starting up

From: Tom K <tomkcpr(at)gmail(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: psql: FATAL: the database system is starting up
Date: 2019-06-01 02:53:35
Message-ID: CAE3EmBAfMa=eA1LqBB0zf-kZwBPTbRhzY5wUxcPYNQ8hathDQA@mail.gmail.com
Lists: pgsql-general

On Wed, May 29, 2019 at 10:28 AM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
wrote:

> On 5/28/19 6:59 PM, Tom K wrote:
> >
> >
> > On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com
> > <mailto:adrian(dot)klaver(at)aklaver(dot)com>> wrote:
> >
>
> >
> > Correct. Master election occurs through Patroni. WAL level is set to:
> >
> > wal_level = 'replica'
> >
> > So no archiving.
> >
> >
> >
> > >
> > > After the most recent crash 2-3 weeks ago, the cluster is now
> > running
> > > into this message but I'm not able to make heads or tails out of
> why
> > > it's throwing this:
> >
> > So you have not been able to run the cluster the past 2-3 weeks or is
> > that more recent?
> >
> >
> > Haven't been able to bring this PostgreSQL cluster up ( run the cluster
> > ) since 2-3 weeks ago. Tried quite a few combinations of options to
> > recover this. No luck. Had storage failures earlier, even with
> > corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to
> > come up each time without any recovery effort on my part.
> >
> >
> > When you refer to history files below are you talking about WAL
> > files or
> > something else?
> >
> > Is this:
> >
> > "recovery command file "recovery.conf" specified neither
> > primary_conninfo nor restore_command"
> >
> > true?
> >
> >
> > True. recovery.conf is controlled by Patroni. Contents of this file
> > remained the same for all the cluster nodes with the exception of the
> > primary_slot_name:
> >
> > [root(at)psql01 postgresql-patroni-etcd]# cat recovery.conf
> > primary_slot_name = 'postgresql0'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root(at)psql01 postgresql-patroni-etcd]#
> >
> > [root(at)psql02 postgres-backup]# cat recovery.conf
> > primary_slot_name = 'postgresql1'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root(at)psql02 postgres-backup]#
> >
> > [root(at)psql03 postgresql-patroni-backup]# cat recovery.conf
> > primary_slot_name = 'postgresql2'
> > standby_mode = 'on'
> > recovery_target_timeline = 'latest'
> > [root(at)psql03 postgresql-patroni-backup]#
> >
> > I've made a copy of the root postgres directory over to another location
> > so when troubleshooting, I can always revert to the first state the
> > cluster was in when it failed.
>
> I have no experience with Patroni so I will be of no help there. You
> might get more useful information from:
>
> https://github.com/zalando/patroni
> Community
>
> There are two places to connect with the Patroni community: on github,
> via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
> you're using Patroni, or just interested, please join us.
>

Will post there as well. Thank you. My thinking was to post here first
since I suspect the Patroni community will simply refer me back here given
that the PostgreSQL errors are originating directly from PostgreSQL.

>
> That being said, can you start the copied Postgres instance without
> using the Patroni instrumentation?
>

Yes, that is something I have been trying to do actually. But I hit a dead
end with the three errors above.

So what I did was copy a single node's backed-up copy of the data files
to */data/patroni* ( this is the psql data directory as defined through
Patroni ) on that same node, then ran this ( psql03 = 192.168.0.118 ):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni \
    --config-file=/data/patroni/postgresql.conf \
    --listen_addresses=192.168.0.118 --max_worker_processes=8 \
    --max_locks_per_transaction=64 --wal_level=replica \
    --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 \
    --max_replication_slots=10 --max_connections=100 --hot_standby=on \
    --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This resulted in one of the 3 messages above. Hence the post here. If I
can start a single instance, I should be fine, since I could then 1)
replicate over to the other two nodes or 2) simply take a dump,
reinitialize all the databases, then restore the dump.
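For reference, option 2 would look something like the following sketch ( binary path is from my install; the host is psql03; the dump location here is just a placeholder, and I'm echoing the commands rather than running them yet ):

```shell
# Sketch only -- paths/host are from my environment and the dump location
# is a placeholder. pg_dumpall captures all databases plus global objects
# (roles, tablespaces) in one plain-SQL file.
PGBIN=/usr/pgsql-10/bin
HOST=192.168.0.118
DUMP=/tmp/full-cluster.sql

# 1) With a single instance up, dump everything:
DUMP_CMD="$PGBIN/pg_dumpall -h $HOST -p 5432 -U postgres -f $DUMP"
# 2) After re-initializing the cluster (initdb, or letting Patroni
#    bootstrap a fresh one), replay the dump:
RESTORE_CMD="$PGBIN/psql -h $HOST -p 5432 -U postgres -f $DUMP postgres"

echo "$DUMP_CMD"
echo "$RESTORE_CMD"
```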

Using the above procedure I get one of three error messages when using the
data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.
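On the psql02 PANIC specifically: if I'm reading the PostgreSQL source right, 307747550 is 0x1257DADE, the magic number written at the start of pg_logical/replorigin_checkpoint, so "wrong magic 0" would mean that file got zero-filled during one of the storage failures. A quick check I'm considering ( data directory path assumed from my setup ):

```shell
# Assumed data directory; adjust to your setup.
CKPT=/data/patroni/pg_logical/replorigin_checkpoint

# On a healthy cluster (little-endian hardware) the first 4 bytes should
# read "de da 57 12", i.e. the magic 0x1257DADE (= 307747550 decimal).
# All zeros would match the "wrong magic 0" PANIC.
if [ -f "$CKPT" ]; then
    od -A n -t x1 -N 4 "$CKPT"
else
    echo "no replorigin_checkpoint at $CKPT"
fi
```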

>
> >
> > Thx,
> > TK
> >
>
>
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com
>
