Re: psql: FATAL: the database system is starting up

From: Tom K <tomkcpr(at)gmail(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: psql: FATAL: the database system is starting up
Date: 2019-06-02 00:32:46
Message-ID: CAE3EmBAh7iaEXvuN1G=LH4U3ScxPgzLWZrwxv-pVAUXQmo6s=Q@mail.gmail.com
Lists: pgsql-general

On Sat, Jun 1, 2019 at 8:25 PM Tom K <tomkcpr(at)gmail(dot)com> wrote:

> So the best bet will be trying to get through this error then:
>
> [ PSQL02 ]
> PANIC: replication checkpoint has wrong magic 0 instead of 307747550
>
>
>
>
> On Sat, Jun 1, 2019 at 8:21 PM Tom K <tomkcpr(at)gmail(dot)com> wrote:
>
>>
>>
>> On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
>> wrote:
>>
>>> On 6/1/19 4:22 PM, Tom K wrote:
>>> >
>>> >
>>>
>>> >
>>> > Looks like this crash was far more catastrophic than I thought. By the
>>> > looks of things, I'm thinking psql02 would be my best bet.
>>> >
>>>
>>> The more I look at it the more I think the replication was not doing
>>> what you thought it was doing. That psql02 was the primary and that
>>> psql01 and psql03 were out of sync and/or defunct standbys.
>>>
>>
>> Now that I look at the files myself, that's the conclusion I was coming
>> to as well. Sample config:
>>
>> [root(at)psql02 base]# cat /etc/patroni.yml
>> scope: postgres
>> namespace: /db/
>> name: postgresql1
>>
>> restapi:
>>   listen: 192.168.0.124:8008
>>   connect_address: 192.168.0.124:8008
>>
>> etcd:
>>   host: 192.168.0.124:2379
>>
>> bootstrap:
>>   dcs:
>>     ttl: 30
>>     loop_wait: 10
>>     retry_timeout: 10
>>     maximum_lag_on_failover: 1048576
>>     postgresql:
>>       use_pg_rewind: true
>>
>>   initdb:
>>   - encoding: UTF8
>>   - data-checksums
>>
>>   pg_hba:
>>   - host replication replicator 127.0.0.1/32 md5
>>   - host replication replicator 192.168.0.108/0 md5
>>   - host replication replicator 192.168.0.124/0 md5
>>   - host replication replicator 192.168.0.118/0 md5
>>   - host all all 0.0.0.0/0 md5
>>
>>   users:
>>     admin:
>>       password: admin
>>       options:
>>         - createrole
>>         - createdb
>>
>> postgresql:
>>   listen: 192.168.0.124:5432
>>   bin_dir: /usr/pgsql-10/bin
>>   connect_address: 192.168.0.124:5432
>>   data_dir: /data/patroni
>>   pgpass: /tmp/pgpass
>>   unix_socket_directories: /data/patroni
>>   authentication:
>>     replication:
>>       username: replicator
>>       password: rep-pass
>>     superuser:
>>       username: postgres
>>       password: <SECRET>
>>   parameters:
>>     unix_socket_directories: '.'
>>
>> tags:
>>   nofailover: false
>>   noloadbalance: false
>>   clonefrom: false
>>   nosync: false
>> [root(at)psql02 base]#
>>
>> Or perhaps when the system crashed, the filesystem check simply moved the
>> folders out due to corruption.
>>
>>
>>>
>>>
Trying what we did above but on the second node:

2019-06-01 20:28:45.305 EDT [21745] LOG: database system was interrupted;
last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:28:46.142 EDT [21745] DEBUG: primary_slot_name =
'postgresql1'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: standby_mode = 'on'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: recovery_target_timeline =
latest
2019-06-01 20:28:46.142 EDT [21745] WARNING: recovery command file
"recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 20:28:46.142 EDT [21745] HINT: The database server will
regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 20:28:46.142 EDT [21745] LOG: entering standby mode
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid record length at
0/4C35CDF8: wanted 24, got 0
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid primary checkpoint record
2019-06-01 20:28:46.142 EDT [21745] LOG: using previous checkpoint record
at 0/4C34EDA8
2019-06-01 20:28:46.142 EDT [21745] DEBUG: redo record is at 0/4C34ED70;
shutdown FALSE
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next transaction ID: 0:1409831;
next OID: 237578
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next MultiXactId: 48; next
MultiXactOffset: 174
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest unfrozen transaction ID:
549, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest MultiXactId: 1, in
database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: commit timestamp Xid
oldest/newest: 0/0
2019-06-01 20:28:46.143 EDT [21745] DEBUG: transaction ID wrap limit is
2147484196, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: MultiXactId wrap limit is
2147483648, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: starting up replication slots
2019-06-01 20:28:46.143 EDT [21745] DEBUG: starting up replication origin
progress state
2019-06-01 20:28:46.143 EDT [21745] PANIC: replication checkpoint has
wrong magic 0 instead of 307747550
2019-06-01 20:28:46.143 EDT [21743] DEBUG: reaping dead processes
2019-06-01 20:28:46.143 EDT [21743] LOG: startup process (PID 21745) was
terminated by signal 6: Aborted
2019-06-01 20:28:46.143 EDT [21743] LOG: aborting startup due to startup
process failure
2019-06-01 20:28:46.144 EDT [21743] DEBUG: shmem_exit(1): 0
before_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG: shmem_exit(1): 5 on_shmem_exit
callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG: cleaning up dynamic shared
memory control segment with ID 222930115
2019-06-01 20:28:46.145 EDT [21743] DEBUG: proc_exit(1): 2 callbacks to
make
2019-06-01 20:28:46.146 EDT [21743] LOG: database system is shut down
2019-06-01 20:28:46.146 EDT [21743] DEBUG: exit(1)
2019-06-01 20:28:46.146 EDT [21743] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG: proc_exit(-1): 0 callbacks to
make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: logger shutting down
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(0): 0
before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(0): 0 on_shmem_exit
callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: proc_exit(0): 0 callbacks to
make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: exit(0)
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: proc_exit(-1): 0 callbacks to
make
^C
-bash-4.2$
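For reference, the expected magic value in the PANIC, 307747550, is 0x1257DADE
(REPLICATION_STATE_MAGIC), the value PostgreSQL checks in the first bytes of
pg_logical/replorigin_checkpoint at startup, so "wrong magic 0" suggests that
file was zeroed by the crash. A quick check (the data-directory path is assumed
from data_dir in the patroni.yml above):

```shell
# 307747550 decimal is 0x1257DADE: REPLICATION_STATE_MAGIC, the value
# PostgreSQL expects at the start of pg_logical/replorigin_checkpoint.
printf '0x%X\n' 307747550        # prints 0x1257DADE
# On the crashed node, dump the first bytes to confirm they were zeroed
# (path assumed from data_dir in the patroni.yml above):
#   od -A x -t x4 -N 8 /data/patroni/pg_logical/replorigin_checkpoint
```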

Removing the recovery.conf and restarting yields the following messages:

2019-06-01 20:31:12.231 EDT [21910] LOG: database system was interrupted;
last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:31:12.505 EDT [21910] LOG: invalid record length at
0/4C35CDF8: wanted 24, got 0
2019-06-01 20:31:12.505 EDT [21910] LOG: invalid primary checkpoint record
2019-06-01 20:31:12.505 EDT [21910] LOG: using previous checkpoint record
at 0/4C34EDA8
2019-06-01 20:31:12.505 EDT [21910] DEBUG: redo record is at 0/4C34ED70;
shutdown FALSE
2019-06-01 20:31:12.505 EDT [21910] DEBUG: next transaction ID: 0:1409831;
next OID: 237578
2019-06-01 20:31:12.505 EDT [21910] DEBUG: next MultiXactId: 48; next
MultiXactOffset: 174
2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest unfrozen transaction ID:
549, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest MultiXactId: 1, in
database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: commit timestamp Xid
oldest/newest: 0/0
2019-06-01 20:31:12.505 EDT [21910] DEBUG: transaction ID wrap limit is
2147484196, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: MultiXactId wrap limit is
2147483648, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: starting up replication slots
2019-06-01 20:31:12.505 EDT [21910] DEBUG: starting up replication origin
progress state
2019-06-01 20:31:12.505 EDT [21910] PANIC: replication checkpoint has
wrong magic 0 instead of 307747550
2019-06-01 20:31:12.506 EDT [21908] DEBUG: reaping dead processes
2019-06-01 20:31:12.506 EDT [21908] LOG: startup process (PID 21910) was
terminated by signal 6: Aborted
2019-06-01 20:31:12.506 EDT [21908] LOG: aborting startup due to startup
process failure
2019-06-01 20:31:12.506 EDT [21908] DEBUG: shmem_exit(1): 0
before_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG: shmem_exit(1): 5 on_shmem_exit
callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG: cleaning up dynamic shared
memory control segment with ID 976986759
2019-06-01 20:31:12.508 EDT [21908] DEBUG: proc_exit(1): 2 callbacks to
make
2019-06-01 20:31:12.508 EDT [21908] LOG: database system is shut down
2019-06-01 20:31:12.508 EDT [21908] DEBUG: exit(1)
2019-06-01 20:31:12.508 EDT [21908] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG: proc_exit(-1): 0 callbacks to
make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: logger shutting down
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(0): 0
before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(0): 0 on_shmem_exit
callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: proc_exit(0): 0 callbacks to
make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: exit(0)
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: proc_exit(-1): 0 callbacks to
make
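For what it's worth, this PANIC is raised while reading
pg_logical/replorigin_checkpoint at startup; a missing file is treated as "no
replication origins", while a zeroed one fails the magic check. So one possible
avenue (an assumption, not something confirmed in this thread: stop the server
and take a full file-level backup of the data directory first) would be to move
the corrupted file aside. A sketch, simulated in a scratch directory since the
real data_dir is /data/patroni:

```shell
# Simulate the fix in a throwaway directory; on the real node the path
# would be /data/patroni/pg_logical/replorigin_checkpoint, the server
# must be stopped, and /data/patroni backed up first.
demo=$(mktemp -d)
mkdir -p "$demo/pg_logical"
# Stand-in for the zeroed checkpoint file the PANIC complains about.
dd if=/dev/zero of="$demo/pg_logical/replorigin_checkpoint" bs=8 count=1 2>/dev/null
# Move it aside so startup sees no file instead of a bad magic value.
mv "$demo/pg_logical/replorigin_checkpoint" \
   "$demo/pg_logical/replorigin_checkpoint.corrupt"
ls "$demo/pg_logical"
```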

>>> --
>>> Adrian Klaver
>>> adrian(dot)klaver(at)aklaver(dot)com
>>>
>>
