Re: wal_retrieve_retry_interval

From: Ludovic Vaugeois-Pepin <ludovicvp(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: wal_retrieve_retry_interval
Date: 2017-05-28 13:41:18
Message-ID: CAAJDx8Oi0tD+sriCf-7tTorOvRy0syPoEke34phXLfFLFMeNHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sun, May 28, 2017 at 9:49 AM, Ludovic Vaugeois-Pepin
<ludovicvp(at)gmail(dot)com> wrote:
>
>
> On Sun, May 28, 2017 at 6:54 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>>
>> On Sat, May 27, 2017 at 2:40 PM, Ludovic Vaugeois-Pepin
>> <ludovicvp(at)gmail(dot)com> wrote:
>> > Say, with 9.6.2, a hot_standby fails to connect to a replication slot:
>> > FATAL: could not start WAL streaming: ERROR: replication slot "test3"
>> > does not exist
>> > or
>> > FATAL: could not connect to the primary server: FATAL: the database
>> > system is starting up
>> >
>> > Is there a way to reduce the time it takes until the next attempt? I
>> > assumed, wrongly I think, that this would be wal_retrieve_retry_interval,
>> > but it seems that it won't make a difference. I tried setting it to 3s, but
>> > it seems to take 15s still. Here are two log samples:
>>
>> Could you double-check your configuration? If I set
>> wal_retrieve_retry_interval to 1s on a standby, I am able to see a
>> connection attempt from a WAL receiver happening with this interval of
>> time in the case of repetitive failures.
>
>
> It really is set at 3s on all servers (master and standbies) earlier in the "deployment" process at the same time "listen_addresses", "hot_standby", and others are set. This doesn't seem to happen every time I run tests. I increased logging to DEBUG1. This is what I got when the problem occurred again. Note that the empty line is there in the log file.
>
> < 2017-05-28 09:29:36.127 CEST > LOG: database system was shut down in recovery at 2017-05-28 09:29:03 CEST
> < 2017-05-28 09:29:36.127 CEST > LOG: entering standby mode
> < 2017-05-28 09:29:36.127 CEST > DEBUG: checkpoint record is at 0/80002B8
> < 2017-05-28 09:29:36.128 CEST > DEBUG: redo record is at 0/80002B8; shutdown TRUE
> < 2017-05-28 09:29:36.128 CEST > DEBUG: next transaction ID: 0:2535; next OID: 18660
> < 2017-05-28 09:29:36.128 CEST > DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
> < 2017-05-28 09:29:36.128 CEST > DEBUG: oldest unfrozen transaction ID: 1750, in database 1
> < 2017-05-28 09:29:36.128 CEST > DEBUG: oldest MultiXactId: 1, in database 1
> < 2017-05-28 09:29:36.128 CEST > DEBUG: commit timestamp Xid oldest/newest: 0/0
> < 2017-05-28 09:29:36.128 CEST > DEBUG: transaction ID wrap limit is 2147485397, limited by database with OID 1
> < 2017-05-28 09:29:36.128 CEST > DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
> < 2017-05-28 09:29:36.128 CEST > DEBUG: starting up replication slots
> < 2017-05-28 09:29:36.128 CEST > DEBUG: resetting unlogged relations: cleanup 1 init 0
> < 2017-05-28 09:29:36.129 CEST > DEBUG: initializing for hot standby
> < 2017-05-28 09:29:36.129 CEST > DEBUG: recovery snapshots are now enabled
> < 2017-05-28 09:29:36.129 CEST > LOG: consistent recovery state reached at 0/8000328
> < 2017-05-28 09:29:36.129 CEST > LOG: invalid record length at 0/8000328: wanted 24, got 0
> < 2017-05-28 09:29:36.129 CEST > LOG: database system is ready to accept read only connections
> < 2017-05-28 09:29:36.135 CEST > FATAL: could not connect to the primary server: FATAL: the database system is starting up
>
> < 2017-05-28 09:29:36.135 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0
> < 2017-05-28 09:29:51.153 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0
> < 2017-05-28 09:29:51.158 CEST > LOG: fetching timeline history file for timeline 4 from primary server
> < 2017-05-28 09:29:51.160 CEST > LOG: started streaming WAL from primary at 0/8000000 on timeline 3
> < 2017-05-28 09:29:51.160 CEST > LOG: replication terminated by primary server
> < 2017-05-28 09:29:51.160 CEST > DETAIL: End of WAL reached on timeline 3 at 0/8000328.
> < 2017-05-28 09:29:51.161 CEST > DEBUG: walreceiver ended streaming and awaits new instructions
> < 2017-05-28 09:29:51.161 CEST > LOG: new target timeline is 4
> < 2017-05-28 09:29:51.161 CEST > DEBUG: invalid record length at 0/8000328: wanted 24, got 0
> < 2017-05-28 09:29:51.161 CEST > LOG: restarted WAL streaming at 0/8000000 on timeline 4
> < 2017-05-28 09:29:51.191 CEST > LOG: redo starts at 0/8000328

And this is the log with log_min_messages DEBUG5:

< 2017-05-28 14:48:10.108 CEST > LOG: invalid record length at
0/8000398: wanted 24, got 0
< 2017-05-28 14:48:10.108 CEST > DEBUG: switched WAL source from
archive to stream after failure
< 2017-05-28 14:48:10.110 CEST > DEBUG: checkpointer updated shared
memory configuration values
< 2017-05-28 14:48:10.110 CEST > LOG: database system is ready to
accept read only connections
< 2017-05-28 14:48:10.112 CEST > DEBUG: find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-28 14:48:10.112 CEST > DEBUG: find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-28 14:48:10.126 CEST > FATAL: could not connect to the
primary server: FATAL: the database system is starting up

< 2017-05-28 14:48:10.126 CEST > DEBUG: shmem_exit(1): 1
before_shmem_exit callbacks to make
< 2017-05-28 14:48:10.126 CEST > DEBUG: shmem_exit(1): 4
on_shmem_exit callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG: proc_exit(1): 2 callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG: exit(1)
< 2017-05-28 14:48:10.127 CEST > DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG: shmem_exit(-1): 0
on_shmem_exit callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG: proc_exit(-1): 0 callbacks to make
< 2017-05-28 14:48:10.127 CEST > DEBUG: switched WAL source from
stream to archive after failure
< 2017-05-28 14:48:10.127 CEST > DEBUG: invalid record length at
0/8000398: wanted 24, got 0
< 2017-05-28 14:48:10.127 CEST > DEBUG: switched WAL source from
archive to stream after failure
< 2017-05-28 14:48:10.130 CEST > DEBUG: reaping dead processes
< 2017-05-28 14:48:11.062 CEST > DEBUG: forked new backend, pid=3184 socket=12
< 2017-05-28 14:48:11.063 CEST > DEBUG: postgres child[3184]: starting with (
< 2017-05-28 14:48:11.063 CEST > DEBUG: postgres
< 2017-05-28 14:48:11.063 CEST > DEBUG: )
< 2017-05-28 14:48:11.063 CEST > DEBUG: InitPostgres
< 2017-05-28 14:48:11.063 CEST > DEBUG: my backend ID is 2
< 2017-05-28 14:48:11.064 CEST > DEBUG: StartTransaction
< 2017-05-28 14:48:11.064 CEST > DEBUG: name: unnamed; blockState:
DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1,
children:
< 2017-05-28 14:48:11.075 CEST > DEBUG: CommitTransaction
< 2017-05-28 14:48:11.075 CEST > DEBUG: name: unnamed; blockState:
STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1,
children:
< 2017-05-28 14:48:11.077 CEST > DEBUG: shmem_exit(0): 1
before_shmem_exit callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG: shmem_exit(0): 6
on_shmem_exit callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG: proc_exit(0): 3 callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG: exit(0)
< 2017-05-28 14:48:11.077 CEST > DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG: shmem_exit(-1): 0
on_shmem_exit callbacks to make
< 2017-05-28 14:48:11.077 CEST > DEBUG: proc_exit(-1): 0 callbacks to make
< 2017-05-28 14:48:11.081 CEST > DEBUG: reaping dead processes
< 2017-05-28 14:48:11.081 CEST > DEBUG: server process (PID 3184)
exited with exit code 0
< 2017-05-28 14:48:11.095 CEST > DEBUG: forked new backend, pid=3186 socket=12
< 2017-05-28 14:48:11.096 CEST > DEBUG: postgres child[3186]: starting with (
< 2017-05-28 14:48:11.096 CEST > DEBUG: postgres
< 2017-05-28 14:48:11.096 CEST > DEBUG: )

...

< 2017-05-28 14:48:24.985 CEST > DEBUG: postgres child[3430]: starting with (
< 2017-05-28 14:48:24.985 CEST > DEBUG: postgres
< 2017-05-28 14:48:24.985 CEST > DEBUG: )
< 2017-05-28 14:48:24.985 CEST > DEBUG: InitPostgres
< 2017-05-28 14:48:24.985 CEST > DEBUG: my backend ID is 2
< 2017-05-28 14:48:24.986 CEST > DEBUG: StartTransaction
< 2017-05-28 14:48:24.986 CEST > DEBUG: name: unnamed; blockState:
DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1,
children:
< 2017-05-28 14:48:24.989 CEST > DEBUG: CommitTransaction
< 2017-05-28 14:48:24.989 CEST > DEBUG: name: unnamed; blockState:
STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1,
children:
< 2017-05-28 14:48:24.989 CEST > DEBUG: StartTransactionCommand
< 2017-05-28 14:48:24.989 CEST > DEBUG: StartTransaction
< 2017-05-28 14:48:24.989 CEST > DEBUG: name: unnamed; blockState:
DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1,
children:
< 2017-05-28 14:48:24.992 CEST > DEBUG: CommitTransactionCommand
< 2017-05-28 14:48:24.992 CEST > DEBUG: CommitTransaction
< 2017-05-28 14:48:24.992 CEST > DEBUG: name: unnamed; blockState:
STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1,
children:
< 2017-05-28 14:48:24.993 CEST > DEBUG: shmem_exit(0): 1
before_shmem_exit callbacks to make
< 2017-05-28 14:48:24.993 CEST > DEBUG: shmem_exit(0): 6
on_shmem_exit callbacks to make
< 2017-05-28 14:48:24.994 CEST > DEBUG: proc_exit(0): 3 callbacks to make
< 2017-05-28 14:48:24.994 CEST > DEBUG: exit(0)
< 2017-05-28 14:48:24.994 CEST > DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
< 2017-05-28 14:48:24.994 CEST > DEBUG: shmem_exit(-1): 0
on_shmem_exit callbacks to make
< 2017-05-28 14:48:24.994 CEST > DEBUG: proc_exit(-1): 0 callbacks to make
< 2017-05-28 14:48:24.996 CEST > DEBUG: reaping dead processes
< 2017-05-28 14:48:24.996 CEST > DEBUG: server process (PID 3430)
exited with exit code 0
< 2017-05-28 14:48:25.144 CEST > DEBUG: switched WAL source from
stream to archive after failure
< 2017-05-28 14:48:25.144 CEST > DEBUG: invalid record length at
0/8000398: wanted 24, got 0
< 2017-05-28 14:48:25.144 CEST > DEBUG: switched WAL source from
archive to stream after failure
< 2017-05-28 14:48:25.147 CEST > DEBUG: find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver"
< 2017-05-28 14:48:25.147 CEST > DEBUG: find_in_dynamic_libpath:
trying "/usr/pgsql-9.6/lib/libpqwalreceiver.so"
< 2017-05-28 14:48:25.161 CEST > LOG: fetching timeline history file
for timeline 4 from primary server
< 2017-05-28 14:48:25.166 CEST > LOG: started streaming WAL from
primary at 0/8000000 on timeline 3
< 2017-05-28 14:48:25.166 CEST > DEBUG: sendtime 2017-05-28
14:48:25.16509+02 receipttime 2017-05-28 14:48:25.166539+02
replication apply delay (N/A) transfer latency 1 ms
< 2017-05-28 14:48:25.166 CEST > LOG: replication terminated by primary server
< 2017-05-28 14:48:25.166 CEST > DETAIL: End of WAL reached on
timeline 3 at 0/8000398.
< 2017-05-28 14:48:25.166 CEST > DEBUG: sending write 0/8000398 flush
0/8000398 apply 0/8000398
< 2017-05-28 14:48:25.167 CEST > DEBUG: walreceiver ended streaming
and awaits new instructions
< 2017-05-28 14:48:25.168 CEST > LOG: new target timeline is 4
< 2017-05-28 14:48:25.168 CEST > DEBUG: switched WAL source from
stream to archive after failure
< 2017-05-28 14:48:25.168 CEST > DEBUG: invalid record length at
0/8000398: wanted 24, got 0
< 2017-05-28 14:48:25.168 CEST > DEBUG: switched WAL source from
archive to stream after failure
< 2017-05-28 14:48:25.169 CEST > LOG: restarted WAL streaming at
0/8000000 on timeline 4
< 2017-05-28 14:48:25.170 CEST > DEBUG: sendtime 2017-05-28
14:48:25.168651+02 receipttime 2017-05-28 14:48:25.170063+02
replication apply delay (N/A) transfer latency 1 ms
< 2017-05-28 14:48:25.170 CEST > DEBUG: creating and filling new WAL file
< 2017-05-28 14:48:25.225 CEST > DEBUG: done creating and filling new WAL file
< 2017-05-28 14:48:25.225 CEST > DEBUG: sending write 0/80006C8 flush
0/8000398 apply 0/8000398
< 2017-05-28 14:48:25.226 CEST > DEBUG: sending write 0/80006C8 flush
0/80006C8 apply 0/8000398
< 2017-05-28 14:48:25.227 CEST > DEBUG: sending hot standby feedback
xmin 2535 epoch 0

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2017-05-28 13:51:07 Re: Help with terminology to describe what my software does please?
Previous Message Neil Anderson 2017-05-28 12:49:03 Help with terminology to describe what my software does please?