From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: skeefe(at)rdx(dot)com, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL cascading replication
Date: 2014-04-29 07:23:58
Message-ID: 535F538E.9020206@vmware.com
Lists: pgsql-bugs
On 04/25/2014 08:43 PM, skeefe(at)rdx(dot)com wrote:
> The issue we are experiencing is with PostgreSQL 9.2.8 cascading WAL
> replication. If the master goes down during a massive transaction and we
> promote the first slave, the next slave looks for a WAL segment that never
> existed: one named for the new timeline, but from before the timeline split.
I can't reproduce this. Would it be possible to create a self-contained
script that reproduces the whole scenario? Something like the attached
(which I used to try to reproduce this).
> Below is how to recreate the issue:
>
> 1. Create M using postgresql.conf_M. Start M.
> CREATE TABLE t_test (id int4);
>
> 2. Create S1 from M using postgresql.conf_S1 and recovery.conf_S1 (I used
> rsync). Start S1
>
> 3. Create S2 from M using postgresql.conf_S2 and recovery.conf_S2 (I used
> rsync). Start S2
>
> 4. Insert data in t_test table in M
> INSERT INTO t_test SELECT * FROM generate_series(1, 250000) ;
> 5. Important: Do not shut down M. If you want, you can crash M by killing its
> pids. I just let it run and immediately proceeded to the next step. The idea
> here is to promote S1 before M transmits the last WAL segment, which has the
> COMMIT of the above INSERT.
>
> 6. Promote S1. S1 will change its timeline.
>
> 7. S2 will not recognize the new timeline of its master S1.
Yeah, that's expected behavior, or a known issue if you will, which was
fixed in 9.3. However, S1 should automatically terminate the connection,
with a message in the log like this:
LOG: terminating all walsender processes to force cascaded standby(s) to
update timeline and reconnect
That should allow S2 to find the new timeline, without restarting, as
long as you have a WAL archive set up.
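To spell that out: "a WAL archive set up" means the master (and S1, so that segments on the new timeline get archived after promotion) copies each completed segment somewhere the standbys' restore_command can read it. A minimal sketch in postgresql.conf, reusing the /data/postgres/rep_poc directory from your scripts (the archive subdirectory name is my invention):

```
# postgresql.conf on M (and on S1, for post-promotion segments)
wal_level = hot_standby
archive_mode = on
# Copy each completed segment into a shared archive directory,
# refusing to overwrite an existing file.
archive_command = 'test ! -f /data/postgres/rep_poc/archive/%f && cp %p /data/postgres/rep_poc/archive/%f'
```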
> PGSTOP S2 and
> then PGSTART. S2 will now change its timeline. However, as you see in the
> pg_log, it will wait for a WAL segment that will never arrive. It will look
> for WALs from the previous timeline under the new timeline's file names. E.g.,
> it will wait for 0000000A00000026000000F1, but that segment only exists as
> 0000000900000026000000F1. So it will wait forever, and if you try to connect
> to S2 you will see the error "FATAL: the database system is starting up".
This seems to be the crux of this bug report. I just tested this and
didn't see this behavior. S2 tries restoring files from the archive
first, but then it connects to S1 and catches up.
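As an aside, the mismatch you quoted is visible in the segment names themselves: a WAL file name is 24 hex digits, and the first 8 are the timeline ID, followed by 8 each for the log and segment numbers. A quick sketch of pulling those fields apart (the helper name is mine):

```shell
#!/bin/sh
# Split a 24-hex-digit WAL segment name into its three fields:
# timeline ID, log (high) number, and segment (low) number.
wal_fields() {
    name=$1
    printf 'timeline=%s log=%s seg=%s\n' \
        "$(printf '%s' "$name" | cut -c1-8)" \
        "$(printf '%s' "$name" | cut -c9-16)" \
        "$(printf '%s' "$name" | cut -c17-24)"
}

wal_fields 0000000A00000026000000F1   # timeline 0x0A (10), the one S2 waits for
wal_fields 0000000900000026000000F1   # timeline 0x09 (9), the one that exists
```

Same log and segment numbers in both names; only the timeline ID differs, which is exactly the "looks for WALs from previous timeline in new timeline file naming" symptom.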
> recovery.conf for S1:
> restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
> recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg'
>
> recovery_target_timeline = 'latest'
>
> recovery.conf for S2:
> restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
> recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg'
>
> recovery_target_timeline = 'latest'
There are no primary_conninfo lines here, so you're either not showing
us the full recovery.conf files used, or you haven't in fact set up
cascading replication.
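For comparison, a cascading S2 would normally have a recovery.conf along these lines, pointing at S1 rather than at the master (host, port, and user here are hypothetical placeholders):

```
# recovery.conf on S2, streaming from S1
standby_mode = 'on'
primary_conninfo = 'host=s1.example.com port=5432 user=replication'
restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
recovery_target_timeline = 'latest'
```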
- Heikki
Attachment: setup_cascading_replication.sh (application/x-shellscript, 1.3 KB)