From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: skeefe(at)rdx(dot)com, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #10142: Downstream standby indefinitely waits for an old WAL log in new timeline on WAL cascading replication
Date: 2014-04-29 07:23:58
Message-ID: 535F538E.9020206@vmware.com
Lists: pgsql-bugs
On 04/25/2014 08:43 PM, skeefe(at)rdx(dot)com wrote:
> The issue we are experiencing is with PostgreSQL 9.2.8 cascading WAL
> replication. If the master goes down during a massive transaction and we
> promote the first slave, the next slave looks for a WAL segment that never
> existed: one named for the new timeline, but from before the timeline split.
I can't reproduce this. Would it be possible to create a self-contained
script that reproduces the whole scenario? Something like the attached
(which I used to try to reproduce this).
> Below is how to recreate the issue:
>
> 1. Create M using postgresql.conf_M. Start M.
> CREATE TABLE t_test (id int4);
>
> 2. Create S1 from M using postgresql.conf_S1 and recovery.conf_S1 (I used
> rsync). Start S1
>
> 3. Create S2 from M using postgresql.conf_S2 and recovery.conf_S2 (I used
> rsync). Start S2
>
> 4. Insert data in t_test table in M
> INSERT INTO t_test SELECT * FROM generate_series(1, 250000) ;
> 5. Important: Do not shut down M. If you want, you can crash M by killing its
> pids. I just let it run and immediately proceeded to the next step. The idea
> here is to promote S1 before M transmits the last WAL segment, which has the
> COMMIT of the above INSERT.
>
> 6. Promote S1. S1 will change its timeline.
>
> 7. S2 will not recognize the new timeline of its master S1.
Yeah, that's expected behavior, or a known issue if you will, which was
fixed in 9.3. However, S1 should automatically terminate the connection,
with a message in the log like this:
LOG: terminating all walsender processes to force cascaded standby(s) to
update timeline and reconnect
That should allow S2 to find the new timeline, without restarting, as
long as you have a WAL archive set up.
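To spell that out: "a WAL archive set up" means the master (and S1, so that segments on the new timeline get archived after promotion) copies each completed segment somewhere the standbys' restore_command can read it. A minimal sketch in postgresql.conf, reusing the /data/postgres/rep_poc directory from your scripts (the archive subdirectory name is my invention):

```
# postgresql.conf on M (and on S1, for post-promotion segments)
wal_level = hot_standby
archive_mode = on
# Copy each completed segment into a shared archive directory,
# refusing to overwrite an existing file.
archive_command = 'test ! -f /data/postgres/rep_poc/archive/%f && cp %p /data/postgres/rep_poc/archive/%f'
```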
> PGSTOP S2 and
> then PGSTART. S2 will now change its timeline. However, as you see in the
> pg_log, it will wait for a WAL segment that will never arrive. It will look
> for WALs from the previous timeline under the new timeline's file names. E.g.,
> it will wait for 0000000A00000026000000F1, but that segment only exists as
> 0000000900000026000000F1. So it will wait forever, and if you try to connect
> to S2 you will see the error "FATAL: the database system is starting up".
This seems to be the crux of this bug report. I just tested this and
didn't see this behavior. S2 tries restoring files from the archive
first, but then it connects to S1 and catches up.
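As an aside, the mismatch you quoted is visible in the segment names themselves: a WAL file name is 24 hex digits, and the first 8 are the timeline ID, followed by 8 each for the log and segment numbers. A quick sketch of pulling those fields apart (the helper name is mine):

```shell
#!/bin/sh
# Split a 24-hex-digit WAL segment name into its three fields:
# timeline ID, log (high) number, and segment (low) number.
wal_fields() {
    name=$1
    printf 'timeline=%s log=%s seg=%s\n' \
        "$(printf '%s' "$name" | cut -c1-8)" \
        "$(printf '%s' "$name" | cut -c9-16)" \
        "$(printf '%s' "$name" | cut -c17-24)"
}

wal_fields 0000000A00000026000000F1   # timeline 0x0A (10), the one S2 waits for
wal_fields 0000000900000026000000F1   # timeline 0x09 (9), the one that exists
```

Same log and segment numbers in both names; only the timeline ID differs, which is exactly the "looks for WALs from previous timeline in new timeline file naming" symptom.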
> recovery.conf for S1:
> restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
> recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg'
>
> recovery_target_timeline = 'latest'
>
> recovery.conf for S2:
> restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
> recovery_end_command = 'rm -f /data/postgres/rep_poc/trigger.cfg'
>
> recovery_target_timeline = 'latest'
There are no primary_conninfo lines here, so you're either not showing
us the full recovery.conf files used, or you haven't in fact set up
cascading replication.
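For comparison, a cascading S2 would normally have a recovery.conf along these lines, pointing at S1 rather than at the master (host, port, and user here are hypothetical placeholders):

```
# recovery.conf on S2, streaming from S1
standby_mode = 'on'
primary_conninfo = 'host=s1.example.com port=5432 user=replication'
restore_command = '/data/postgres/rep_poc/restore_command.sh %f %p %r'
recovery_target_timeline = 'latest'
```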
- Heikki
Attachment: setup_cascading_replication.sh (application/x-shellscript, 1.3 KB)