From: | Thom Brown <thom(at)linux(dot)com> |
---|---|
To: | Josh Berkus <josh(at)agliodbs(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Strange issues with 9.2 pg_basebackup & replication |
Date: | 2012-05-15 16:36:07 |
Message-ID: | CAA-aLv7RcQMX+k6eFaqNK8By1CPySNZfp1jWTmnOBb=rcJDZ8A@mail.gmail.com |
Lists: | pgsql-hackers |
On 13 May 2012 16:08, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> More issues: promoting intermediate standby breaks replication.
>
> To be a bit blunt here, has anyone tested cascading replication *at all*
> before this?
>
> So, same setup as previous message.
>
> 1. Shut down master-master.
>
> 2. pg_ctl promote master-replica
>
> 3. replication breaks. error message on replica-replica:
>
> FATAL: timeline 2 of the primary does not match recovery target timeline 1
>
> 4. No amount of adjustment on replica-replica will get it replicating
> again.
>
> Note that replica-replica was configured with:
>
> recovery_target_timeline = 'latest'
I can recreate this "issue", although the docs say:
"Promoting a cascading standby terminates the immediate downstream
replication connections which it serves. This is because the timeline
becomes different between standbys, and they can no longer continue
replication. The affected standby(s) may reconnect to reestablish
streaming replication."
(http://www.postgresql.org/docs/9.2/static/warm-standby.html#CASCADING-REPLICATION)
However, that isn't what I see: after I restart the standby, it still
doesn't reconnect. I've been informed that this should work fine if a
WAL archive has been configured (which should be used anyway).
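For reference, a minimal sketch of what the cascaded standby's recovery.conf might look like with an archive configured. The function name and its host, port and archive directory arguments are hypothetical and chosen only for illustration; none of them come from the report above. The intent is that restore_command lets the standby find the new timeline's history file in the archive, so that recovery_target_timeline = 'latest' has something to follow.

```shell
# Print a recovery.conf for a cascaded standby.
# Arguments (all hypothetical example values): upstream host, upstream port,
# and the directory the upstream archives its WAL into.
standby_recovery_conf() {
    host=$1
    port=$2
    archive_dir=$3
    cat <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$host port=$port'
recovery_target_timeline = 'latest'
restore_command = 'cp $archive_dir/%f %p'
EOF
}

# Example use (paths are assumptions):
#   standby_recovery_conf localhost 5433 /tmp/arch > "$PGDATA/recovery.conf"
```

This only generates the file; whether it actually clears the timeline mismatch is the open question in this thread.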
But one new problem I appear to have is that once I set up archiving
and restart, then try pg_basebackup, it gets stuck and never shows any
progress. If I terminate pg_basebackup in this state and attempt to
restart it more times than max_wal_senders allows, it can no longer
run: pg_basebackup didn't disconnect the stream, so the leftover
connections end up using all the senders, and they still show up in
pg_stat_replication. My theory is that if you enable archiving,
restart postgres, then generate enough WAL that there is a file or
two in the archive, pg_basebackup can't stream anything. Once I
restart the server, it's fine and continues as normal. This has the
same symptoms as the "pg_basebackup from running standby with
streaming" issue.
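One way to watch for the exhausted-senders state described above is to compare the row count of pg_stat_replication with max_wal_senders. The following is a rough sketch, assuming psql can connect to a database named postgres with default credentials; the function names are made up for illustration.

```shell
# Count walsender connections currently listed in pg_stat_replication.
# Assumes psql can reach the server with default connection settings.
count_wal_senders() {
    psql -At -d postgres -c "SELECT count(*) FROM pg_stat_replication"
}

# Read the configured max_wal_senders from the running server.
max_wal_senders() {
    psql -At -d postgres -c "SHOW max_wal_senders"
}

# Succeeds (exit status 0) when no walsender slot is left.
# $1 = senders in use, $2 = configured maximum.
senders_exhausted() {
    [ "$1" -ge "$2" ]
}

# Example use:
#   if senders_exhausted "$(count_wal_senders)" "$(max_wal_senders)"; then
#       echo "all walsender slots are in use"
#   fi
```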
Steps to recreate:
1) initdb new cluster
2) start new cluster
3) make archive dir (in my case, /tmp/arch) and set the following:
   wal_level = hot_standby
   max_wal_senders = 3
   archive_mode = on
   archive_command = 'cp %p /tmp/arch/%f'
4) Set pg_hba.conf to allow streaming replication connections
5) Restart the cluster
6) Create a table and insert a few hundred thousand rows until
/tmp/arch shows some WAL files
7) Run: pg_basebackup -x stream -D s1 -Pv
This actually does finish eventually but it appears to need some
encouragement by generating some WAL and issuing a checkpoint:
thom@swift:~/Development$ time pg_basebackup -x stream -D s1 -Pv
xlog start point: 0/4000020
pg_basebackup: starting background WAL receiver
53951/53951 kB (100%), 1/1 tablespace
xlog end point: 0/5DE15E0
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
real 2m37.456s
user 0m0.016s
sys 0m0.724s
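The steps to recreate could be scripted roughly as follows. This is a sketch under assumptions, not the exact commands that were run: the data directory, the trust-authenticated local replication line in pg_hba.conf, the table name repro_t and the row count are all hypothetical. Only the four settings and the pg_basebackup invocation are taken from the steps above.

```shell
PGDATA="${PGDATA:-$PWD/repro_data}"      # hypothetical data directory
ARCHIVE_DIR="${ARCHIVE_DIR:-/tmp/arch}"  # archive dir from step 3

# Print the postgresql.conf settings from step 3.
repro_settings() {
    cat <<EOF
wal_level = hot_standby
max_wal_senders = 3
archive_mode = on
archive_command = 'cp %p $ARCHIVE_DIR/%f'
EOF
}

# Print a pg_hba.conf line for step 4 (local trust auth is an assumption).
repro_hba_line() {
    printf 'local   replication   all   trust\n'
}

# Run steps 1 to 7 in order; stops at the first failing command.
run_repro() {
    initdb -D "$PGDATA" || return 1                          # step 1
    pg_ctl -D "$PGDATA" -w start || return 1                 # step 2
    mkdir -p "$ARCHIVE_DIR" || return 1                      # step 3
    repro_settings >> "$PGDATA/postgresql.conf" || return 1
    repro_hba_line >> "$PGDATA/pg_hba.conf" || return 1      # step 4
    pg_ctl -D "$PGDATA" -w restart || return 1               # step 5
    # Step 6: generate WAL; repeat with more rows if the archive stays empty.
    psql -d postgres -c \
        "CREATE TABLE repro_t AS SELECT generate_series(1, 500000) AS id" || return 1
    ls "$ARCHIVE_DIR"
    pg_basebackup -x stream -D s1 -Pv                        # step 7
}

# Example use: run_repro
```

Nothing runs until run_repro is called, so the settings can be inspected first with repro_settings.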
If I terminate pg_basebackup and restart it without generating
additional WAL, it doesn't appear to release the streaming connection
ever (or not within my patience limit of a few minutes). And I can't
free these connections without restarting the cluster.
Setting aside the current issue of getting stuck when creating a
standby from a standby: once I get the standby up and running as a hot
standby, I still get the mismatched timeline error, so adding WAL
archiving didn't appear to resolve this for me.
--
Thom