| From: | Magnus Hagander <magnus(at)hagander(dot)net> | 
|---|---|
| To: | Ludovic Vaugeois-Pepin <ludovicvp(at)gmail(dot)com> | 
| Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: [GENERAL] pg_basebackup error: replication slot "pg_basebackup_2194" already exists | 
| Date: | 2017-05-31 16:22:18 | 
| Message-ID: | CABUevExShf2WWhmY6W7HSEYX669sLkcEUWLDPnzHpUamPaDUXA@mail.gmail.com | 
| Lists: | pgsql-general pgsql-hackers | 
On Wed, May 31, 2017 at 12:20 AM, Ludovic Vaugeois-Pepin <
ludovicvp(at)gmail(dot)com> wrote:
> On Tue, May 30, 2017 at 9:32 PM, Magnus Hagander <magnus(at)hagander(dot)net>
> wrote:
> > On Tue, May 30, 2017 at 9:14 PM, Ludovic Vaugeois-Pepin
> > <ludovicvp(at)gmail(dot)com> wrote:
> >>
> >> I ran into the issue described below with 10.0 beta. The error I got is:
> >>
> >> pg_basebackup: could not create temporary replication slot
> >> "pg_basebackup_2194": ERROR:  replication slot "pg_basebackup_2194"
> >> already exists
> >>
> >> A race condition? Or maybe I am doing something wrong.
> >>
> >>
> >>
> >>
> >>
> >> Release:
> >>     Name        : postgresql10-server
> >>     Version     : 10.0
> >>     Release     : beta1PGDG.rhel7
> >>
> >>
> >> Test Type:
> >>     Functional testing of a pacemaker resource agent
> >> (https://github.com/ulodciv/pgha)
> >>
> >>
> >> Test Detail:
> >>     During context/environment setup, pg_basebackup is invoked (in
> >> parallel) from multiple virtual machines. The backups are then started
> >> as asynchronously replicated hot standbys.
> >>
> >>
> >> Platform:
> >>     Centos 7.3
> >>
> >>
> >> Installation Method:
> >>     yum -y install https://download.postgresql.org/pub/repos/yum/testing/10/redhat/rhel-7-x86_64/pgdg-redhat10-10-1.noarch.rpm
> >>     yum -y install postgresql10-server postgresql10-contrib
> >>
> >>
> >> Platform Detail:
> >>
> >>
> >> Test Procedure:
> >>
> >>     Have pg_basebackup run simultaneously on multiple hosts against
> >> the same instance eg:
> >>
> >>         pg_basebackup -h test4 -p 5432 -D /var/lib/pgsql/10/data -U repl1 -Xs
> >>
> >>
> >> Failure?
> >>
> >> E               deploylib.deployer_error.DeployerError:
> >> postgres(at)test5: got exit status 1 for:
> >> E               pg_basebackup -h test4 -p 5432 -D
> >> /var/lib/pgsql/10/data -U repl1 -Xs
> >> E               stderr: pg_basebackup: could not create temporary
> >> replication slot "pg_basebackup_2194": ERROR:  replication slot
> >> "pg_basebackup_2194" already exists
> >> E               pg_basebackup: child process exited with error 1
> >> E               pg_basebackup: removing data directory
> >> "/var/lib/pgsql/10/data"
> >>
> >>
> >> Test Results:
> >>
> >>
> >> Comments:
> >>     This seems to be new with 10. I recently began testing the
> >> pacemaker resource agent against PG 10. I never had (or noticed) this
> >> failure with 9.6.1 and 9.6.2.
> >
> >
> > Hah, that's an interesting failure. In the name of the slot, the 2194 comes
> > from the pid -- but it's the pid of pg_basebackup.
> >
> > I assume you're not running the two pg_basebackup processes on the same
> > machine? Is it predictable when this happens (meaning that the pid value is
> > actually predictable), or do you have to run it a large number of times
> > before it happens?
>
>
> Indeed, I run it from two VMs that were created from the same .ova
> (packaged VM).
> I ran into this once, however I have been running tests on 10.0 for a
> couple of days or so.
>
> My guess is that the two hosts ended up using the same pid when
> running the backup.
>
Moving this one over to -hackers to discuss the fix, as this is clearly an
issue.
Right now, pg_basebackup uses the pid of the *client* process to
generate its ephemeral slot name. Per this report, that can
definitely be a problem.
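
To make the collision concrete, here is a minimal Python sketch. It is not pg_basebackup's actual C code: the naming pattern pg_basebackup_<pid> is taken from the error message in the report, while temp_slot_name, create_slot and server_slots are hypothetical names that only model the server's uniqueness check.

```python
def temp_slot_name(client_pid: int) -> str:
    # Naming pattern taken from the reported error: pg_basebackup_<pid>
    return f"pg_basebackup_{client_pid}"


def create_slot(existing: set, name: str) -> None:
    # Hypothetical stand-in for the server refusing a duplicate slot name
    if name in existing:
        raise ValueError(f'replication slot "{name}" already exists')
    existing.add(name)


server_slots = set()

# Two different client hosts whose pg_basebackup processes share pid 2194
create_slot(server_slots, temp_slot_name(2194))      # first host succeeds
try:
    create_slot(server_slots, temp_slot_name(2194))  # second host collides
    collided = False
except ValueError as err:
    collided = True
    print(err)
```

Client pids are only unique per host, so two VMs cloned from the same image can plausibly end up with the same value.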
One of my first thoughts would be to instead use the pid of the *server* to
do that, as this will be guaranteed to be unique. However, the client can't
access the pid of the server as it is now, and it's the client that has to
create the name.
One way to do that would be to include the pid of the walsender backend in
the reply to IDENTIFY_SYSTEM, and then use that. What do people think of
that idea?
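
As a rough illustration of why a server-side pid would avoid the clash, here is a toy Python model. It does not speak the replication protocol, and ToyServer, connect and slot_name_from_server_pid are hypothetical names invented for this sketch. The only assumption it encodes is that each concurrent connection to one server is handled by a process with a distinct pid.

```python
import itertools


class ToyServer:
    """Toy stand-in for one server: each connection gets a distinct backend pid."""

    def __init__(self, first_pid: int = 4000):
        self._pids = itertools.count(first_pid)
        self.slots = set()

    def connect(self) -> int:
        # Models the server reporting the pid of the backend serving this connection
        return next(self._pids)

    def create_temp_slot(self, name: str) -> None:
        if name in self.slots:
            raise ValueError(f'replication slot "{name}" already exists')
        self.slots.add(name)


def slot_name_from_server_pid(server_pid: int) -> str:
    # Same naming pattern, but keyed on the server-side pid
    return f"pg_basebackup_{server_pid}"


server = ToyServer()
names = []
for _client_pid in (2194, 2194):  # both clients share the same client pid
    backend_pid = server.connect()
    name = slot_name_from_server_pid(backend_pid)
    server.create_temp_slot(name)  # no collision: backend pids differ
    names.append(name)
```

The client still builds the name itself; it just needs the server to tell it which pid to use.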
Other suggestions?
I will add this to the 10.0 open items list.
-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/