Re: More race conditions in logical replication

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Subject: Re: More race conditions in logical replication
Date: 2017-07-25 17:42:23
Message-ID: 20170725174223.cf2z3venmka4gxta@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Petr Jelinek wrote:
> On 25/07/17 01:33, Alvaro Herrera wrote:
> > Alvaro Herrera wrote:
> >> I'm back at looking into this again, after a rather exhausting week. I
> >> think I have a better grasp of what is going on in this code now,
> >
> > Here's an updated patch, which I intend to push tomorrow.
>
> I think the ConditionVariablePrepareToSleep() call in
> ReplicationSlotAcquire() needs to be done inside the mutex lock
> otherwise there is tiny race condition where the process which has slot
> acquired might release the slot between the mutex unlock and the
> ConditionVariablePrepareToSleep() call which means we'll never get
> signaled and ConditionVariableSleep() will wait forever.

Hmm, yeah, that's not good. However, I didn't like the idea of putting
it inside the locked area, as it's too much code. Instead I added just
before acquiring the spinlock. We cancel the sleep unconditionally
later on if we didn't need to sleep after all.

As I see it, we need to backpatch at least parts of this patch. I've
received reports that in earlier branches pglogical and BDR can
sometimes leave slots behind when removing nodes, and I have a hunch
that it can be explained by the bugs being fixed here. Now we cannot
use condition variables in back-branches, so we'll need to figure out
how to actually do it ...

> As a side note, the ConditionVariablePrepareToSleep()'s comment could be
> improved because currently it says the only advantage is that we skip
> double-test in the beginning of ConditionVariableSleep(). But that's not
> true, it's essential for preventing race conditions like the one above
> because it puts the current process into waiting list so we can be sure
> it will be signaled on broadcast once ConditionVariablePrepareToSleep()
> has been called.

Hmm.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joshua D. Drake 2017-07-25 17:44:55 Re: Create language syntax is not proper in pg_dumpall and not working using pg_upgrade
Previous Message Tom Lane 2017-07-25 17:41:34 Re: Testlib.pm vs msys