From: | Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com> |
Subject: | Re: More race conditions in logical replication |
Date: | 2017-07-06 16:20:39 |
Message-ID: | 59c6012e-1d78-dca3-339c-be67fd166d6d@2ndquadrant.com |
Lists: | pgsql-hackers |
On 06/07/17 17:33, Petr Jelinek wrote:
> On 03/07/17 01:54, Tom Lane wrote:
>> I noticed a recent failure that looked suspiciously like a race condition:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
>>
>> The critical bit in the log file is
>>
>> error running SQL: 'psql:<stdin>:1: ERROR: could not drop the replication slot "tap_sub" on publisher
>> DETAIL: The error was: ERROR: replication slot "tap_sub" is active for PID 3866790'
>> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION tap_sub' at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
>>
>> After poking at it a bit, I found that I can cause several different
>> failures of this ilk in the subscription tests by injecting delays at
>> the points where a slot's active_pid is about to be cleared, as in the
>> attached patch (which also adds some extra printouts for debugging
>> purposes; none of that is meant for commit). It seems clear that there
>> is inadequate interlocking going on when we kill and restart a logical
>> rep worker: we're trying to start a new one before the old one has
>> gotten out of the slot.
>>
>
> Thanks for the test case.
>
> It's not actually that we start the new worker too fast. It's that we try
> to drop the slot right after the worker process was killed, but if the code
> that clears the slot's active_pid takes too long, the slot still looks like
> it's in use. I am quite sure it's possible to make this happen for physical
> replication as well when using slots.
>
> This is not something that can be solved by locking on the subscriber. ISTM
> we need to make pg_drop_replication_slot behave more nicely, for example by
> making it wait for the slot to become available (either by default or as an
> option).
>
> I'll have to think about how to do it without rewriting half of the
> replication slot code or reimplementing a lock queue, though, because
> replication slots don't use normal catalog access, so there is no object
> locking with a wait queue. We could use a latch wait with a small timeout,
> but that seems ugly, as that function can be called by a user without
> having dropped the slot before, so the wait can be quite long (as in
> "forever").
>
A naive fix would be something like the attached patch. But as I said, it's
not exactly pretty.
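
For readers without the attachment, here is a rough illustration of the "wait for the slot to become free" idea. It is a toy Python simulation, not PostgreSQL code and not the contents of the patch: the `Slot` class, `SlotBusyError`, and the `wait`/`timeout` parameters are hypothetical names, and a Python condition variable stands in for whatever wait mechanism the server would actually use. Only `active_pid`, the slot name, and the error wording come from the thread.

```python
import threading


class SlotBusyError(Exception):
    """Hypothetical error: the slot is still marked as in use."""


class Slot:
    """Toy stand-in for a replication slot (not the real data structure)."""

    def __init__(self, name, active_pid=0):
        self.name = name
        self.active_pid = active_pid  # 0 means no process holds the slot
        self.dropped = False
        self.cond = threading.Condition()

    def release(self):
        # What an exiting worker does: clear active_pid and wake any waiters.
        with self.cond:
            self.active_pid = 0
            self.cond.notify_all()

    def drop(self, wait=False, timeout=None):
        with self.cond:
            if self.active_pid != 0 and wait:
                # Sleep until the holder lets go or the timeout expires.
                # timeout=None waits indefinitely (the "forever" case).
                self.cond.wait_for(lambda: self.active_pid == 0, timeout)
            if self.active_pid != 0:
                raise SlotBusyError(
                    f'replication slot "{self.name}" is active '
                    f"for PID {self.active_pid}"
                )
            self.dropped = True


def demo():
    results = []
    # The worker was killed but has not yet cleared active_pid.
    slot = Slot("tap_sub", active_pid=3866790)

    # Current behaviour: an immediate drop fails.
    try:
        slot.drop(wait=False)
        results.append("dropped")
    except SlotBusyError:
        results.append("busy")

    # Waiting behaviour: the worker clears active_pid a little later,
    # and the drop succeeds once it has done so.
    threading.Timer(0.1, slot.release).start()
    slot.drop(wait=True, timeout=5.0)
    results.append("dropped" if slot.dropped else "busy")
    return results
```

In this sketch, `demo()` first hits the same "is active for PID" error as the buildfarm log and then succeeds once the simulated worker releases the slot. Passing `timeout=None` shows the downside mentioned above: if nothing ever releases the slot, the caller waits indefinitely.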
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-Wait-for-slot-to-become-free-in-when-dropping-it.patch | text/x-patch | 6.1 KB |