Re: Need Force flag for pg_drop_replication_slot()

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Need Force flag for pg_drop_replication_slot()
Date: 2015-05-29 17:53:30
Message-ID: 5568A79A.2070400@agliodbs.com
Lists: pgsql-hackers

On 05/29/2015 10:45 AM, Stephen Frost wrote:
> Andres,
>
> * Andres Freund (andres(at)anarazel(dot)de) wrote:
>> On 2015-05-29 10:15:56 -0700, Josh Berkus wrote:
>>> pg_drop_replication_slot() can be a time-critical function when the
>>> master is running out of disk space because the replica is falling
>>> behind.
>>
>> I don't buy this argument. The same is true for DROP TABLE, TRUNCATE,
>> DROP DATABASE etc.
>
> I disagree about that being the same.
>
>> I mean, I agree it'd be convenient, but I can't see it as "critical".

So, here's the scenario:

1. You're almost out of disk space due to a replica falling behind, like
down to 16MB left. Or maybe you are already out of disk space.

2. You need to drop the laggy replication slots in a hurry to get your
master working again.

3. Now you have to do this timing-sensitive two-stage drop to make it work.

When our users are having production emergencies, I don't think that
it's helpful for us to make the process of getting out of those
situations more complicated than it absolutely has to be.

> Just a random thought- do we check the LOGIN attribute for replication
> connections? If so, you could tweak that, but that may be an issue if
> you have multiple replicas using the same role.
>
> I'm not sure that it's *critical*, but I could see an argument for
> adding this post-feature-freeze, which I'm guessing is what Josh was
> getting at.

Well, I'll let others decide that. If we could come up with a script
which would reliably do the terminate-then-drop, it would be fine for
9.5. I'm not sure that's possible though, because I don't see any way
to infallibly relate the pg_stat_replication entry with the
pg_replication_slots entry. Imagine having 3 slots and 6 replicas, and
only one slot is behind; how do you figure out what to terminate?
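
A minimal sketch of what such a terminate-then-drop script might look like, under one loud assumption: that the server's pg_replication_slots view exposes an active_pid column identifying the backend that holds the slot, which is exactly the link this message says is missing. The names terminate_then_drop, retries and delay are hypothetical, and cur is assumed to be a DB-API cursor using %s parameters (psycopg2 style) on an autocommit connection. It retries rather than being atomic, so a replica that reconnects fast enough can still win the race.

```python
import time


def terminate_then_drop(cur, slot_name, retries=5, delay=0.1):
    """Terminate whatever holds the slot, then drop it. Returns True on success.

    Assumes pg_replication_slots has an active_pid column (an assumption,
    not something the thread confirms) and an autocommit connection, so a
    failed drop does not leave an aborted transaction behind.
    """
    for _ in range(retries):
        cur.execute(
            "SELECT active_pid FROM pg_replication_slots WHERE slot_name = %s",
            (slot_name,),
        )
        row = cur.fetchone()
        if row is None:
            return False  # no such slot, nothing to drop

        active_pid = row[0]
        if active_pid is not None:
            # Slot is in use: ask that backend to exit, then re-check,
            # because termination is asynchronous.
            cur.execute("SELECT pg_terminate_backend(%s)", (active_pid,))
            time.sleep(delay)
            continue

        try:
            cur.execute("SELECT pg_drop_replication_slot(%s)", (slot_name,))
            return True
        except Exception:
            # Broad catch for the sketch: most likely the replica
            # reconnected between the check and the drop. Try again.
            time.sleep(delay)
    return False
```

With 3 slots and 6 replicas, this only works if the slot itself names the backend to terminate; matching through pg_stat_replication alone would still be guesswork.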

>
>>> While I'm just doing this during testing, it could be a critical fail in
>>> production. I think the simplest way to resolve this would be to add a
>>> boolean flag to pg_drop_replication_slot(), which would terminate the
>>> replication connection and delete the slot as a single operation.
>>
>> There's no "single operation" for terminating a backend *and* doing
>> something...
>
> That's a good point, we'd need to figure out how to make this actually
> work reliably in the face of a very fast reconnecting process, if we're
> going to do it.

Yeah, which means that this is probably something for 9.6. Although if
we can at least come up with something for the documentation for 9.5, it
would be really helpful.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
