From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: base backup vs. concurrent truncation
Date: 2023-04-25 00:03:40
Message-ID: 20230425000340.uq2c43diuyfpr4au@awork3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2023-04-21 09:42:57 -0400, Robert Haas wrote:
> Apologies if this has already been discussed someplace, but I couldn't
> find a previous discussion. It seems to me that base backups are
> broken in the face of a concurrent truncation that reduces the number
> of segments in a relation.
I think
https://www.postgresql.org/message-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
and the commits + discussions referenced from there are relevant to this topic.
> Suppose we have a relation that is 1.5GB in size, so that we have two
> files 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back
> those files up in whichever order the directory scan finds them.
> Suppose we back up 23456.1 first. Then the relation is truncated to
> 0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we
> back up the file 23456. Now our backup contains files 23456 and
> 23456.1, each 0.5GB. But this breaks the invariant in md.c:
> * On disk, a relation must consist of consecutively numbered segment
> * files in the pattern
> * -- Zero or more full segments of exactly RELSEG_SIZE blocks each
> * -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
> * -- Optionally, any number of inactive segments of size 0 blocks.
>
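To make that invariant concrete, here is a toy checker (a minimal sketch
assuming the default RELSEG_SIZE of 131072 8kB blocks, i.e. 1GB segments;
layout_is_valid() is made up for illustration, not actual md.c code):

/* invariant_check.c -- toy checker for the md.c segment-layout invariant */
#include <stdbool.h>
#include <stdio.h>

#define RELSEG_SIZE 131072      /* blocks per segment in a default build */

/*
 * Returns true iff segsizes[0..nsegs-1] (segment sizes in blocks) match
 * the invariant: zero or more full segments, then exactly one partial
 * segment, then optionally any number of zero-block inactive segments.
 */
static bool
layout_is_valid(const int *segsizes, int nsegs)
{
    int     i = 0;

    while (i < nsegs && segsizes[i] == RELSEG_SIZE)
        i++;                    /* skip the full segments */
    if (i == nsegs)
        return false;           /* the one partial segment is missing */
    i++;                        /* consume the partial segment */
    while (i < nsegs && segsizes[i] == 0)
        i++;                    /* trailing inactive segments */
    return i == nsegs;
}

int
main(void)
{
    int     healthy[] = {RELSEG_SIZE, RELSEG_SIZE / 2};     /* 1.5GB relation */
    int     broken[] = {RELSEG_SIZE / 2, RELSEG_SIZE / 2};  /* the bad backup */

    printf("healthy: %s\n", layout_is_valid(healthy, 2) ? "ok" : "violated");
    printf("broken: %s\n", layout_is_valid(broken, 2) ? "ok" : "violated");
    return 0;
}

The backup described above produces the second layout: two 0.5GB segments,
i.e. two partial segments, which is exactly what the invariant forbids.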
> basebackup.c's theory about relation truncation is that it doesn't
> really matter because WAL replay will fix things up. But in this case,
> I don't think it will, because WAL replay relies on the above
> invariant holding. As mdnblocks says:
>
> /*
> * If segment is exactly RELSEG_SIZE, advance to next one.
> */
> segno++;
>
> So I think what's going to happen is we're not going to notice 23456.1
> when we recover the backup. It will just sit there as an orphaned file
> forever, unless we extend 23456 back to a full 1GB, at which point we
> might abruptly start considering that file part of the relation again.
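To see why, here is a toy version of that scan (a sketch only; the real
mdnblocks() opens segment files through the smgr/fd.c layer, this just
walks an array of segment sizes):

/* mdnblocks_toy.c -- how the segment scan misses 23456.1 after restore */
#include <stdio.h>

#define RELSEG_SIZE 131072      /* blocks per 1GB segment, default build */

/*
 * Add up segment sizes the way mdnblocks() does: a segment of exactly
 * RELSEG_SIZE implies a successor, anything shorter ends the relation.
 * Segments past the first partial one are never examined.
 */
static long
toy_nblocks(const int *segsizes, int nsegs)
{
    long    nblocks = 0;

    for (int segno = 0; segno < nsegs; segno++)
    {
        nblocks += segsizes[segno];
        if (segsizes[segno] < RELSEG_SIZE)
            return nblocks;     /* partial segment: relation ends here */
        /* segment is exactly RELSEG_SIZE, advance to next one */
    }
    return nblocks;
}

int
main(void)
{
    /* the restored backup: 23456 and 23456.1 are both 0.5GB */
    int     restored[] = {RELSEG_SIZE / 2, RELSEG_SIZE / 2};

    printf("visible blocks: %ld (23456.1 is never opened)\n",
           toy_nblocks(restored, 2));
    return 0;
}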
One important point is that I think 23456.1 can still contain data at that
point, which can lead to deleted rows reappearing, vacuum failing due to rows
from before relfrozenxid existing, etc.
> Assuming I'm not wrong about all of this, the question arises: whose
> fault is this, and what to do about it? It seems to me that it's a bit
> hard to blame basebackup.c, because if you used pg_backup_start() and
> pg_backup_stop() and copied the directory yourself, you'd have exactly
> the same situation, and while we could (and perhaps should) teach
> basebackup.c to do something smarter, it doesn't seem realistic to
> impose complex constraints on the user's choice of file copy tool.
Agreed.
> So I think the problem is with md.c assuming that its invariant must
> hold on a cluster that's not guaranteed to be in a consistent state.
> But mdnblocks() clearly can't try to open every segment up to whatever
> the maximum theoretical possible segment number is every time it's
> invoked, because that would be wicked expensive. An idea that occurs
> to me is to remove all segment files following the first partial
> segment during startup, before we begin WAL replay. If that state
> occurs at startup, then either we have a scenario involving
> truncation, like those above, or a scenario involving relation
> extension, where we added a new segment and that made it to disk but
> the prior extension of the previous last segment file to maximum
> length did not. But in that case, WAL replay should, I think, fix
> things up. However, I'm not completely sure that there isn't some hole
> in this theory, and this way forward also doesn't sound particularly
> cheap. Nonetheless I don't have another idea right now.
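A rough sketch of that startup pass (everything here is hypothetical: the
function name, the path handling, and doing it with plain stat()/unlink()
rather than going through md.c/fd.c):

/* trim_orphans.c -- delete segment files after the first partial segment */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define RELSEG_SIZE_BYTES (1024L * 1024 * 1024)     /* 1GB segments */

static void
trim_after_partial(const char *relpath)
{
    bool    past_partial = false;

    for (int segno = 0;; segno++)
    {
        char        path[1024];
        struct stat st;

        if (segno == 0)
            snprintf(path, sizeof(path), "%s", relpath);
        else
            snprintf(path, sizeof(path), "%s.%d", relpath, segno);

        if (stat(path, &st) != 0)
            break;              /* no such file: end of the segment chain */

        if (past_partial)
        {
            /* everything after the first partial segment is an orphan */
            if (unlink(path) == 0)
                printf("removed orphan segment %s\n", path);
        }
        else if (st.st_size < RELSEG_SIZE_BYTES)
            past_partial = true;    /* first partial: relation ends here */
    }
}

int
main(void)
{
    /* hypothetical relation path; a real pass would scan the data dir */
    trim_after_partial("base/5/23456");
    return 0;
}

Note that this sketch stops at the first missing segment file, so orphans
beyond a gap in the numbering would still survive.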
What we've discussed somewhere in the past is to always truncate segment N+1
when creating the first page in segment N. I.e. if we extend into 23456.1, we
truncate 23456.2 to 0 blocks. As far as I can tell, that'd solve this issue?
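Something like the following, at the point where we first extend into a new
segment (a minimal sketch under those assumptions; the function name and path
handling are made up, and real code would go through the smgr/fd.c machinery):

/* on creating the first page of segment segno, truncate segno + 1 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static void
truncate_next_segment(const char *relpath, int segno)
{
    char    next[1024];

    /* e.g. extending into 23456.1 truncates a leftover 23456.2 */
    snprintf(next, sizeof(next), "%s.%d", relpath, segno + 1);

    if (truncate(next, 0) == 0)
        printf("truncated stale segment %s to 0 blocks\n", next);
    else if (errno != ENOENT)
        perror(next);           /* a missing successor is the normal case */
}

int
main(void)
{
    truncate_next_segment("base/5/23456", 1);   /* hypothetical path */
    return 0;
}

The idea being that a stale 23456.2 left behind by a fuzzy base backup is
neutralized before the relation can ever grow back into it.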
Greetings,
Andres Freund