Re: Standby trying "restore_command" before local WAL

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: David Steele <david(at)pgmasters(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Emre Hasegeli <emre(at)hasegeli(dot)com>, Sergei Kornilov <sk(at)zsrv(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "berge(at)trivini(dot)no" <berge(at)trivini(dot)no>, Gürkan Gür <ben(at)gurkan(dot)in>, Raimund Schlichtiger <raimund(dot)schlichtiger(at)innogames(dot)com>, Bernhard Schrader <bernhard(dot)schrader(at)innogames(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Vik Fearing <vik(at)2ndquadrant(dot)fr>
Subject: Re: Standby trying "restore_command" before local WAL
Date: 2018-08-07 15:57:07
Message-ID: 4e22f312-50bb-3103-e114-10498afc9342@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/07/2018 05:42 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
>> On 08/06/2018 09:32 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas(dot)vondra(at)2ndquadrant(dot)com) wrote:
>>>> On 08/06/2018 06:11 PM, Stephen Frost wrote:
>>> WAL checksums are per WAL record, not across the whole file... And,
>>> yes, this seems like something we could probably write code to ensure
>>> we're doing correctly, possibly even without changing the existing WAL
>>> format or maybe we would have to, but either way, seems like additional
>>> code that would need to be written and some serious thought put into it
>>> to make sure it's correct. I'm all for that, just to be clear, but what
>>> I don't think we can do is move forward with a change to just prefer
>>> pg_wal over restore command.
>>
>> While the checksums are per-record, the record also contains xl_prev, so
>> it's effectively a chain of checksums covering the whole file. The other
>> thing you can do is look at the first record of the next segment and use the
>> xl_prev to detect if the previous segment was not partial.
>>
>> But I do agree doing this properly may require writing some new code and
>> checking that those cases are detected correctly.
>
> Right, might be possible but isn't something we necessairly guarantee
> will always work correctly today in this situation where we've got old
> WAL files that were pulled over a period of time (imagine that we copy
> WAL file ABC before PG was done with it, but we don't get around to
> copying WAL file DEF until much later after ABC has been completed and
> DEF is only partial, or other such scenarios). Remember, the scenario
> we're talking about here is where someone did a pg_start_backup, took
> their time copying all the files in PGDATA, including pg_wal, and then
> at some later point ran pg_stop_backup. We have no idea when those WAL
> files were copied and they could have been anywhere between the
> pg_start_backup point and the pg_stop_backup point.
>
>> (Note: There was a discussion about replacing xl_prev with LSN of the
>> current record, and IMHO that would work just as well, although it might be
>> a bit more expensive for some operations.)
>
> I haven't thought very much about this so don't have much of an opinion
> on it one way or the other at this point.
>
>>> CRC's are per WAL record, and possibly some WAL records might not be ok
>>> to replay, or at least we need to make sure that we replay the right set
>>> of WAL in the right order even when there are partial WAL files being
>>> given to PG (that aren't named that way...). The more I think about
>>> this, I think we really need to avoid partial WAL files entirely- what
>>> are we going to do when we get to the end of one? We'd need to request
>>> the full one from the restore command anyway, so seems like we should
>>> just go ahead and get it from the archive, the question is if there's an
>>> easy/cheap way to detect partial WAL files in pg_wal.
>>
>> As explained above, I don't think this is actually a problem. The checksums
>> do cover the whole file thanks to chaining, and there are ways to detect
>> partial segments. IMHO it's fine if we replay a segment and then find out it
>> was partial and that we need to fetch it from archive anyway and re-apply it
>> - it should not be very common case, except when the user does something
>> silly.
>
> As long as we *do* go off and try to fetch that WAL file and replay it,
> and don't assume that the end of that partial WAL file means the end of
> WAL replay, then I think you may be right and that it'd be fine, but it
> does seem a bit risky to me.
>
>>> As I mention above, seems like what we should really be doing is having
>>> a way to know when a WAL file is full but still in pg_wal for some
>>> reason (hasn't been replayed yet due to intential lag on the replica, or
>>> unintentional lag on the replica, etc), and perhaps we even only do that
>>> on replicas or have tools like pg_basebackup and pgbackrest do that when
>>> they're doing a restore, so that it's clear to PG that all these files
>>> are full WAL and they can replay them completely.
>>
>> IMHO we can deduce if from first xl_prev of the next segment (assuming we
>> have the next segment available locally, which we likely have except for the
>> last one).
>
> See above for why I don't think it's actually that simple..
>
>>> I was actually just thinking about having pgbackrest do exactly that. :)
>>> We already have the full sha256 checksum of every WAL file in the
>>> pgbackrest repository, it seems like it wouldn't be too hard to
>>> calculate the checksum of the files in pg_wal and ask the repo what the
>>> checksums are for those WAL files and then use the files in pg_wal if
>>> they checksums match. I can already imagine the argument coming back
>>> though- that'd require additional testing to ensure it's done correctly
>>> and I'm really not sure that it'd end up being all that much better
>>> except in some very special circumstances where there's a low bandwidth
>>> connection to the repo and often a lot of WAL left over in pg_wal.
>>
>> I don't think it can be done in an external tool entirely, at least not
>> using restore_command alone. We remove the local segment before even
>> invoking restore_command, so it's too late to check checksums etc.
>
> Wasn't this entire discussion started because we were calling
> restore_command first instead of reading from pg_wal..? Or do we
> actively go wipe out pg_wal before doing that (I didn't think we did,
> but I could certainly be wrong on that point).
>

That's how I read this part of RestoreArchivedFile:

https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogarchive.c#L110

The very first thing it does is checking if the local file exists, and
if it does then calling unlink().

>> We'd need invoking some other command before restore_command, which would do
>> this comparison and decide whether to use local or remote WAL.
>
> Ugh, that sounds like a lot of additional complication that I'm not sure
> would be all that useful or sensible for this particular case, if that's
> what it would require. I'd rather we try to figure out some way to get
> rid of restore_command entirely instead of bolting on more things around
> it.
>

The simpler the better of course.

All I'm saying is that (assuming my understanding of RestoreArchivedFile
is correct) we can't just do that in the current restore_command. We do
need a way to ask the archive for some metadata/checksums, and
restore_command is too late.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2018-08-07 16:05:08 Re: Standby trying "restore_command" before local WAL
Previous Message Stephen Frost 2018-08-07 15:42:04 Re: Standby trying "restore_command" before local WAL