From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | "Anton A(dot) Melnikov" <aamelnikov(at)inbox(dot)ru> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" |
Date: | 2023-03-03 21:39:20 |
Message-ID: | CA+hUKG+jig+QdBETj_ab++VWSoJjbwT3sLNCnk0wFsY_6tRqoA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Feb 24, 2023 at 11:12 PM Anton A. Melnikov <aamelnikov(at)inbox(dot)ru> wrote:
> On 17.02.2023 06:21, Thomas Munro wrote:
> > BTW there are at least two other places where PostgreSQL already knows
> > that concurrent reads and writes are possibly non-atomic (and we also
> > don't even try to get the alignment right, making the question moot):
> > pg_basebackup, which enables full_page_writes implicitly if you didn't
> > have the GUC on already, and pg_rewind, which refuses to run if you
> > haven't enabled full_page_writes explicitly (as recently discussed on
> > another thread recently; that's an annoying difference, and also an
> > annoying behaviour if you know your storage doesn't really need it!)
>
> It seems a good topic for a separate thread patch. Would you provide a
> link to the thread you mentioned please?
https://www.postgresql.org/message-id/flat/367d01a7-90bb-9b70-4cda-248e81cc475c%40cosium.com
> > Therefore, we need a solution for Windows too. I tried to write the
> > equivalent code, in the attached. I learned a few things: Windows
> > locks are mandatory (that is, if you lock a file, reads/writes can
> > fail, unlike Unix). Combined with async release, that means we
> > probably also need to lock the file in xlog.c, when reading it in
> > xlog.c:ReadControlFile() (see comments). Before I added that, on a CI
> > run, I found that the read in there would sometimes fail, and adding
> > the locking fixed that. I am a bit confused, though, because I
> > expected that to be necessary only if someone managed to crash while
> > holding the write lock, which the CI tests shouldn't do. Do you have
> > any ideas?
>
> Unfortunately, no ideas so far. Was it a pg_control CRC or I/O errors?
> Maybe logs of such a fail were saved somewhere? I would like to see
> them if possible.
I think it was this one:
https://cirrus-ci.com/task/5004082033721344
For example, see subscription/011_generated which failed like this:
2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC: could not
read file "global/pg_control": Permission denied
That was fixed after I changed it to also do locking in xlog.c
ReadControlFile(), in the version you tested. There must be something
I don't understand going on, because that cluster wasn't even running
before: it had just been created by initdb.
But, anyway, I have a new idea that makes that whole problem go away
(though I'd still like to understand what happened there):
With the "pseudo-advisory lock for Windows" idea from the last email,
we don't need to bother with file system level locking in many places.
Just in controldata_utils.c, for FE/BE interlocking (that is, we don't
need to use that for interlocking of concurrent reads and writes that
are entirely in the backend, because we already have an LWLock that we
could use more consistently). Changes:
1. xlog.c mostly uses no locking
2. basebackup.c now acquires ControlFileLock
3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking
4. lock past the end (pseudo-advisory locking for Windows)
Note that when we recover from a basebackup or pg_backup_start()
backup, we use the backup label to find a redo start location in the
WAL (see read_backup_label()), BUT we still read the copied pg_control
file (one that might be too "new", so we don't use its redo pointer).
So it had better not be torn, or the recovery will fail. So, in this
version I protected that sendFile() with ControlFileLock. But...
Isn't that a bit strange? To go down this path we would also need to
document the need to copy the control file with the file locked to
avoid a rare failure, in the pg_backup_start() documentation. That's
annoying (I don't even know how to do it with easy shell commands;
maybe we should provide a program that locks and cats the file...?).
Could we make better use of the safe copy that we have in the log?
Then the pg_backup_start() subproblem would disappear. Conceptually,
that'd be just like the way we use FPI for data pages copied during a
backup. I'm not sure about any of that, though, it's just an idea,
not tested.
> > Or maybe the problem is/was too theoretical before...
>
> As far as i understand, this problem has always been, but the probability of
> this is extremely small in practice, which is directly pointed in
> the docs [4]:
> "So while it is theoretically a weak spot, pg_control does not seem
> to be a problem in practice."
I guess that was talking about power loss atomicity again? Knowledge
of the read/write atomicity problem seems to be less evenly
distributed (and I think it became more likely in Linux > 3.something;
and the Windows situation possibly hadn't been examined by anyone
before).
> > Here's a patch like that.
>
> In this patch, the problem is solved for the live database and
> maybe remains for some possible cases of an external backup. In a whole,
> i think it can be stated that this is a sensible step forward.
>
> Just like last time, i ran a long stress test under windows with current patch.
> There were no errors for more than 3 days even with fsync=off.
Thanks!
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Lock-pg_control-while-reading-or-writing.patch | text/x-patch | 6.2 KB |
v3-0002-Make-wal_sync_method-fdatasync-the-default-on-all.patch | text/x-patch | 6.0 KB |
v3-0003-Apply-wal_sync_method-to-pg_control-file.patch | text/x-patch | 5.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2023-03-03 22:08:32 | Re: Making empty Bitmapsets always be NULL |
Previous Message | Jacob Champion | 2023-03-03 21:38:05 | Re: zstd compression for pg_dump |