From: | Justin Pryzby <pryzby(at)telsasoft(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | vignesh C <vignesh21(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Cygwin cleanup |
Date: | 2023-01-12 04:39:49 |
Message-ID: | 20230112043949.GQ9837@telsasoft.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sat, Jan 07, 2023 at 12:39:11AM +1300, Thomas Munro wrote:
> On Wed, Nov 9, 2022 at 2:04 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > +data_sync_retry = on
>
> Sharing with the list some clues that Justin and I figured out about
> what that part is doing. Without it, you get failures like:
>
> PANIC: could not open file "pg_logical/snapshots/0-14FE6B0.snap":
> No such file or directory
>
> That's been seen before:
>
> https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us
>
> That thread concluded that the operating system must have a non-atomic
> rename(), ie a kernel bug. I don't know why Cygwin would display that
> behaviour and our native Windows build not; maybe timing, or maybe our
> own open() and rename() wrappers for Windows do something important
> differently than Cygwin's open() and rename().
>
> On reflection, that seems a bit too flimsy to have in-tree without
> more investigation, which I won't have time for myself, so I'm going
> to withdraw this entry.
Not so fast :)
Here's my latest copy of the patch. Most recently, rather than setting
data_sync_retry=no, I changed to call fsync_fname_ext() rather than
fsync_fname(), which uses PANIC (except when data_sync_retry is
disabled). That seems to work, showing that the problem is limited to
SnapBuildSerialize(), and not a problem with all fsync()...
https://cirrus-ci.com/task/5990885733695488
Thomas raised a good question, which was how the tests were passing when
SnapBuildSerialize() was raising an error, which is what it would've
been doing when I used data_sync_retry=no.
So .. why is wal_sync_method being used to control fsync for things
other than WAL?
See 6dc7760ac (c. 2005) which added wal_fsync_writethrough, at which
point (since 9b178555f, c. 2004) wal_sync_method was already being used
for SLOG.
Now, it's also being used for logical decoding (since b89e1510 and
858ec1185, c. 2014) in rewriteheap.c/snapbuild.c. And pidfiles (since
ee0e525bf, 2010). And the control file (8b938d36f7, 2019). Note that
data_sync_retry wasn't added until 9ccdd7f66 (c. 2018)
It looks like logical decoding may be the "most wrong" place that
wal_sync_method is being used, so maybe my change is reasonable to
consider, and not just a workaround.
I'm going to re-open the CF entry to let this run for a while to see how
it works out.
--
Justin
Attachment | Content-Type | Size |
---|---|---|
0001-WIP-CI-support-for-Cygwin.patch | text/x-diff | 9.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2023-01-12 04:42:11 | Re: Spinlock is missing when updating two_phase of ReplicationSlot |
Previous Message | Peter Smith | 2023-01-12 04:23:49 | Re: Perform streaming logical transactions by background workers and parallel apply |