Re: Cygwin cleanup

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: vignesh C <vignesh21(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Cygwin cleanup
Date: 2023-01-12 04:39:49
Message-ID: 20230112043949.GQ9837@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jan 07, 2023 at 12:39:11AM +1300, Thomas Munro wrote:
> On Wed, Nov 9, 2022 at 2:04 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > +data_sync_retry = on
>
> Sharing with the list some clues that Justin and I figured out about
> what that part is doing. Without it, you get failures like:
>
> PANIC: could not open file "pg_logical/snapshots/0-14FE6B0.snap":
> No such file or directory
>
> That's been seen before:
>
> https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us
>
> That thread concluded that the operating system must have a non-atomic
> rename(), ie a kernel bug. I don't know why Cygwin would display that
> behaviour and our native Windows build not; maybe timing, or maybe our
> own open() and rename() wrappers for Windows do something important
> differently than Cygwin's open() and rename().
>
> On reflection, that seems a bit too flimsy to have in-tree without
> more investigation, which I won't have time for myself, so I'm going
> to withdraw this entry.

Not so fast :)

Here's my latest copy of the patch. Most recently, rather than setting
data_sync_retry=no, I changed to call fsync_fname_ext() rather than
fsync_fname(), which uses PANIC (except when data_sync_retry is
disabled). That seems to work, showing that the problem is limited to
SnapBuildSerialize(), and not a problem with all fsync()...

https://cirrus-ci.com/task/5990885733695488

Thomas raised a good question, which was how the tests were passing when
SnapBuildSerialize() was raising an error, which is what it would've
been doing when I used data_sync_retry=no.

So .. why is wal_sync_method being used to control fsync for things
other than WAL?

See 6dc7760ac (c. 2005) which added wal_fsync_writethrough, at which
point (since 9b178555f, c. 2004) wal_sync_method was already being used
for SLOG.

Now, it's also being used for logical decoding (since b89e1510 and
858ec1185, c. 2014) in rewriteheap.c/snapbuild.c. And pidfiles (since
ee0e525bf, 2010). And the control file (8b938d36f7, 2019). Note that
data_sync_retry wasn't added until 9ccdd7f66 (c. 2018)

It looks like logical decoding may be the "most wrong" place that
wal_sync_method is being used, so maybe my change is reasonable to
consider, and not just a workaround.

I'm going to re-open the CF entry to let this run for a while to see how
it works out.

--
Justin

Attachment Content-Type Size
0001-WIP-CI-support-for-Cygwin.patch text/x-diff 9.4 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2023-01-12 04:42:11 Re: Spinlock is missing when updating two_phase of ReplicationSlot
Previous Message Peter Smith 2023-01-12 04:23:49 Re: Perform streaming logical transactions by background workers and parallel apply