Re: BUG #16750: VACUUM sporadically fails to access pgstat on Cygwin

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16750: VACUUM sporadically fails to access pgstat on Cygwin
Date: 2020-11-29 06:00:00
Message-ID: dc97c060-59be-c69f-ff57-ef0c1bf5db5e@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

29.11.2020 08:00, PG Bug reporting form wrote:
> pg_regress/vacuum WARNING: could not open statistics file
> "/cygdrive/w/lorikeet/REL_12_STABLE/global.stat": Permission denied
> pg_regress/create_index WARNING: could not open statistics file
> "/cygdrive/w/lorikeet/REL_13_STABLE/global.stat": Device or resource busy
>
> The issue with the "Permission denied" error is already resolved for Windows
> (see bug #16161), but cygwin still suffers from it.

As I have found out, it happens when multiple VACUUM processes access
the pgstat statistics. There can be the autovacuum process or just
several VACUUM's executed simultaneously, i.e. we have the following
scenario:
The first process initiates a call to rename() in PgstatCollectorMain,
via pgstat_vacuum_stat() -> backend_read_statsfile() ->
pgstat_send_inquiry() | PgstatCollectorMain: pgstat_recv_inquiry() ->
pending_write_requests -> pgstat_write_statsfiles() -> rename(tmpfile,
statfile).
The second proccess calls AllocateFile()/fopen(), via
backend_read_statsfile() -> pgstat_read_db_statsfile_timestamp().
Also there can be third process that calls backend_read_statsfile() and
thus fopen(), too.

So one process (PgstatCollectorMain) tries to rename
pg_stat_tmp/global.tmp to pg_stat_tmp/global.stat and another process
(backend) tries to open it, and yet another one can have it open.

The important fact is that rename() in cygwin() is not atomic at all,
and it could be performed in two ways [1]. First, cygwin tries to
execute IRP_MJ_SET_INFORMATION for global.tmp with "Type:
SetRenameInformationFile, ReplaceIfExists: True, FileName: global.stat".
(See LogFile-SharingViolation1.CSV attached.)
If it fails with ACCESS_DENIED (because global.stat is open by another
process, see line 51 of the log), it tries to replace file hard (and bad
for us) way:
IRP_MJ_CREATE for global.stat with "Desired Access: Delete, Disposition:
Open.." (see line 78) and then IRP_MJ_WRITE, IRP_MJ_SET_INFORMATION,
IRP_MJ_CLOSE...

So the first possibility to fail for the opening backend is to get EBUSY
error [1] (see line 89) . In this case we should just wait as we already
do in pgwin32_open().

The second possibility is to get the EACCES error due to the delete
pending state. But it's important to note that the global.stat file is
not deleted at the end, it's replaced by global.tmp. So stat() just
returns zero and in this case we shouldn't just return error (as in
pgwin32_open).

While fixing bug #16161 (when we added similar handling for the "delete
pending" state) we've also encountered an issue [2].
There was a delay for 1 sec added on each open(directory) call as
Windows returns ERROR_ACCESS_DENIED for such calls.
Fortunately, cygwin doesn't do the same. It allows to open(directory)
without an error, as the attached Logfile-006_logical_decoding.CSV shows.
So the checking for stat() result is not necessary to prevent that behavior.
There can be still unnecessary delay for 1 second in case of actual file
permissions problems, but for now I see no code paths on cygwin where it
can be significant.

Please look at the proposed patch, that makes regress/vacuum test (and
an extra test with three vacuums in parallel) pass reliably.

Best regards,
Alexander

[1] https://cygwin.com/git?p=newlib-cygwin.git;a=blob;f=winsup/cygwin/syscalls.cc;h=885ca375ac2b9afaa8fb85cb3f51cbffe9996a3d;hb=HEAD#l2728
[2] https://www.postgresql.org/message-id/b2b8cc4c-b7d7-0d1b-f6ee-a02bbef5bdd0%40gmail.com
[16161] https://www.postgresql.org/message-id/flat/16161-7a985d2f1bbe8f71%40postgresql.org

Attachment Content-Type Size
Logfile-SharingViolation1.CSV text/csv 23.4 KB
Logfile-006_logical_decoding.CSV text/csv 9.1 KB
cygwin-open.patch text/x-patch 5.0 KB

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message James Coleman 2020-11-29 13:21:52 Re: BUG #16741: Error: subplan "SubPlan 1" was not initialized
Previous Message PG Bug reporting form 2020-11-29 05:00:01 BUG #16750: VACUUM sporadically fails to access pgstat on Cygwin