Function pg_database_size fails with "Permission denied" on a corrupted fsm file

From: Steve Komarov <skomarov(at)tableau(dot)com>
To: "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Function pg_database_size fails with "Permission denied" on a corrupted fsm file
Date: 2019-06-21 02:41:35
Message-ID: MWHPR03MB32643208F100F7E7D0F59382CFE70@MWHPR03MB3264.namprd03.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi all,

I've encountered an issue while attempting to estimate the size of a database after executing a sequence of schema-related SQL statements. The issue is that the function pg_database_size() fails with the error "Permission denied" when attempting to access an _fsm file in the PGDATA directory.

Apologies in advance that I cannot provide the full sequence of SQL statements that leads to the failure. The entire sequence is very large, and I haven't been able to isolate a minimal reproducible subset. What I have been able to isolate is the last SQL statement after which a corrupted fsm file appears in the PGDATA directory, in addition to some other facts that I hope would be enough to serve as a starting point of an investigation.

The issue is reproducible with 100% consistency on multiple Windows versions. I've reproduced it at least 20 times during my trials in order to capture the information below.

Summary:

Two otherwise independent operations A and B are performed back-to-back on the same database. Both operations are implemented as Java apps that execute SQL statements through shell invocations of psql.exe. The first operation A is lengthy (about 2 mins) and involves multiple steps including the copying, renaming, and dropping of a schema. The second operation B is simply a call of pg_database_size(). When A and B are performed in that order back-to-back, A succeeds and B fails with the error message:

ERROR: could not stat file "base/16387/65709_fsm": Permission denied

The exact file name varies between trials, but the error is always the same. I refer to the file from the error message as the "corrupted" file for a lack of a better word.

The facts:

* Postgres Version: 9.6.11.
* OS: Reproduces on Windows Server 2012, Windows Server 2016, Windows 10. Does NOT reproduce on Linux.
* The issue reproduces 100% of the time when A and B are run back-to-back, i.e. with no pause between the two. It sometimes reproduces when there is a 5 minute pause. It never reproduces when there is a 15 minute pause.
* Disabling the Anti-virus software does not make a difference.
* There are no relevant errors in the Windows system logs. The only error in the Postgres log is the one above.
* After a back-to-back run of A and B, and upon inspection of the corrupted fsm file, 65709_fsm in the example above, it appears that:
* The corrupted fsm file cannot be copied nor opened even as an administrator. It CAN be deleted.
* The corrupted fsm file has no owner (uid and gid are reported as -1 /Unknown by "stat 65709_fsm" in a Cygwin shell).
* Attempting to list its permissions with the Windows utility "icacls" fails with "Access is denied" even as an administrator.
* The "created", "accessed", and "modified" timestamps however are readable.
* According to the "created" timestamp, the corrupted fsm file is created some time during operation A.
* When the corrupted fsm file is first created during operation A, it appears to be healthy in the sense that it is indistinguishable from the rest of the files in the same directory in terms of its permissions.
* Towards the end of operation A the statement "DROP SCHEMA "xyz" CASCADE" is executed (through a shell invocation of psql.exe) and right after the execution of drop statement the fsm file becomes corrupted.
* This was determined by placing a debug breakpoint on every SQL statement execution in operation A, and running the statements one-by-one, until a corrupted file in the PGDATA directory appears. I used the following command in a Cygwin shell to search for corrupted files: "pgsql\data\base\16387>find . | xargs stat | grep Unknown -B 4 -A 4"
* The only process holding an open handle on the corrupted file after operation A completes is postgres.exe.
* This was determined by using "Find handle or DLL" in Process Explorer from SysInternals and searching for the corrupted file name.
* Stopping Postgres gracefully makes the corrupted fsm file immediately disappear.
* Terminating Postgres forcibly ("terminate process" from SysInernals Process Explorer) makes the corrupted fsm file immediately disappear.
* Letting Postgres run in its steady state causes the corrupted fsm file to disappear about 7 minutes after the completion of operation A.
* After the corrupted file disappears (or is manually deleted), running operation B (pg_database_size) always succeeds. On the other hand, operation B will always fail if the corrupted file is present.
* The corrupted file is always a file with a name ending with _fsm.
* On some databases operation A causes a single fsm file to be corrupted. On others, there are multiple. The highest I've seen in my trials is 5 corrupted fsm files after a single run of operation A.

Other info:
I found a discussion of a similar "Permission Denied" error dating all the way back to 2010: https://www.postgresql.org/message-id/db471ace1001291612l5b21d85du296e4820553e0365%40mail.gmail.com
The suspected culprit in the linked case was the AV software, which is why I tried disabling the AV software in my trials. It made no difference.

According to another discussion, a file on Windows that has no apparent owner and permissions is a file that has been deleted while there's an open lock on it: https://serverfault.com/questions/149833/ntfs-folder-has-no-owner-no-permissions-cant-be-deleted

Regards,
Steven Komarov

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Михаил Денисов 2019-06-21 06:00:56 segfault during SELECT using && ANY (ARRAY[NULL]::BOX2D).
Previous Message Tom Lane 2019-06-21 01:54:46 Re: BUG #15865: ALTER TABLE statements causing "relation already exists" errors when some indexes exist