Random pg_upgrade 004_subscription test failure on drongo

From: vignesh C <vignesh21(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Random pg_upgrade 004_subscription test failure on drongo
Date: 2025-03-13 09:04:15
Message-ID: CALDaNm3tjY44HoSwY84=XGEbTg0ruVfD4hAMTm=TgBqVysH4Qw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In one of the recent buildfarm runs, the test
pg_upgrade/004_subscription test fails at [1].
It has failed with the following:
Restoring database schemas in the new cluster
*failure*

Consult the last few lines of
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
for
the probable cause of the failure.
Failure, exiting

Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.
I checked old buildfarm failures and found a similar failure in drongo
for pg_upgarde/003_logical_slots at [2].

The test failure can happen because of the reason mentioned at [3] and [6].
I'm just requoting the contents from there, if you have checked [3]
and [6] skip the below contents:
---------------------------------------------
The failure reason for the earlier failure was found to be:
According to it, the TRUNCATE command seemed to be failed due to a
"File exists" error.
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:
could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...

## Analysis
I think it was caused due to the STATUS_DELETE_PENDING failure, not
related with recent
updates for pg_upgrade.

The file "base/1/2683" is an index file for
pg_largeobject_loid_pn_index, and the
output meant that file creation failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the
file which have the same name. Below is a trace until unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found Thomas said that [4] pgunlink sometimes could not remove a
file even if it returns OK, at that time NTSTATUS is
STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
mentions the same thing:

```
/*
* ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
* gone (Windows NT status code is STATUS_DELETE_PENDING). In that
* case, we'd better ask for the NT status too so we can translate it
* to a more Unix-like error. We hope that nothing clobbers the NT
* status in between the internal NtCreateFile() call and CreateFile()
* returning.
*
```

The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
that, indeed, open() would be able to fail with STATUS_DELETE_PENDING
if the deletion is pending but it is trying to open.
---------------------------------------------

This was fixed by the following change in the target upgrade nodes:
bgwriter_lru_maxpages = 0
checkpoint_timeout = 1h

Attached is a patch in similar lines for 004_subscription.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
[2] - https://www.postgresql.org/message-id/flat/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3] - https://www.postgresql.org/message-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[4] - https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[5] - https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[6]- https://www.postgresql.org/message-id/f0d303f1-e380-5988-91c7-74b755abd4bb%40gmail.com

Regards,
Vignesh

Attachment Content-Type Size
0001-Fix-an-intermetant-BF-failure-in-004_subscription.patch application/octet-stream 1.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jakub Wartak 2025-03-13 09:22:25 Re: Draft for basic NUMA observability
Previous Message Kirill Reshke 2025-03-13 09:00:52 Re: relfilenode statistics