From: | vignesh C <vignesh21(at)gmail(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Random pg_upgrade 004_subscription test failure on drongo |
Date: | 2025-03-13 09:04:15 |
Message-ID: | CALDaNm3tjY44HoSwY84=XGEbTg0ruVfD4hAMTm=TgBqVysH4Qw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
In one of the recent buildfarm runs, the test
pg_upgrade/004_subscription test fails at [1].
It has failed with the following:
Restoring database schemas in the new cluster
*failure*
Consult the last few lines of
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/004_subscription/data/t_004_subscription_new_sub_data/pgdata/pg_upgrade_output.d/20250310T194018.517/log/pg_upgrade_dump_1.log"
for
the probable cause of the failure.
Failure, exiting
Unfortunately we don't have pg_upgrade_output.d contents in buildfarm
to see what is the exact reason.
I checked old buildfarm failures and found a similar failure in drongo
for pg_upgarde/003_logical_slots at [2].
The test failure can happen because of the reason mentioned at [3] and [6].
I'm just requoting the contents from there, if you have checked [3]
and [6] skip the below contents:
---------------------------------------------
The failure reason for the earlier failure was found to be:
According to it, the TRUNCATE command seemed to be failed due to a
"File exists" error.
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:
could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...
## Analysis
I think it was caused due to the STATUS_DELETE_PENDING failure, not
related with recent
updates for pg_upgrade.
The file "base/1/2683" is an index file for
pg_largeobject_loid_pn_index, and the
output meant that file creation failed. Below is a backtrace.
```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```
But this is strange. Before calling mdcreate(), we surely unlink the
file which have the same name. Below is a trace until unlink.
```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```
I found Thomas said that [4] pgunlink sometimes could not remove a
file even if it returns OK, at that time NTSTATUS is
STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle()
mentions the same thing:
```
/*
* ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
* gone (Windows NT status code is STATUS_DELETE_PENDING). In that
* case, we'd better ask for the NT status too so we can translate it
* to a more Unix-like error. We hope that nothing clobbers the NT
* status in between the internal NtCreateFile() call and CreateFile()
* returning.
*
```
The definition of STATUS_DELETE_PENDING can be seen in [5]. Based on
that, indeed, open() would be able to fail with STATUS_DELETE_PENDING
if the deletion is pending but it is trying to open.
---------------------------------------------
This was fixed by the following change in the target upgrade nodes:
bgwriter_lru_maxpages = 0
checkpoint_timeout = 1h
Attached is a patch in similar lines for 004_subscription.
[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
[2] - https://www.postgresql.org/message-id/flat/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[3] - https://www.postgresql.org/message-id/TYAPR01MB5866A4E7342088E91362BEF0F5BBA%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[4] - https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[5] - https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[6]- https://www.postgresql.org/message-id/f0d303f1-e380-5988-91c7-74b755abd4bb%40gmail.com
Regards,
Vignesh
Attachment | Content-Type | Size |
---|---|---|
0001-Fix-an-intermetant-BF-failure-in-004_subscription.patch | application/octet-stream | 1.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Jakub Wartak | 2025-03-13 09:22:25 | Re: Draft for basic NUMA observability |
Previous Message | Kirill Reshke | 2025-03-13 09:00:52 | Re: relfilenode statistics |