DELETE PENDING strikes back, via pg_ctl stop/start

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: DELETE PENDING strikes back, via pg_ctl stop/start
Date: 2024-08-21 10:00:00
Message-ID: 8eda5ecc-24c5-95ce-d719-1585e2d693b2@gmail.com
Lists: pgsql-hackers

Hello hackers,

As a recent failure produced by drongo [1] shows, a pg_ctl stop/start
sequence may break on Windows due to the transient DELETE PENDING state of
postmaster.pid.

Please look at the excerpt from the failure log:
...
pg_createsubscriber: stopping the subscriber
2024-08-19 18:02:47.608 UTC [6988:4] LOG:  received fast shutdown request
2024-08-19 18:02:47.608 UTC [6988:5] LOG:  aborting any active transactions
2024-08-19 18:02:47.612 UTC [5884:2] FATAL:  terminating walreceiver process due to administrator command
2024-08-19 18:02:47.705 UTC [7036:1] LOG:  shutting down
pg_createsubscriber: server was stopped
### server instance (1) has so far emitted only "shutting down", but pg_ctl
### considered it stopped and returned 0 to pg_createsubscriber
[18:02:47.900](2.828s) ok 29 - run pg_createsubscriber without --databases
...
pg_createsubscriber: starting the standby with command-line options
pg_createsubscriber: pg_ctl command is: ...
2024-08-19 18:02:48.163 UTC [5284:1] FATAL:  could not create lock file "postmaster.pid": File exists
pg_createsubscriber: server was started
pg_createsubscriber: checking settings on subscriber
### pg_createsubscriber attempts to start a new server instance (2), but
### it fails because "postmaster.pid" is still found on disk
2024-08-19 18:02:48.484 UTC [6988:6] LOG:  database system is shut down
### the server instance (1) is finally stopped and postmaster.pid unlinked

With extra debug logging and the ntries limit decreased to 10 (in
CreateLockFile()), I reproduced the failure easily (when running 20 tests
in parallel) and got additional information (see attached).

IIUC, the issue is caused by inconsistent checks for the existence of
postmaster.pid:
"pg_ctl stop" ... -> get_pgpid() calls fopen(pid_file, "r"),
which fails with ENOENT for a file in the DELETE_PENDING state (see
pgwin32_open_handle()), so pg_ctl concludes that the server has stopped.

"pg_ctl start" ... -> CreateLockFile() calls
    fd = open(filename, O_RDWR | O_CREAT | O_EXCL, pg_file_create_mode);
which fails with EEXIST for the same state of postmaster.pid.
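
To make the two checks easier to compare, here is a minimal sketch of both
probes as stand-alone helpers. It is not code from pg_ctl or the backend:
the names probe_read(), probe_exclusive_create() and probes_disagree() are
hypothetical, and the 0600 mode is only an assumed stand-in for
pg_file_create_mode. On a POSIX system the two probes always agree; the
combination flagged by probes_disagree() is the one described above for a
file in the DELETE_PENDING state on Windows.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * "Stop"-side probe (hypothetical helper): try to open the file for
 * reading, as fopen(pid_file, "r") would.
 * Returns 0 if the file could be opened, otherwise the errno value.
 */
static int
probe_read(const char *path)
{
    FILE   *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return errno;
    fclose(f);
    return 0;
}

/*
 * "Start"-side probe (hypothetical helper): try to create the file
 * exclusively, as open(..., O_RDWR | O_CREAT | O_EXCL, mode) would.
 * Returns 0 if the file was created, otherwise the errno value.
 * The 0600 mode is an assumption made for this sketch.
 */
static int
probe_exclusive_create(const char *path)
{
    int     fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}

/*
 * The inconsistent outcome: the reader sees "no such file" while the
 * exclusive create sees "file exists".
 */
static int
probes_disagree(int read_err, int create_err)
{
    return read_err == ENOENT && create_err == EEXIST;
}
```

Calling probe_read() and then probe_exclusive_create() on the same path
and passing both results to probes_disagree() would report whether the
file is in that contradictory state at that moment.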

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-19%2017%3A32%3A54

Best regards,
Alexander

Attachment Content-Type Size
pg_ctl-debugging.patch text/x-patch 3.1 KB
regress_log_040_pg_createsubscriber.tar.bz2 application/x-bzip 7.6 KB
