Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, pgsql-committers <pgsql-committers(at)lists(dot)postgresql(dot)org>
Subject: Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
Date: 2022-01-20 19:27:36
Message-ID: 20220120192736.msyesvdlljn6tptw@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

Hi,

On 2022-01-20 09:32:16 -0500, Tom Lane wrote:
> Seems like we need to do 'kill 0' on the PID we get from
> the file to verify that there's really a postmaster there.
> (I wonder how well that works on Windows? perlport claims
> it does, but ...)

We seem to successfully do that in pg_ctl's postmaster_is_alive().

> I wrote:
> > Hmm. I think the problem is that poll_start() thinks it can just call
> > start() a second time after a failure. If it wasn't a true failure
> > but a timeout, then _pid is now set and the second call complains.
>
> Oh, wait --- the case that is failing is after 017_shm.pl has
> intentionally kill -9'd a postmaster, so that its pidfile is
> left behind. The next attempted start fails on shmem id
> conflict, but it doesn't remove the old pidfile, and then
> the code I added to sub start erroneously picks that up
> as a live postmaster PID.

> I fear I still don't have the whole story though because
> per this theory it should fail everywhere, yet it doesn't.

The sequence in regress_log_* that hoverfly is getting and that I am seeing
locally differs. It's a bit hard to tell, because the test doesn't report any
test progress until quite a bit into the test...

For the first kill9, I see:

local:
### Killing node "gnat" using signal 9
...
### Starting node "gnat"
# Running: pg_ctl -w -D /home/andres/build/postgres/dev-assert/vpath/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata -l /home/andres/build/postgres/dev>
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.... done
server started
# Postmaster PID for node "gnat" is 2904505

hoverfly:
### Killing node "gnat" using signal 9
...
### Starting node "gnat"
# Running: pg_ctl -w -D /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata -l /scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/log/017_shm_gnat.log -o --cluster-name=gnat start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.
# pg_ctl start failed; logfile:

local log:
2022-01-20 11:02:40.109 PST [2905213] LOG: starting PostgreSQL 15devel on x86_64-pc-linux-gnu, compiled by gcc-12 (Debian 12-20220106-1) 12.0.0 20220106 (experimental) [master r12-6277-g1935db29689], 64-bit
2022-01-20 11:02:40.109 PST [2905213] LOG: listening on Unix socket "/tmp/jAyaRMImQH/.s.PGSQL.53851"
2022-01-20 11:02:40.110 PST [2905216] LOG: database system was interrupted; last known up at 2022-01-20 11:02:39 PST
2022-01-20 11:02:40.111 PST [2905216] LOG: database system was not properly shut down; automatic recovery in progress

hoverfly:
2022-01-20 18:04:53.087 UTC [50659690:1] LOG: starting PostgreSQL 15devel on powerpc-ibm-aix7.2.4.0, compiled by /opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY, 64-bit
2022-01-20 18:04:53.088 UTC [50659690:2] LOG: listening on Unix socket "/home/nm/farm/tmp/oR5PbI8A5R/.s.PGSQL.61673"
2022-01-20 18:04:53.092 UTC [393770:1] LOG: database system was shut down at 2022-01-20 18:04:52 UTC
2022-01-20 18:04:53.099 UTC [50659690:3] LOG: database system is ready to accept connections
2022-01-20 18:04:53.275 UTC [33751596:1] FATAL: lock file "postmaster.pid" already exists
2022-01-20 18:04:53.275 UTC [33751596:2] HINT: Is another postmaster (PID 50659690) running in data directory "/scratch/nm/farm/xlc64v16/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"?

IOW, the test doesn't actually quite seem to be working in the "local" case?
Hence not seeing the problem of picking up some wrong pid.

I think it's a question of how quick the postmaster-death detection logic
is. At first I thought it's a question of USE_POSTMASTER_DEATH_SIGNAL, but I
see the same with the pipe based approach. So it's probably just a question of
event ordering.

I think this test could benefit from
$gnat->append_conf('postgresql.conf', 'log_min_messages=debug3');
or such, so that we see processes exiting.

I also think we ought to emit a debug message when detecting postmaster
death. It's unnecessarily hard to debug them right now.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2022-01-20 19:56:07 Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
Previous Message Robert Haas 2022-01-20 15:51:52 pgsql: Support base backup targets.