From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org, Magnus Hagander <magnus(at)hagander(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | pg_basebackup WAL streamer shutdown is bogus - leading to slow tests |
Date: | 2022-01-16 09:22:07 |
Message-ID: | 20220116092207.me3kihxgjcbyshae@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
I was wondering in [1] what we could do about the slowest tests on
windows.
On 2021-12-31 11:25:28 -0800, Andres Freund wrote:
> Picking a random successful cfbot run [1] I see the following tap tests taking
> more than 20 seconds:
>
> 67188 ms pg_basebackup t/010_pg_basebackup.pl
> 59710 ms recovery t/001_stream_rep.pl
Comparing these times to measurements taken on my normal linux workstation,
something seemed just *very* off, even with a slow CI instance and windows in
the mix.
A bunch of printf debugging later, I realized the problem is that several of
the pg_basebackups in tests take a *long* time. E.g. for t/001_stream_rep.pl
the backups from the standby each take just over 10s. That's awfully
specific...
# Taking pg_basebackup my_backup from node "standby_1"
# Running: pg_basebackup -D C:/dev/postgres/./tmp_check/t_001_stream_rep_standby_1_data/backup/my_backup -h C:/Users/myadmin/AppData/Local/Temp/yba26PBYX1 -p 59181 --checkpoint fast --no-sync --label my_backup -v
# ran in 10.145s
# Backup finished
This reproduceably happens and it's *not* related to the socket shutdown()
changes we've been debugging lately - even after a revert the problem
persists.
Because our logging for basebackups is quite weak, both for server and client
side, I needed to add a fair bit more debugging to figure it out:
pg_basebackup: wait to finish at 0.492
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: stream poll timeout 10.112
The problem is that there's just no implemented way to timely shutdown the WAL
streaming thread in pg_basebackup. The code in pg_basebackup.c says:
if (verbose)
pg_log_info("waiting for background process to finish streaming ...");
...
/*
* On Windows, since we are in the same process, we can just store the
* value directly in the variable, and then set the flag that says
* it's there.
*/
...
xlogendptr = ((uint64) hi) << 32 | lo;
InterlockedIncrement(&has_xlogendptr);
But just setting a variable doesn't do much if the thread is in
HandleCopyStream()->CopyStreamPoll()->select()
The only reason we ever succeed shutting down, without more WAL coming in, is
that pg_basebackup defaults to sending a status message every 10 seconds. At
which point the thread sees has_xlogendptr = true, and shuts down.
A test specific workaround would be to just add --status-interval=1 to
Cluster.pm::backup(). But that seems very unsatisfying.
I don't immediately see a solution for this, other than to add
StreamCtl->stop_event (mirroring ->stop_socket) and then convert
CopyStreamPoll() to use WaitForMultipleObjects(). Microsoft's select()
doesn't support pipes and there's no socketpair().
Any more straightforward ideas?
From a cursory look at history, it used to be that pg_basebackup had this
behaviour on all platforms, but it got fixed for other platforms in
7834d20b57a by Tom (assuming the problem wasn't present there).
Greetings,
Andres Freund
[1] https://postgr.es/m/20211231192528.wirwj4qaaw3ted5g%40alap3.anarazel.de
From | Date | Subject | |
---|---|---|---|
Next Message | Julien Rouhaud | 2022-01-16 09:39:38 | Re: Isn't wait_for_catchup slightly broken? |
Previous Message | Thomas Munro | 2022-01-16 09:19:30 | Re: XLogReadRecord() error in XlogReadTwoPhaseData() |