Re: Instability with incremental backup tests (pg_combinebackup, 003_timeline.pl)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Instability with incremental backup tests (pg_combinebackup, 003_timeline.pl)
Date: 2024-08-21 12:58:31
Message-ID: CA+TgmobXg7gsJohg3Z_Wz7Vdc2va+4WMKHjwDH=dggo9pV8GgQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 6, 2024 at 1:49 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> dikkop has reported a failure with the regression tests of pg_combinebackup:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-08-04%2010%3A04%3A51
>
> That's in the test 003_timeline.pl, from dc212340058b:
> # Failed test 'incremental backup from node1'
> # at t/003_timeline.pl line 43.
>
> The node is extremely slow, so perhaps bumping up the timeout would be
> fine enough in this case (did not spend time analyzing it). I don't
> think that this has been discussed, but perhaps I just missed a
> reference to it and the incremental backup thread is quite large.

I just noticed, rather belatedly, that this thread is on the open
items list. This seems to be the cause of the failure:

2024-08-04 12:46:34.986 UTC [4951:15] 003_timeline.pl STATEMENT:
START_REPLICATION SLOT "pg_basebackup_4951" 0/4000000 TIMELINE 1
2024-08-04 12:47:34.987 UTC [4951:16] 003_timeline.pl LOG:
terminating walsender process due to replication timeout

wal_sender_timeout is 60s by default, so that tracks. The command that
provokes this failure is:

pg_basebackup -D
/mnt/data/buildfarm/buildroot/HEAD/pgsql.build/src/bin/pg_combinebackup/tmp_check/t_003_timeline_node1_data/backup/backup2
--no-sync -cfast --incremental
/mnt/data/buildfarm/buildroot/HEAD/pgsql.build/src/bin/pg_combinebackup/tmp_check/t_003_timeline_node1_data/backup/backup1/backup_manifest
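
As a quick sanity check on the "that tracks" claim, the gap between the
two log lines quoted above can be compared against the timeout. This is
only a sketch: the timestamps are copied from the log excerpt, and the
60s value is hardcoded on the assumption that the test node runs with
the default wal_sender_timeout.

```python
from datetime import datetime, timedelta

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the two log lines quoted above.
replication_started = datetime.strptime("2024-08-04 12:46:34.986", LOG_TS_FORMAT)
walsender_terminated = datetime.strptime("2024-08-04 12:47:34.987", LOG_TS_FORMAT)

# Assumption: the node uses the default wal_sender_timeout of 60s.
wal_sender_timeout = timedelta(seconds=60)

elapsed = walsender_terminated - replication_started
timed_out = elapsed >= wal_sender_timeout

print(f"elapsed={elapsed.total_seconds()}s timed_out={timed_out}")
```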

All we're doing here is taking an incremental backup of a 1-table
database that had 1 row at the time of the full backup and has had 1
more row inserted since then. On my system, the last time I ran this
regression test, this step completed in 410ms. It shouldn't be
expensive. So I'm inclined to chalk this up to the machine not having
enough resources. The only thing that I don't really understand is why
this particular test would fail vs. anything else. We have a bunch of
tests that take backups. A possibly important difference here is that
this one is an incremental backup, so it would need to read WAL
summary files from the beginning of the full backup to the beginning
of the current backup and combine them into one super-summary that it
could then use to decide what to include in the incremental backup.
However, since this is an artificial example with just 1 insert
between the full and the incremental, it's hard to imagine that being
expensive, unless there's some low-probability bug that makes it go
into an infinite loop or chew up a million CPU cycles or something.
That's not impossible, but given the discussion between you and Tomas,
I'm kinda hoping it was just a hardware issue.
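
To illustrate why the summary-combining step should be cheap here, a toy
model in Python follows. It is not the server's implementation, which is
written in C and has its own on-disk format. It assumes, roughly, that
each WAL summary records which blocks of which relations were modified
over some LSN range, so that combining summaries amounts to a
per-relation union. The function name, the relation name "table1", and
the block numbers are all hypothetical.

```python
from collections import defaultdict


def combine_summaries(summaries):
    """Union the per-relation sets of modified blocks across summaries."""
    combined = defaultdict(set)
    for summary in summaries:
        for relation, blocks in summary.items():
            combined[relation] |= set(blocks)
    return dict(combined)


# Hypothetical summaries covering the span between the full backup and
# the incremental one; only one of them saw the single INSERT.
summaries = [
    {},                # nothing modified in this LSN range
    {"table1": {0}},   # the one inserted row touched one block
    {},
]

combined = combine_summaries(summaries)
print(combined)
```

With one insert between the two backups, the combined result in this
model is a single relation with a single block, which is why it is hard
to see the real step being expensive.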

Barring objections or other similar trouble reports, I think we should
just close out this open item.

--
Robert Haas
EDB: http://www.enterprisedb.com
