From: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com> |
---|---|
To: | 'Amit Kapila' <amit(dot)kapila16(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "andrew(at)dunslane(dot)net" <andrew(at)dunslane(dot)net> |
Subject: | RE: Random pg_upgrade test failure on drongo |
Date: | 2024-01-09 05:49:26 |
Message-ID: | TY3PR01MB98894D8BE99AE53217C96C0AF56A2@TY3PR01MB9889.jpnprd01.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Dear Amit, Alexander,
> > We get the effect discussed when the background writer process decides to
> > flush a file buffer for pg_largeobject during stage 1.
> > (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> > the result must be the same.)
> > And another important factor is shared_buffers = 1MB (set during the test).
> > With the default setting of 128MB I couldn't see the failure.
> >
> > It can be reproduced easily (on old Windows versions) just by running
> > pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> > default cluster)).
> > If an old cluster contains dozen of databases, this increases the failure
> > probability significantly (with 10 additional databases I've got failures
> > on iterations 4, 1, 6).
> >
>
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
>
> [1] - https://commitfest.postgresql.org/40/3951/
Based on the suggestion by Amit, I have created a patch with the alternative
approach. This just does GUC settings. The reported failure is only for
003_logical_slots, but the patch also includes changes for the recently added
test, 004_subscription. IIUC, there is a possibility that 004 would fail as well.
Per our understanding, this patch can stop random failures. Alexander, can you
test for the confirmation?
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Attachment | Content-Type | Size |
---|---|---|
0001-Add-GUC-settings-to-avoid-random-failures-on-Windows.patch | application/octet-stream | 2.1 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2024-01-09 05:56:34 | Re: Emit fewer vacuum records by reaping removable tuples during pruning |
Previous Message | jian he | 2024-01-09 05:33:33 | Re: SQL:2011 application time |