From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> |
Cc: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | RE: Introduce XID age and inactive timeout based replication slot invalidation |
Date: | 2024-12-26 06:02:20 |
Message-ID: | OS0PR01MB571666018400F782BD1FDD1C940D2@OS0PR01MB5716.jpnprd01.prod.outlook.com |
Lists: | pgsql-hackers |
On Tuesday, December 24, 2024 8:57 PM Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com> wrote:
Hi,
> Yesterday I got a strange set of test errors, probably somehow related to
> that patch. It happened on changed master branch (based on
> d96d1d5152f30d15678e08e75b42756101b7cab6) but I don't think my changes were
> affecting it.
>
> My setup is a little bit tricky: Windows 11 run WSL2 with Ubuntu, meson.
>
> So, `recovery` suite started failing on:
>
> 1) at /src/test/recovery/t/019_replslot_limit.pl line 530.
> 2) at /src/test/recovery/t/040_standby_failover_slots_sync.pl line
> 198.
>
> It was failing almost every run, one test or another. I was lurking around
> for about 10 min, and..... it just stopped failing. And I can't reproduce it
> anymore.
>
> But I have logs of two fails. I am not sure if it is helpful, but decided to
> mail them here just in case.
Thanks for reporting the issue.
After checking the logs, I think the failures were caused by unexpected
behavior of the local system clock.
It's clear from '019_replslot_limit_primary4.log' [1] that the clock went
backwards, which made the slot's inactive_since go backwards as well. That's
why the last test case didn't pass.
For 040_standby_failover_slots_sync, we can see that the standby's clock
lagged behind the primary's, which caused the inactive_since of the newly
synced slot on the standby to be earlier than the one on the primary.
So I think this is not a bug in the committed patch but an issue in the testing
environment. Besides, since we have not seen such failures on the buildfarm
(BF), I don't think it is necessary to change the test cases.
[1]
2024-12-24 01:37:19.967 CET [161409] sub STATEMENT: START_REPLICATION SLOT "lsub4_slot" LOGICAL 0/0 (proto_version '4', streaming 'parallel', origin 'any', publication_names '"pub"')
...
2024-12-24 01:37:20.025 CET [161447] 019_replslot_limit.pl LOG: statement: SELECT '0/30003D8' <= replay_lsn AND state = 'streaming'
...
2024-12-24 01:37:19.388 CET [161097] LOG: received fast shutdown request
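The backward step visible in the excerpt above can also be checked mechanically. Below is a minimal sketch, not part of the PostgreSQL test suite: it assumes each log line starts with a "YYYY-MM-DD HH:MM:SS.mmm" timestamp as in [1], ignores the time zone abbreviation, and uses a hypothetical helper name (find_backward_steps). Lines written by different processes can legitimately be slightly out of order, so a tolerance parameter is included.

```python
import re
from datetime import datetime, timedelta

# Leading timestamp of a log line such as
# "2024-12-24 01:37:19.967 CET [161409] ...". The time zone is ignored.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def find_backward_steps(lines, tolerance=timedelta(0)):
    """Return (previous, current) timestamp pairs where the current line is
    earlier than the preceding timestamped line by more than `tolerance`.

    Hypothetical helper for illustration only.
    """
    steps = []
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if m is None:
            continue  # skip "..." and any line without a leading timestamp
        cur = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None and prev - cur > tolerance:
            steps.append((prev, cur))
        prev = cur
    return steps


# The three timestamped lines from excerpt [1], shortened after the prefix.
excerpt = [
    "2024-12-24 01:37:19.967 CET [161409] sub STATEMENT: START_REPLICATION ...",
    "...",
    "2024-12-24 01:37:20.025 CET [161447] 019_replslot_limit.pl LOG: statement: ...",
    "...",
    "2024-12-24 01:37:19.388 CET [161097] LOG: received fast shutdown request",
]

for prev, cur in find_backward_steps(excerpt):
    print(f"timestamp went backwards by {prev - cur}: {prev} -> {cur}")
```

Run against the full primary log with a small tolerance (for example, a few milliseconds), this kind of scan would separate ordinary interleaving between backends from a real clock step like the one here.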
Best Regards,
Hou zj