RE: Introduce XID age and inactive timeout based replication slot invalidation

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RE: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-12-26 06:02:20
Message-ID: OS0PR01MB571666018400F782BD1FDD1C940D2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Tuesday, December 24, 2024 8:57 PM Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com> wrote:

Hi,

> Yesterday I got a strange set of test errors, probably somehow related to
> that patch. It happened on changed master branch (based on
> d96d1d5152f30d15678e08e75b42756101b7cab6) but I don't think my changes were
> affecting it.
>
> My setup is a little bit tricky: Windows 11 running WSL2 with Ubuntu, meson.
>
> So, `recovery` suite started failing on:
>
> 1) at /src/test/recovery/t/019_replslot_limit.pl line 530.
> 2) at /src/test/recovery/t/040_standby_failover_slots_sync.pl line 198.
>
> It was failing almost every run, one test or another. I was lurking around
> for about 10 min, and..... it just stopped failing. And I can't reproduce it
> anymore.
>
> But I have logs of two fails. I am not sure if it is helpful, but decided to
> mail them here just in case.

Thanks for reporting the issue.

After checking the logs, I think the failures were caused by unexpected
behavior of the local system clock.

In '019_replslot_limit_primary4.log' [1], the log timestamps jump from
01:37:20.025 back to 01:37:19.388, so the clock went backwards, which made the
slot's inactive_since go backwards as well. That's why the last test case
didn't pass.
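
For anyone who wants to check their own logs for the same symptom, here is a
minimal sketch that scans log lines for timestamps that move backwards. It
assumes each line starts with a "YYYY-MM-DD HH:MM:SS.mmm" timestamp, as in the
excerpt in [1]; the function name, the regex, and the tolerance parameter are
illustrative and not part of the test suite. Lines written by different
processes can be interleaved slightly out of order, so a small tolerance may be
needed on a busy server.

```python
import re
from datetime import datetime, timedelta

# Matches the leading "YYYY-MM-DD HH:MM:SS.mmm" timestamp of a log line.
# Assumption: the log_line_prefix begins with a millisecond timestamp.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def find_backward_jumps(lines, tolerance=timedelta(0)):
    """Return (previous, current) timestamp pairs where the logged time
    decreases by more than `tolerance` between consecutive stamped lines."""
    jumps = []
    prev = None
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip continuation lines and elided "..." markers
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev is not None and prev - ts > tolerance:
            jumps.append((prev, ts))
        prev = ts  # always compare against the most recent stamped line
    return jumps
```

Run against the three stamped lines quoted in [1], this reports a single jump
of 637 ms (from 01:37:20.025 back to 01:37:19.388).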

And for 040_standby_failover_slots_sync, the logs show that the standby's clock
lags behind the primary's, which caused the inactive_since of the newly synced
slot on the standby to be earlier than the one on the primary.

So, I think this is not a bug in the committed patch but an issue in the
testing environment. Besides, since we have not seen such failures on the
buildfarm (BF), I think it may not be necessary to improve the test cases.

[1]
2024-12-24 01:37:19.967 CET [161409] sub STATEMENT: START_REPLICATION SLOT "lsub4_slot" LOGICAL 0/0 (proto_version '4', streaming 'parallel', origin 'any', publication_names '"pub"')
...
2024-12-24 01:37:20.025 CET [161447] 019_replslot_limit.pl LOG: statement: SELECT '0/30003D8' <= replay_lsn AND state = 'streaming'
...
2024-12-24 01:37:19.388 CET [161097] LOG: received fast shutdown request

Best Regards,
Hou zj
