Re: A failure in t/038_save_logical_slots_shutdown.pl

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: A failure in t/038_save_logical_slots_shutdown.pl
Date: 2024-01-11 05:52:29
Message-ID: CAA4eK1LqdozFHNphDzRUUfAwyYxKONPywO_28p-bvZjHr1CmOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 10, 2024 at 2:08 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> I've been observing a failure in t/038_save_logical_slots_shutdown.pl
> of late on my developer system:
>
> t/038_save_logical_slots_shutdown.pl .. 1/?
> # Failed test 'Check that the slot's confirmed_flush LSN is the same
> as the latest_checkpoint location'
> # at t/038_save_logical_slots_shutdown.pl line 35.
> # Looks like you failed 1 test of 2.
> t/038_save_logical_slots_shutdown.pl .. Dubious, test returned 1
> (wstat 256, 0x100)
> Failed 1/2 subtests
>
> I did a quick analysis of the failure and commit
> https://github.com/postgres/postgres/commit/e0b2eed047df9045664da6f724cb42c10f8b12f0
> that introduced this test. I think the issue is that the slot's
> confirmed_flush LSN (0/1508000) and shutdown checkpoint LSN
> (0/1508018) are not the same:
>
> tmp_check/log/038_save_logical_slots_shutdown_pub.log:
>
> 2024-01-10 07:55:44.539 UTC [57621] sub LOG: starting logical
> decoding for slot "sub"
> 2024-01-10 07:55:44.539 UTC [57621] sub DETAIL: Streaming
> transactions committing after 0/1508000, reading WAL from 0/1507FC8.
> 2024-01-10 07:55:44.539 UTC [57621] sub STATEMENT: START_REPLICATION
> SLOT "sub" LOGICAL 0/0 (proto_version '4', origin 'any',
> publication_names '"pub"')
>
> ubuntu:~/postgres$ pg17/bin/pg_controldata -D
> src/test/recovery/tmp_check/t_038_save_logical_slots_shutdown_pub_data/pgdata/
> Database cluster state: in production
> pg_control last modified: Wed Jan 10 07:55:44 2024
> Latest checkpoint location: 0/1508018
> Latest checkpoint's REDO location: 0/1508018
>
> But the tests added by t/038_save_logical_slots_shutdown.pl expects
> both LSNs to be same:
>
> sub compare_confirmed_flush
> {
> # Is it same as the value read from log?
> ok( $latest_checkpoint eq $confirmed_flush_from_log,
> "Check that the slot's confirmed_flush LSN is the same as the
> latest_checkpoint location"
> );
>
> I suspect that it's quite not right to expect the slot's
> confirmed_flush and latest checkpoint location to be same in the test.
>

As per my understanding, the reason we expect them to be the same is
because we ensure that during shutdown, the walsender sends all the
WAL just before shutdown_checkpoint and the confirm_flush also points
to the end of WAL record before shutodwn_checkpoint. So, the next
starting location should be of shutdown_checkpoint record which should
ideally be the same. Do you see this failure consistently?

--
With Regards,
Amit Kapila.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-01-11 05:55:44 Re: A recent message added to pg_upgade
Previous Message Bharath Rupireddy 2024-01-11 05:41:27 Re: introduce dynamic shared memory registry