From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
Cc: | "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Fix 035_standby_logical_decoding.pl race conditions |
Date: | 2025-04-07 09:46:07 |
Message-ID: | CAA4eK1Ku9MdGd1L74N=YKjSvB_9-2eQLsXz-LdPS1Cfnrs8TRw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Apr 3, 2025 at 3:15 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Thu, Apr 3, 2025 at 12:29 PM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> >
> > On Thu, Apr 03, 2025 at 05:34:10AM +0000, Hayato Kuroda (Fujitsu) wrote:
> > > Dear Bertrand, Amit,
> > >
> > > > > I do prefer v5-PG17-2 as it is "closer" to HEAD. That said, I think that we should
> > > > > keep the slots active and only avoid doing the checks for them (they are
> > > > invalidated
> > > > > that's fine, they are not that's fine too).
> > > > >
> > > >
> > > > I don't mind doing that, but there is no benefit in making slots
> > > > active unless we can validate them. And we will end up adding some
> > > > more checks, as in function check_slots_conflict_reason without any
> > > > advantage.
> >
> > I think that there is advantage. The pros are:
> >
> > - the test would be closer to HEAD from a behavioural point of view
> > - it's very rare to hit the corner cases: so the test would behave the same
> > as on HEAD most of the time (and when it does not that would not hurt as the
> > checks are nor done)
> > - Kuroda-San's patch removes "or die "replication slot stats of vacuum_full_activeslot not updated"
> > while keeping the slot active is able to keep it (should the slot being invalidated
> > or not). But more on that in the comment === 1 below.
> >
> > > I feel Kuroda-San's second patch is simple, and we have
> > > > fewer chances to make mistakes and easy to maintain in the future as
> > > > well.
> >
> > Yeah maybe but the price to pay is to discard the pros above. That said, I'm also
> > fine with Kuroda-San's patch if both of you feel that it's better.
> >
> > > I have concerns for Bertrand's patch that it could introduce another timing
> > > issue. E.g., if the activated slots are not invalidated, dropping slots is keep
> > > being activated so the dropping might be fail.
> >
> > Yeah, but the drop is done with "$node_standby->psql" so that the test does not
> > produce an error. It would produce an error should we use "$node_standby->safe_psql"
> > instead.
> >
> > > I did not reproduce this but
> > > something like this can happen if we activate slots.
> >
> > You can see it that way (+ reproducer.txt):
> >
> > "
> > + my $bdt = $node_standby->safe_psql('postgres', qq[SELECT * from pg_replication_slots]);
> > + note "BDT: $bdt";
> > +
> > $node_standby->psql('postgres',
> > qq[SELECT pg_drop_replication_slot('$inactive_slot')]);
> > "
> >
> > You'd see the slot being active and the "$node_standby->psql" not reporting
> > any error.
> >
>
> Hmm, but adding some additional smarts also makes this test less easy
> to backpatch. I see your points related to the benefits, but I still
> mildly prefer to go with the lesser changes approach for backbranches
> patch. Normally, we don't enhance backbranches code without making
> equivalent changes in HEAD, so adding some new bugs only in
> backbranches has a lesser chance.
>
Bertrand, do you agree with the fewer changes approach (where active
slots won't be tested) for backbranches? I think now that we have
established that the vacuum full test is also prone to failure due to
race condition in the test, this is the only remaining open point.
--
With Regards,
Amit Kapila.
From | Date | Subject | |
---|---|---|---|
Next Message | Etsuro Fujita | 2025-04-07 09:50:40 | Re: Typo in comment for pgstat_database_flush_cb() |
Previous Message | Amit Kapila | 2025-04-07 09:37:08 | Re: BUG #18815: Logical replication worker Segmentation fault |