Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: logical replication: restart_lsn can go backwards (and more), seems broken since 9.4
Date: 2024-11-11 23:24:24
Message-ID: 36047d12-b943-4b7f-ad49-ebc7add2b3bc@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/11/24 23:41, Masahiko Sawada wrote:
> On Mon, Nov 11, 2024 at 6:17 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> If this analysis is correct, I think it's rather suspicious we don't
>> reset the candidate fields on restart. Can those "old" values ever be
>> valid? But I haven't tried resetting them.
>
> I had the same question. IIRC resetting them also fixes the
> problem[1]. However, I got a comment from Alvaro[2]:
>
> Hmm, interesting -- I was studying some other bug recently involving the
> xmin of a slot that had been invalidated and I remember wondering if
> these "candidate" fields were being properly ignored when the slot is
> marked not in use; but I didn't check. Are you sure that resetting them
> when the slot is released is the appropriate thing to do? I mean,
> shouldn't they be kept set while the slot is in use, and only reset if
> we destroy it?
>
> Which made me re-investigate the issue and thought that it doesn't
> necessarily need to clear these candidate values in memory on
> releasing a slot as long as we're carefully updating restart_lsn.

Not sure, but maybe it'd be useful to ask the opposite question. Why
shouldn't it be correct to reset the fields, which essentially puts the
slot into the same state as if it was just read from disk? That also
discards all these values, and we can't rely on accidentally keeping
something important info in memory (because if the instance restarts
we'd lose that).

But this reminds me that the patch I shared earlier today resets the
slot in the ReplicationSlotAcquire() function, but I guess that's not
quite correct. It probably should be in the "release" path.

> Which seems a bit efficient for example when restarting from a very
> old point. Of course, even if we reset them on releasing a slot, it
> would perfectly work since it's the same as restarting logical
> decoding with a server restart.

I find the "efficiency" argument a bit odd. It'd be fine if we had a
correct behavior to start with, but we don't have that ... Also, I'm not
quite sure why exactly would it be more efficient?

And how likely is this in practice? It seems to me that
performance-sensitive cases can't do reconnects very often anyway,
that's inherently inefficient. No?

> I think
> LogicalIncreaseRestartDecodingForSlot() should be fixed as it seems
> not to be working expectedly, but I could not have proof that we
> should either keep or reset them on releasing a slot.
>

Not sure. Chances are we need both fixes, if only to make
LogicalIncreaseRestartDecodingForSlot more like the other function.

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2024-11-11 23:31:40 Re: POC: make mxidoff 64 bits
Previous Message Jacob Champion 2024-11-11 23:17:44 Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible