Re: Introduce XID age and inactive timeout based replication slot invalidation

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Peter Smith <smithpb2250(at)gmail(dot)com>, Ajin Cherian <itsajin(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Introduce XID age and inactive timeout based replication slot invalidation
Date: 2024-09-18 12:10:10
Message-ID: CAA4eK1LnVV2FzB4+kSY5m2yyG4sr94E19Ng6OCTFzMJQr57X0g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 16, 2024 at 10:41 PM Bharath Rupireddy
<bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>
> Thanks for looking into this.
>
> On Mon, Sep 16, 2024 at 4:54 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > Why raise the ERROR just for timeout invalidation here and why not if
> > the slot is invalidated for other reasons? This raises the question of
> > what happens before this patch if the invalid slot is used from places
> > where we call ReplicationSlotAcquire(). I did a brief code analysis
> > and found that for StartLogicalReplication(), even if the error won't
> > occur in ReplicationSlotAcquire(), it would have been caught in
> > CreateDecodingContext(). I think that is where we should also add this
> > new error. Similarly, pg_logical_slot_get_changes_guts() and other
> > logical replication functions should be calling
> > CreateDecodingContext() which can raise the new ERROR. I am not sure
> > about how the invalid slots are handled during physical replication,
> > please check the behavior of that before this patch.
>
> When physical slots are invalidated for the wal_removed reason, the failure happens at a much later point for the streaming standbys, while reading the requested WAL files, like the following:
>
> 2024-09-16 16:29:52.416 UTC [876059] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000005 has already been removed
> 2024-09-16 16:29:52.416 UTC [872418] LOG: waiting for WAL to become available at 0/5002000
>
> At this point, despite the slot being invalidated, its wal_status can still come back to 'unreserved' even from 'lost', and the standby can catch up if the removed WAL files are copied, either manually or by a tool/script, to the primary's pg_wal directory. IOW, the physical slots invalidated due to wal_removed are *somehow* recoverable, unlike the logical slots.
>
> IIUC, the invalidation of a slot implies that it is not guaranteed to hold any resources like WAL and XMINs. Does it also imply that the slot must be unusable?
>

If we can't hold the dead rows against the xmin of the invalid slot, then
how can we make it usable even after copying the required WAL?

--
With Regards,
Amit Kapila.
