Re: Long running query causing XID limit breach

From: sud <suds1434(at)gmail(dot)com>
To: Torsten Förtsch <tfoertsch123(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Long running query causing XID limit breach
Date: 2024-05-26 18:46:40
Message-ID: CAD=mzVUGzU3vEMp4AY17vG1xDkVjOCUhuXnLt3NnAB9jhSyfRA@mail.gmail.com
Lists: pgsql-general

On Sun, May 26, 2024 at 11:18 PM Torsten Förtsch <tfoertsch123(at)gmail(dot)com>
wrote:

> Each query on the replica has a backend_xmin. You can see that in
> pg_stat_activity. From that backend's perspective, tuples marked as deleted
> by any transaction greater than or equal to backend_xmin are still needed. This
> does not depend on the table.
>
> Now, vacuum writes to the WAL the point up to which it has vacuumed on the
> master. In pg_waldump this looks like so:
>
> PRUNE snapshotConflictHorizon: 774, nredirected: 0, ndead: 5, nunused: 0,
> redirected: [], dead: [2, 4, 6, 8, 10], unused: [], blkref #0: rel
> 1663/5/16430 blk 0
>
> That snapshotConflictHorizon is also a transaction id. If the backend_xmin
> of every backend running a transaction in the same database as the vacuum
> WAL record (the 5 in 1663/5/16430) is greater than vacuum's
> snapshotConflictHorizon, then there is no conflict. If any of the
> backend_xmins is less, then there is a conflict.
>
> This type of conflict is determined by just 2 numbers, the conflict
> horizon sent by the master in the WAL, and the minimum of all
> backend_xmins. For your case this means a long running transaction querying
> table t1 might have a backend_xmin of 223. On the master, update and delete
> operations happen on table t2. Since all the transactions on the master are
> fast, when vacuum hits t2, the minimum of all backend_xmins on the master
> might already be 425. So, garbage left over by all transactions up to 424
> can be cleaned up. Now that cleanup record reaches the replica. It checks
> whether 223 > 425, which is false. So there is a conflict. Now the replica can wait
> until its own horizon reaches 425 or it can kill all backends with a lower
> backend_xmin.
>
> As I understand, hot_standby_feedback does not work for you. Not sure if
> you can run the query on the master? That would resolve the issues but
> might generate the same bloat on the master as hot_standby_feedback.
> Another option I can see is to run long running queries on a dedicated
> replica with max_standby_streaming_delay set to infinity or something large
> enough. If you go that way, you could also fetch the WAL from your
> WAL archive instead of replicating from the master. That way the replica
> has absolutely no chance to affect the master.
>
>
Thank you so much.
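
For anyone following along, the backend_xmin you describe is visible per
backend in pg_stat_activity. A minimal check on the replica (illustrative
only; age() shows how far each snapshot's horizon trails the current
transaction id):

    SELECT pid, backend_xmin, age(backend_xmin) AS xmin_age, state
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY age(backend_xmin) DESC;

The row with the largest xmin_age is the backend holding back the horizon.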

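Likewise, a PRUNE record such as the one you showed can be found by running
plain pg_waldump over a WAL segment in the primary's pg_wal directory (the
segment name here is made up):

    pg_waldump 000000010000000000000042 | grep PRUNE
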
Would you agree that we should have two standbys: one with the default
max_standby_streaming_delay (say 10 sec), which will mainly be used for
high availability and thus will have minimal lag, and another standby with
max_standby_streaming_delay set to -1, i.e. it will wait indefinitely for
the SELECT queries to finish without caring about the lag, and which will
be used for the long running SELECT queries?
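
On that first standby, any queries that replay does cancel would still be
counted in pg_stat_database_conflicts, so the cost of the short delay stays
measurable. A quick check (the database name is a placeholder):

    SELECT datname, confl_snapshot, confl_lock, confl_bufferpin
    FROM pg_stat_database_conflicts
    WHERE datname = 'mydb';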

And keep hot_standby_feedback ON for the first standby, which is used for
high availability, and OFF for the second standby, which is used for the
long running SELECT queries. That way the primary won't hold back vacuum
of old transactions on account of feedback from this standby, which should
keep the transaction ID wraparound issue from happening because of the
read/SELECT queries on either standby.
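
Put together, the split might look roughly like this in each standby's
postgresql.conf (values are illustrative, not a recommendation):

    # Standby 1: HA, stays current, may cancel long running queries
    hot_standby_feedback = on
    max_standby_streaming_delay = 10s

    # Standby 2: reporting, never cancels queries, may lag
    hot_standby_feedback = off
    max_standby_streaming_delay = -1

And, per your suggestion, the second standby could restore WAL from the
archive instead of streaming from the master (the archive path is
hypothetical):

    restore_command = 'cp /path/to/walarchive/%f %p'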
