Long running query causing XID limit breach

From: sud <suds1434(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Long running query causing XID limit breach
Date: 2024-05-22 21:16:31
Message-ID: CAD=mzVXR3GjM0vcthMBwEdbOKqSKcv8oojSS9coczWRi9BRYTA@mail.gmail.com
Lists: pgsql-general

Hello,
It's RDS PostgreSQL version 15.4. We suddenly saw
"MaximumUsedTransactionIDs" reach ~1.5 billion and were alerted by team
members, who mentioned that the database will shut down/hang if this value
reaches ~2 billion and won't be able to serve any incoming transactions. It
was a panic situation.

I had heard of this before: because of the way Postgres works, the XID is a
32-bit integer and can therefore only represent (2^32)/2 = ~2 billion
transactions. However, as RDS performs autovacuum, we thought we need not
worry about this issue. But it seems we were wrong. We found one ad hoc
"SELECT ..." query that had been running on the reader instance for the
last couple of days; when it was killed, the max XID
(MaximumUsedTransactionIDs) dropped to ~50 million immediately.
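
For reference, I assume a session like that could have been spotted with a
query along these lines (the one-hour cutoff and the 80-character
truncation are arbitrary choices on my part, not anything RDS-specific):

    -- Run on each instance (including readers) to list long-running queries.
    SELECT pid,
           now() - query_start AS runtime,
           state,
           left(query, 80) AS query_text
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND now() - query_start > interval '1 hour'
    ORDER BY runtime DESC;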

So I have a few questions:

1) This system is going to be a 24/7 system that will process ~500 million
business transactions/day in the future, i.e. ~4-5 billion rows/day
inserted across multiple tables. And as I understand it, each row will have
an XID allocated. So in that case, does it mean that we will need
(5 billion / 24) = ~200 million XIDs/hour, and thus, if any such legitimate
application "SELECT" query keeps running for ~10 hours (and thereby keeps
the historical XIDs alive), it can saturate "MaximumUsedTransactionIDs" and
bring the database to a standstill in 2 billion / 200 million = ~10 hours?
Is this understanding correct? It seems we are prone to hit this limit
sooner going forward.
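
To watch how close we are getting, I believe the usual approach is to
check each database's XID age from the catalogs, something like:

    -- age(datfrozenxid) is roughly the number of XIDs consumed since the
    -- database's oldest unfrozen XID; well below 2 billion is where alarms
    -- should fire.
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;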

2) We have some legitimate cases where reporting queries can run for 5-6
hours. So in such cases, if this SELECT query starts at the 100th XID on
table TAB1, then whatever transactions happen after that time, across all
other tables (table2, table3, etc.) in the database, won't get vacuumed
until that SELECT query completes (as the database will try to keep that
same 100th-XID image), and the XID will just keep incrementing for new
transactions, eventually reaching the max limit. Is my understanding
correct here?
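
If it helps, I assume the sessions pinning the oldest snapshot can be
identified from pg_stat_activity, along these lines:

    -- backend_xmin marks the oldest XID a session can still see; vacuum
    -- cannot clean up anything newer than the oldest backend_xmin.
    SELECT pid, usename, state,
           backend_xmin, age(backend_xmin) AS xmin_age,
           left(query, 80) AS query_text
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY age(backend_xmin) DESC;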

3) Although RDS runs autovacuum by default, should we also consider doing a
manual vacuum without impacting ongoing transactions? Something like the
options below (see the monitoring sketch after this list):
vacuum freeze tab1;
vacuum freeze;
vacuum;
vacuum analyze tab1;
vacuum tab1;
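
To pick candidate tables for a targeted manual vacuum, I assume the
per-table frozen-XID age can be checked like this ("tab1" above is just a
placeholder table name):

    -- Tables with the oldest relfrozenxid are the most urgent freeze
    -- candidates.
    SELECT relname, age(relfrozenxid) AS frozen_xid_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY frozen_xid_age DESC
    LIMIT 10;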

4) I have worked in the past with Oracle databases, where the similar
transaction identifier is called the "system change number", but I never
encountered it being exhausted. Oracle also has UNDO records, and if a
SELECT query needed anything beyond a certain limit (set by the
undo_retention parameter), the SELECT query would fail with a "snapshot too
old" error without impacting any write transactions. But in Postgres it
seems nothing like that happens, and every SELECT query will try to run to
completion without any such failure, until it gets killed by someone. Is my
understanding correct?

And in that case, it seems we have to mandatorily set "statement_timeout"
to some value, e.g. 4 hours (also, I am not seeing a way to set it for any
specific user, so it will be set for all queries, including
application-level ones), and also set "idle_in_transaction_session_timeout"
to 5 minutes, even on all the prod and non-prod databases, to restrict
long-running transactions/queries and avoid such issues in the future.
Correct me if I'm wrong.
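
Something like the below is what I have in mind (the database name is just
a placeholder; and I may be wrong that there is no per-user option, since
ALTER ROLE ... SET appears to allow it):

    -- New sessions in this database pick up these defaults on connect.
    ALTER DATABASE proddb SET statement_timeout = '4h';
    ALTER DATABASE proddb SET idle_in_transaction_session_timeout = '5min';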

Regards
Sud
