Re: Long running query causing XID limit breach

From: Muhammad Salahuddin Manzoor <salahuddin(dot)m(at)bitnine(dot)net>
To: sud <suds1434(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Long running query causing XID limit breach
Date: 2024-05-23 06:51:28
Message-ID: CAKD7CDk6yU0mQFKNhh8A0i2CCwWDRraWOQmJjk9xCWoghM0xaA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Greetings,

Yes, monitoring and alerting for VACUUM operations are crucial.

Track VACUUM Duration and Success:

SELECT pid, state, query_start, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE query LIKE 'VACUUM%'
ORDER BY duration DESC;

Check Autovacuum Activity:

SELECT relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count
FROM pg_stat_all_tables
WHERE relname = 'your_table_name';

Log and Monitor VACUUM Failures:

log_autovacuum_min_duration = 0 # Log all autovacuum actions
log_min_messages = 'WARNING' # Ensure warnings and above are logged

Use tools like pgBadger to analyze PostgreSQL logs and identify any issues
with autovacuum operations.

Set Up Alerts for Long-Running VACUUMs:

Use monitoring tools such as pgMonitor, Nagios, Prometheus with Grafana, or
New Relic to set up alerts for long-running VACUUM processes.

Yes, your understanding is correct. In a high-transaction environment like
yours, long-running transactions, including legitimate reporting queries,
can hold back the advancement of the transaction ID (XID) horizon. This can
prevent VACUUM from properly cleaning up old XIDs, leading to the risk of
XID wraparound and potential system failure.

Use some Mitigation Strategies to handle long running quires like Set
Transaction Timeouts, Monitor and Kill Long-Running Queries, Optimize Query
Performance, Schedule heavy reporting queries during periods of lower
transactional activity.

*Salahuddin (살라후딘*
*)*

On Thu, 23 May 2024 at 11:25, sud <suds1434(at)gmail(dot)com> wrote:

> Also,if i am getting it correct, it means we should not run any
> transaction (even if it's legitimate one like for e.g. a big Reporting
> "SELECT" query) beyond 10hrs, as that will end up consuming 10*200million
> XID per hour= 2billion XID limit saturation and thus causing system
> failure. Hope my understanding is correct here.
>
> On Thu, May 23, 2024 at 11:41 AM sud <suds1434(at)gmail(dot)com> wrote:
>
>>
>> On Thu, May 23, 2024 at 10:42 AM Muhammad Salahuddin Manzoor <
>> salahuddin(dot)m(at)bitnine(dot)net> wrote:
>>
>>> Greetings,
>>>
>>> Running `VACUUM table_name;` on a partitioned table will vacuum each
>>> partition individually, not the whole table as a single unit.
>>>
>>> Yes, running `VACUUM table_name;` frequently on tables or partitions
>>> with heavy DML is recommended.
>>>
>>> Regular `VACUUM` does not lock the table for reads or writes, so it
>>> won't disrupt ongoing 24/7 data operations.
>>>
>>> "optimize autovacuum"
>>> Yes. Adjust following parameters as per your system/environment
>>> requirement
>>> autovacuum_max_workers,
>>> autovacuum_freeze_max_age ,
>>> autovacuum_vacuum_cost_delay
>>>
>>> Following need to be first tested thoroughly in a test environment.
>>> Recommended Alert Threshold
>>> Alert at 50% Usage: Set the alert threshold at 1 billion used XIDs. This
>>> provides a significant buffer, giving you ample time to take corrective
>>> action before reaching the critical limit.
>>>
>>> Calculation Rationale
>>> Daily XID Usage: Approximately 4 billion rows per day implies high XID
>>> consumption.
>>> Buffer Time: At 1 billion XIDs, you would still have 1 billion XIDs
>>> remaining, giving you roughly 12 hours to address the issue if your system
>>> consumes 200 million XIDs per hour.
>>>
>>>
>>>
>> Thank you so much. That helps.
>> So apart from setting these alerts on "Maximumusedtxnids" and making the
>> vacuum optimized by tweaking above parameters, should we also need to have
>> monitoring in place to ensure the Vacuum is not taking longer as compared
>> to its normal runtime and also if it's getting blocked/failed by something?
>> Like for example in our case where the select query was running longer , so
>> the vacuum must not be able to succeed every time it attempts, so is it
>> really worth having that level of alerting? and also how can we get an
>> idea regarding if the vacuum is not succeeding or getting failed etc to
>> avoid such upcoming issues?
>>
>>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Peter J. Holzer 2024-05-23 07:06:49 expected authentication request from server, but received H
Previous Message sud 2024-05-23 06:25:37 Re: Long running query causing XID limit breach