datfrozenxid not dropping after vacuum

From: Matthew Tice <mjtice(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: datfrozenxid not dropping after vacuum
Date: 2021-09-01 20:34:48
Message-ID: CA+taBv9cXJkHUxeDC_rvd9BwB70kTCqxN8tcwq=zft9XV_--cw@mail.gmail.com

Hi,

Starting this morning at 0830 local time I noticed that my
datfrozenxid started moving past the `autovacuum_freeze_max_age` value
of 200000000. When we've encountered this in the past, the fix has
been one of the following:

1. This is related to an error similar to
```
found xmin 2675436435 from before relfrozenxid 321165377
```
where the solution has been to move the `pg_internal.init` file out of
the way and let PostgreSQL recreate it. Or,

2. A long-running transaction. Typically I'll just find the `idle in
transaction` sessions whose `query_start` is around the time my
alarm went off notifying me that `datfrozenxid` had breached
`autovacuum_freeze_max_age`, using a query similar to
```
SELECT pid, query_start, datname, usename, state, backend_xmin,
age(backend_xmin)
FROM pg_stat_activity
WHERE state = 'idle in transaction';
```

3. The autovacuum process seemed to be "stuck" on a particular table,
in which case we would kill the pid of that autovacuum worker (see the
sketch just after this list).
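
For reference, a rough sketch of how I go about finding (and, if needed,
killing) those workers, assuming Postgres 10's `backend_type` column in
`pg_stat_activity`:
```
-- list autovacuum workers and what they're currently doing
SELECT pid, query_start, query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';

-- then terminate the one that looks stuck, e.g.:
-- SELECT pg_terminate_backend(12345);
```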

The problem is that none of these solutions has dropped
`datfrozenxid` back down, and there is one specific database in this
cluster that's holding onto it.

Using these queries from CrunchyData:

# Show oldest current xid
# WITH max_age AS (
      SELECT 2000000000 AS max_old_xid
          , setting AS autovacuum_freeze_max_age
      FROM pg_catalog.pg_settings
      WHERE name = 'autovacuum_freeze_max_age' )
  , per_database_stats AS (
      SELECT datname
          , m.max_old_xid::int
          , m.autovacuum_freeze_max_age::int
          , age(d.datfrozenxid) AS oldest_current_xid
      FROM pg_catalog.pg_database d
      JOIN max_age m ON (true)
      WHERE d.datallowconn )
  SELECT max(oldest_current_xid) AS oldest_current_xid
      , max(ROUND(100*(oldest_current_xid/max_old_xid::float))) AS percent_towards_wraparound
      , max(ROUND(100*(oldest_current_xid/autovacuum_freeze_max_age::float))) AS percent_towards_emergency_autovac
  FROM per_database_stats;

 oldest_current_xid | percent_towards_wraparound | percent_towards_emergency_autovac
--------------------+----------------------------+-----------------------------------
          230935699 |                         12 |                               115

# Top 8 individual databases
SELECT datname
, age(datfrozenxid)
, current_setting('autovacuum_freeze_max_age')
FROM pg_database
ORDER BY 2 DESC LIMIT 8;

          datname          |    age    | current_setting
---------------------------+-----------+-----------------
 siteservice               | 230935699 | 200000000
 coupon-ws                 | 217613246 | 200000000
 contextchangestore        | 211343280 | 200000000
 template0                 | 210351639 | 200000000
 productmanager            | 207876167 | 200000000
 rhistory                  | 207876167 | 200000000
 smsservice                | 207876167 | 200000000
 techservice               | 205432524 | 200000000

That top database, `siteservice`, is the "problematic" one where a
manual vacuum won't bring its `datfrozenxid` back down.
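
(By "manual vacuum" I mean an ordinary verbose vacuum run while connected
to siteservice, along the lines of the following; the exact invocation is
approximate.)

# VACUUM (VERBOSE, ANALYZE);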

Looking at the tables in that database:

# SELECT c.oid::regclass
, age(c.relfrozenxid)
, pg_size_pretty(pg_total_relation_size(c.oid))
FROM pg_class c
JOIN pg_namespace n on c.relnamespace = n.oid
WHERE relkind IN ('r', 't', 'm')
AND n.nspname NOT IN ('pg_toast')
ORDER BY 2 DESC LIMIT 10;

        oid         |    age    | pg_size_pretty
--------------------+-----------+----------------
 pg_database        | 230935699 | 4264 kB
 pg_proc            |  93543215 | 976 kB
 pg_collation       |  93543215 | 560 kB
 pg_attribute       |  93543215 | 600 kB
 pg_shdepend        |  59515320 | 15 MB
 pg_statistic       |  53828900 | 464 kB
 pg_subscription    |  53172718 | 16 kB
 pg_pltemplate      |  53172718 | 56 kB
 pg_authid          |  53172718 | 8616 kB
 pg_db_role_setting |  53172718 | 64 kB
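
If I'm reading that right, it's pg_database itself (a shared catalog, at
age 230935699) that's pinning siteservice's datfrozenxid, since
datfrozenxid is just the minimum relfrozenxid across the database's
tables. So presumably the relation that actually needs freezing is that
one catalog, something like:

# VACUUM (FREEZE, VERBOSE) pg_catalog.pg_database;

(that targeted command is my guess at what's needed here, not something
I've confirmed).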

I thought maybe it had to do with my replication slots somehow:

# select slot_name, slot_type, database, active, catalog_xmin, restart_lsn, confirmed_flush_lsn
  from pg_replication_slots;

        slot_name         | slot_type |        database        | active | catalog_xmin |  restart_lsn  | confirmed_flush_lsn
--------------------------+-----------+------------------------+--------+--------------+---------------+---------------------
 dbs1db02                 | physical  |                        | f      |              |               |
 dbs1db01                 | physical  |                        | t      |              | 4D25/ACE6EE08 |
 dbs1db03                 | physical  |                        | t      |              | 4D25/ACE6EE08 |
 dbs2db01                 | physical  |                        | t      |              | 4D25/ACE6EE08 |
 debezium_cmanager        | logical   | campaign-manager       | t      |   2152258063 | 4D25/A421A6C8 | 4D25/ABC18C88
 debezium_rservice        | logical   | retail-content-service | t      |   2152238060 | 4D25/8EC403B0 | 4D25/A6105DF8
 debezium_partnerservice  | logical   | partnerservice         | t      |   2152238060 | 4D25/8EC403B0 | 4D25/A5446630
 dbs1_dr                  | physical  |                        | f      |              |               |
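
In case one of those slots is what's holding the xmin horizon back, a
quick ad hoc check of how old their catalog_xmin values are:

# SELECT slot_name, slot_type, active, age(catalog_xmin) AS catalog_xmin_age
  FROM pg_replication_slots
  WHERE catalog_xmin IS NOT NULL
  ORDER BY 4 DESC;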

So, depending on the slot, I either restarted the physical standby or
restarted the logical connection (Debezium -> Kafka).

I'm also working on a:
# vacuumdb --all --jobs=10 --echo --analyze

But this is still running (and honestly I'm not really hopeful it's
going to do the trick).

My primary, read/write database is PostgreSQL 10.4 (CentOS 7), while my
standby databases have been patched to 10.17.

Any thoughts on this?

Thanks,
Matt
