Corrupt index stopping autovacuum system wide

From: Aaron Pelz <aaronepelz(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Corrupt index stopping autovacuum system wide
Date: 2019-07-17 15:53:58
Message-ID: CAHqUQyf7dm-ABtRoF+_xfSBhCjHgKnniQ9QZ17b0mFW4q2UQ3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello,

Over the weekend we noticed that our max transaction IDs have been
continuously increasing - got an alert it passed 1B - and we believe that
no autovacuums were running for a period of about a month by looking at
pg_stat_user_tables. We had not updated any autovac tuning parameters over
that time period and many tables were very much over the threshold for
needing an autovac.

When investigating I located the table with the oldest transaction ID with:

SELECT c.oid::regclass as table_name,
greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age,
pg_size_pretty(pg_table_size(c.oid)) as table_size,
c.relkind,
c.relpersistence
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind in ('r', 't','m')
order by age desc limit 40;

I vacuumed that table manually with `vacuum freeze verbose table_xx` and
got this error:

INFO: aggressively vacuuming "public.table_xx"
INFO: scanned index "table_xx_pkey" to remove 168443 row versions
DETAIL: CPU: user: 0.13 s, system: 0.00 s, elapsed: 0.13 s
INFO: scanned index "table_xx_col_id" to remove 168443 row versions
DETAIL: CPU: user: 0.16 s, system: 0.00 s, elapsed: 0.16 s
ERROR: failed to re-find parent key in index "table_xx_col_idx" for
deletion target page 217

I replaced the index (create a new concurrently, delete the old
concurrently), vacuumed the table, and immediately autovacs started across
the system and our XIDs started falling. To me it looks like a *single*
corrupt index held up autovacuums across our entire server, even other in
other databases on the same server. Am I interpreting this correctly? Would
love guidance on diagnosing this type of thing and strategies for
preventing it.

Thanks,
Aaron

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Peter Geoghegan 2019-07-17 16:21:06 Re: Corrupt index stopping autovacuum system wide
Previous Message Tom Lane 2019-07-17 15:09:29 Re: Removing a key from jsonb is sloooow