Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade <ranade(at)nodalexchange(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?
Date: 2017-11-02 14:44:37
Message-ID: CANrrCRxzFYtJLmZSkpUVFh_2ZUcGnsO2nh3Cg78t2OLQEwfzGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi Tom,

I am positive that there are no prepared transactions, certainly none that
would persist past a DB shutdown and the single user mode work I did.

However, your point about the replication slots seems to be valid. Also, a
leaking replication slot seems more likely since it was a replication slot
issue with pglogical that caused the DB to crash before it was restarted a
few months ago.

Below are the items I got from querying the replication view. It seems
that you are correct,
https://www.postgresql.org/docs/9.4/static/catalog-pg-replication-slots.html
says that 'catalog_xmin' is "The oldest transaction affecting the system
catalogs that this slot needs the database to retain. VACUUM cannot remove
catalog tuples deleted by any later transaction."

db1=# select * from pg_replication_slots;
slot_name | plugin |
slot_type | datoid | database | active | xmin | catalog_xmin |
restart_lsn
-----------------------------------------------+------------------+-----------+--------+----------+--------+-----------+--------------+--------------
repmgr_slot_3 | |
physical | | | t | 252861646 | |
157/E1755E90
repmgr_slot_4 | |
physical | | | t | 324101780 | *162630233* |
157/E1755E90
repmgr_slot_2 | |
physical | | | t | | |
157/E1755E90
pgl_tcs_dadfbc8d5_tcspg01_large_tab_ch356c67c | pglogical_output |
logical | 21705 | prod | t | | 324101794 |
157/E16D62D8
pgl_tcs_dadfbc8d5_tcspg01_all_tables | pglogical_output |
logical | 21705 | prod | t | | 324101794 |
157/E16D62D8
(5 rows)

db1=# SELECT datname, datfrozenxid FROM pg_database;
datname | datfrozenxid
-----------+--------------
template1 | *162630233*
template0 | *162630233*
repmgr_db | 138378717
postgres | 138378717
rundeck | 138378717
db1 | *162630233*
(6 rows)

On Wed, Nov 1, 2017 at 10:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Arjun Ranade <ranade(at)nodalexchange(dot)com> writes:
> > So we had a maintenance window scheduled today and I restarted the DB.
> > When it came back up, it reported the same toast error when vacuum'ing
> > pg_statistic. It also reported the "oldest xmin is far in the past"
> error.
>
> > I even brought the DB into single user mode and tried to manually vacuum
> > pg_statistic but got the same error. I also tried to TRUNCATE
> pg_statistic
> > in single user mode which failed (see below).
>
> > At this point, I am not sure what to make of the issue. The DB is still
> > complaining about old XMIN even when there are no users connected to it
> > after a fresh restart:
>
> Clearly, something is holding back the xmin horizon, and if the behavior
> persisted past a DB restart then it surely isn't a regular transaction.
>
> Are you *really* sure there are no prepared transactions?
>
> It now occurs to me that there's one other mechanism that could hold back
> the xmin horizon, and that is replication slots. Looking into the
> pg_replication_slots view should show you what's up there. I'm not
> an expert on replication slots, but I do know that they were new as of
> 9.4. I am thinking maybe your replication setup has "leaked" a slot
> that's not being used but is still preventing global xmin from advancing.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Arjun Ranade 2017-11-02 14:52:23 Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?
Previous Message Stephen Frost 2017-11-02 11:40:29 Re: Bad recovery: no pg_xlog/RECOVERYXLOG