Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade <ranade(at)nodalexchange(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?
Date: 2017-11-02 14:52:23
Message-ID: CANrrCRwrXQaxJRCtw0i8Kn3wcV_9EA+np=oBmsSfvjH1ieisXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

After dropping the replication slot, VACUUM FULL runs fine now and no
longer reports the "oldest xmin is far in the past"

Thank you so much for your help Tom!

Arjun

On Thu, Nov 2, 2017 at 10:44 AM, Arjun Ranade <ranade(at)nodalexchange(dot)com>
wrote:

> Hi Tom,
>
> I am positive that there are no prepared transactions, certainly none that
> would persist past a DB shutdown and the single user mode work I did.
>
> However, your point about the replication slots seems to be valid. Also,
> a leaking replication slot seems more likely since it was a replication
> slot issue with pglogical that caused the DB to crash before it was
> restarted a few months ago.
>
> Below are the items I got from querying the replication view. It seems
> that you are correct, https://www.postgresql.org/
> docs/9.4/static/catalog-pg-replication-slots.html says that
> 'catalog_xmin' is "The oldest transaction affecting the system catalogs
> that this slot needs the database to retain. VACUUM cannot remove catalog
> tuples deleted by any later transaction."
>
> db1=# select * from pg_replication_slots;
> slot_name | plugin | slot_type | datoid | database | active | xmin | catalog_xmin | restart_lsn
> -----------------------------------------------+------------------+-----------+--------+----------+--------+-----------+--------------+--------------
> repmgr_slot_3 | | physical | | | t | 252861646 | | 157/E1755E90
> repmgr_slot_4 | | physical | | | t | 324101780 | *162630233* | 157/E1755E90
> repmgr_slot_2 | | physical | | | t | | | 157/E1755E90
> pgl_tcs_dadfbc8d5_tcspg01_large_tab_ch356c67c | pglogical_output | logical | 21705 | prod | t | | 324101794 | 157/E16D62D8
> pgl_tcs_dadfbc8d5_tcspg01_all_tables | pglogical_output | logical | 21705 | prod | t | | 324101794 | 157/E16D62D8
> (5 rows)
>
> db1=# SELECT datname, datfrozenxid FROM pg_database;
> datname | datfrozenxid
> -----------+--------------
> template1 | *162630233*
> template0 | *162630233*
> repmgr_db | 138378717
> postgres | 138378717
> rundeck | 138378717
> db1 | *162630233*
> (6 rows)
>
>
>
>
> On Wed, Nov 1, 2017 at 10:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Arjun Ranade <ranade(at)nodalexchange(dot)com> writes:
>> > So we had a maintenance window scheduled today and I restarted the DB.
>> > When it came back up, it reported the same toast error when vacuum'ing
>> > pg_statistic. It also reported the "oldest xmin is far in the past"
>> error.
>>
>> > I even brought the DB into single user mode and tried to manually vacuum
>> > pg_statistic but got the same error. I also tried to TRUNCATE
>> pg_statistic
>> > in single user mode which failed (see below).
>>
>> > At this point, I am not sure what to make of the issue. The DB is still
>> > complaining about old XMIN even when there are no users connected to it
>> > after a fresh restart:
>>
>> Clearly, something is holding back the xmin horizon, and if the behavior
>> persisted past a DB restart then it surely isn't a regular transaction.
>>
>> Are you *really* sure there are no prepared transactions?
>>
>> It now occurs to me that there's one other mechanism that could hold back
>> the xmin horizon, and that is replication slots. Looking into the
>> pg_replication_slots view should show you what's up there. I'm not
>> an expert on replication slots, but I do know that they were new as of
>> 9.4. I am thinking maybe your replication setup has "leaked" a slot
>> that's not being used but is still preventing global xmin from advancing.
>>
>> regards, tom lane
>>
>
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Tom Lane 2017-11-02 14:57:43 Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?
Previous Message Arjun Ranade 2017-11-02 14:44:37 Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?