Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade <ranade(at)nodalexchange(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?
Date: 2017-11-02 01:49:20
Message-ID: CANrrCRy1_mu7A31Tas+-oKWcYK4kpNK8wTFW7VWVq+5yj2uApQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

So we had a maintenance window scheduled today and I restarted the DB.
When it came back up, it reported the same toast error when vacuum'ing
pg_statistic. It also reported the "oldest xmin is far in the past" error.

I even brought the DB into single user mode and tried to manually vacuum
pg_statistic but got the same error. I also tried to TRUNCATE pg_statistic
in single user mode which failed (see below).

At this point, I am not sure what to make of the issue. The DB is still
complaining about old XMIN even when there are no users connected to it
after a fresh restart:

2017-11-01 21:45:26.838 EDT pid=14654 database= user= rhost= tid=11/39
sessionid=59fa7847.393e WARNING: oldest xmin is far in the past
2017-11-01 21:45:26.838 EDT pid=14654 database= user= rhost= tid=11/39
sessionid=59fa7847.393e HINT: Close open transactions soon to avoid
wraparound problems.

If we cannot understand what is happening, we will likely restore from a
backup, rebuild all of our standbys/logical replicas and start fresh. It
would be too costly for the db to shutdown during a workday due to the
wraparound issue.

Any insight you can provide would be appreciated.

Thanks,
AR

Relevant log output from single user mode:

$ /usr/pgsql-9.4/bin/postmaster --single -D /var/lib/pgsql/9.4/data prod

PostgreSQL stand-alone backend 9.4.14
backend> vacuum full pg_statistic;
2017-11-01 21:35:38.781 EDT pid=12732 database= user= rhost= tid=1/3
sessionid=59fa765f.31bc ERROR: missing chunk number 0 for toast value
30382746 in pg_toast_2619
2017-11-01 21:35:38.783 EDT pid=12732 database= user= rhost= tid=1/3
sessionid=59fa765f.31bc STATEMENT: vacuum full pg_statistic;

backend> reindex table pg_statistic;
backend> vacuum full pg_statistic;
2017-11-01 21:36:43.602 EDT pid=12732 database= user= rhost= tid=1/7
sessionid=59fa765f.31bc ERROR: missing chunk number 0 for toast value
30382746 in pg_toast_2619
2017-11-01 21:36:43.602 EDT pid=12732 database= user= rhost= tid=1/7
sessionid=59fa765f.31bc STATEMENT: vacuum full pg_statistic;

backend> truncate table pg_statistic;
2017-11-01 21:37:44.867 EDT pid=12732 database= user= rhost= tid=1/14
sessionid=59fa765f.31bc ERROR: permission denied: "pg_statistic" is a
system catalog
2017-11-01 21:37:44.867 EDT pid=12732 database= user= rhost= tid=1/14
sessionid=59fa765f.31bc STATEMENT: truncate table pg_statistic;

On Tue, Oct 31, 2017 at 10:37 PM, Arjun Ranade <ranade(at)nodalexchange(dot)com>
wrote:

> You might be onto something. I force quit the pglogical supervisor and
> other session that had a backend_start of a few months ago. That didn't
> fix the issue. Additionally, those processes didn't start back up which is
> making me worry about the status of my logical replication. The
> subscribers are reporting that they are still replicating but I won't
> know.
>
> However... when I was checking on the subscribers, I dropped their
> replication links and recreated them (to see if that would get the
> pglogical processes to start back up). I checked the pg_log on the master
> (same server having the vacuum problem) to see if there was anything about
> the subscriptions resuming and I found these messages in the log:
>
> 2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517
> sessionid=59f92ded.6f1d WARNING: oldest xmin is far in the past
> 2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517
> sessionid=59f92ded.6f1d HINT: Close open transactions soon to avoid
> wraparound problems.
> 2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521
> sessionid=59f92e8d.711b WARNING: oldest xmin is far in the past
> 2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521
> sessionid=59f92e8d.711b HINT: Close open transactions soon to avoid
> wraparound problems
>
> However, I am now in a situation where I cannot find any old transactions
> from pg_stat_activity or pg_prepared_xacts.
>
> It may also be worth noting that the last time our db was restarted (~6
> months ago, same date as the old processes that I killed) was due to the
> datafile partition filling up.
>
> FWIW, here are my vacuum related settings on this db:
>
> name setting
> autovacuum on
> autovacuum_analyze_scale_factor 0.1
> autovacuum_analyze_threshold 50
> autovacuum_freeze_max_age 200000000
> autovacuum_max_workers 3
> autovacuum_multixact_freeze_max_age 400000000
> autovacuum_naptime 60
> autovacuum_vacuum_cost_delay 20
> autovacuum_vacuum_cost_limit -1
> autovacuum_vacuum_scale_factor 0.2
> autovacuum_vacuum_threshold 50
> autovacuum_work_mem -1
> log_autovacuum_min_duration -1
> vacuum_cost_delay 0
> vacuum_cost_limit 200
> vacuum_cost_page_dirty 20
> vacuum_cost_page_hit 1
> vacuum_cost_page_miss 10
> vacuum_defer_cleanup_age 0
> vacuum_freeze_min_age 50000000
> vacuum_freeze_table_age 150000000
> vacuum_multixact_freeze_min_age 5000000
> vacuum_multixact_freeze_table_age 150000000
>
>
> On Tue, Oct 31, 2017 at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Arjun Ranade <ranade(at)nodalexchange(dot)com> writes:
>> > There is nothing in pg_prepared_xacts, however, in pg_stat_activity
>> there
>> > are two pglogical processes that have a "backend_start" of "2017-06-17"
>> > when the last time we restarted this server. Both are not waiting and
>> have
>> > null for "state." This might be expected behavior for pglogical.
>>
>> I don't know much about pglogical, but if it's holding back the xmin
>> horizon (as it appears to be doing) that is a really bad pglogical bug.
>>
>> If you try vacuum verbose (doesn't have to be FULL) on some table that
>> gets lots of update traffic, does it report a large number of dead-but-
>> not-removable rows?
>>
>> > I did try TRUNCATE before but even as the postgres user, but for some
>> > reason it didn't allow me to truncate because it was a system table.
>>
>> Mmm. You could get around that, but possibly not without restarting
>> in single-user mode, which is probably more interruption in service
>> than you want. It might be easier to force-quit the pglogical sessions.
>>
>> regards, tom lane
>>
>
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Arjun Ranade 2017-11-02 01:56:48 Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?
Previous Message Stephen Frost 2017-11-01 22:18:07 Re: Bad recovery: no pg_xlog/RECOVERYXLOG