Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?

From: Arjun Ranade <ranade(at)nodalexchange(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?
Date: 2017-11-01 02:37:28
Message-ID: CANrrCRxcFdXQvae7vrtWttHtXbVmB8yoYEt+2DhfADqm1fvb-g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

You might be onto something. I force quit the pglogical supervisor and
other session that had a backend_start of a few months ago. That didn't
fix the issue. Additionally, those processes didn't start back up which is
making me worry about the status of my logical replication. The
subscribers are reporting that they are still replicating but I won't
know.

However... when I was checking on the subscribers, I dropped their
replication links and recreated them (to see if that would get the
pglogical processes to start back up). I checked the pg_log on the master
(same server having the vacuum problem) to see if there was anything about
the subscriptions resuming and I found these messages in the log:

2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517
sessionid=59f92ded.6f1d WARNING: oldest xmin is far in the past
2017-10-31 22:16:42.658 EDT pid=28445 database= user= rhost= tid=3/2970517
sessionid=59f92ded.6f1d HINT: Close open transactions soon to avoid
wraparound problems.
2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521
sessionid=59f92e8d.711b WARNING: oldest xmin is far in the past
2017-10-31 22:16:45.815 EDT pid=28955 database= user= rhost= tid=3/2970521
sessionid=59f92e8d.711b HINT: Close open transactions soon to avoid
wraparound problems

However, I am now in a situation where I cannot find any old transactions
from pg_stat_activity or pg_prepared_xacts.

It may also be worth noting that the last time our db was restarted (~6
months ago, same date as the old processes that I killed) was due to the
datafile partition filling up.

FWIW, here are my vacuum related settings on this db:

name setting
autovacuum on
autovacuum_analyze_scale_factor 0.1
autovacuum_analyze_threshold 50
autovacuum_freeze_max_age 200000000
autovacuum_max_workers 3
autovacuum_multixact_freeze_max_age 400000000
autovacuum_naptime 60
autovacuum_vacuum_cost_delay 20
autovacuum_vacuum_cost_limit -1
autovacuum_vacuum_scale_factor 0.2
autovacuum_vacuum_threshold 50
autovacuum_work_mem -1
log_autovacuum_min_duration -1
vacuum_cost_delay 0
vacuum_cost_limit 200
vacuum_cost_page_dirty 20
vacuum_cost_page_hit 1
vacuum_cost_page_miss 10
vacuum_defer_cleanup_age 0
vacuum_freeze_min_age 50000000
vacuum_freeze_table_age 150000000
vacuum_multixact_freeze_min_age 5000000
vacuum_multixact_freeze_table_age 150000000

On Tue, Oct 31, 2017 at 6:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Arjun Ranade <ranade(at)nodalexchange(dot)com> writes:
> > There is nothing in pg_prepared_xacts, however, in pg_stat_activity there
> > are two pglogical processes that have a "backend_start" of "2017-06-17"
> > when the last time we restarted this server. Both are not waiting and
> have
> > null for "state." This might be expected behavior for pglogical.
>
> I don't know much about pglogical, but if it's holding back the xmin
> horizon (as it appears to be doing) that is a really bad pglogical bug.
>
> If you try vacuum verbose (doesn't have to be FULL) on some table that
> gets lots of update traffic, does it report a large number of dead-but-
> not-removable rows?
>
> > I did try TRUNCATE before but even as the postgres user, but for some
> > reason it didn't allow me to truncate because it was a system table.
>
> Mmm. You could get around that, but possibly not without restarting
> in single-user mode, which is probably more interruption in service
> than you want. It might be easier to force-quit the pglogical sessions.
>
> regards, tom lane
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Ashok Kumar Tiwari 2017-11-01 05:29:20 ream and processor required
Previous Message Tom Lane 2017-10-31 22:12:39 Re: Missing Chunk Error when doing a VACUUM FULL operation - DB Corruption?