Re: VACUUM unable to accomplish because of a non-existent MultiXactId

From: Kouber Saparev <kouber(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: VACUUM unable to accomplish because of a non-existent MultiXactId
Date: 2015-11-28 10:52:58
Message-ID: 99572764-62FE-4C23-8667-53B648C8DFD4@gmail.com
Lists: pgsql-admin


> The state of your data is probably caused by some weird corner case of
> the upgrade. Can you see in the log files that the toast table has been
> failing vacuum since the upgrade, or is it more recent than that? (In
> other words, is there any working vacuum after the upgrade?)

We upgraded to 9.4.5 on 19 October, and there was a successful automatic vacuum of pg_toast_376621 just 3 days later - on 22 October:

Oct 22 08:16:49 db-master postgres[10589]: [3-1] []: LOG: automatic vacuum of table "db.pg_toast.pg_toast_376621": index scans: 1
Oct 22 08:16:49 db-master postgres[10589]: [3-2] pages: 0 removed, 784361 remain
Oct 22 08:16:49 db-master postgres[10589]: [3-3] tuples: 110 removed, 3768496 remain, 0 are dead but not yet removable
Oct 22 08:16:49 db-master postgres[10589]: [3-4] buffer usage: 37193 hits, 44891 misses, 32311 dirtied
Oct 22 08:16:49 db-master postgres[10589]: [3-5] avg read rate: 0.954 MB/s, avg write rate: 0.686 MB/s
Oct 22 08:16:49 db-master postgres[10589]: [3-6] system usage: CPU 1.10s/1.67u sec elapsed 367.73 sec

The next automatic vacuum attempt came 8 days later - on 30 October. It failed, and it has been failing ever since:

Oct 30 14:22:01 db-master postgres[16160]: [3-1] []: ERROR: MultiXactId 2915905228 does no longer exist -- apparent wraparound
Oct 30 14:22:01 db-master postgres[16160]: [3-2] []: CONTEXT: automatic vacuum of table "db.pg_toast.pg_toast_376621"

So I guess something happened between 22 and 30 October, and it is not related to the pg_upgrade we did on 19 October.
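
For completeness, this is the kind of check I am using to track the last successful autovacuum of the toast table (just a sketch, run via psql as a superuser against the same database):

    # last successful autovacuum of the toast table, per the statistics collector
    psql -d db -c "SELECT relname, last_autovacuum, autovacuum_count
                   FROM pg_stat_all_tables
                   WHERE relname = 'pg_toast_376621';"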

> It would be useful to debug this that you attached gdb to a backend, set
> breakpoint on errfinish, then run vacuum on that table. Then you can
> extract the page number from the backtrace. With the page number we can
> try pageinspect and heap_page_items until we find the culprit and
> perhaps identify how it got in that state.

I will try to obtain the page number, and will then send you the results, thank you.
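
For my own reference, here is roughly how I plan to do it (a sketch; 12345 stands for the PID of the backend that will run the VACUUM, and <blkno> for the page number to be extracted from the backtrace):

    # attach gdb to the backend and break on errfinish, as suggested
    gdb -p 12345
    (gdb) break errfinish
    (gdb) continue

    # in another session, trigger the failure
    psql -d db -c "VACUUM pg_toast.pg_toast_376621;"

    # when the breakpoint fires, look for the block number in the frames
    (gdb) bt

    # then inspect that page with pageinspect (superuser required)
    psql -d db -c "CREATE EXTENSION IF NOT EXISTS pageinspect;"
    psql -d db -c "SELECT lp, t_xmin, t_xmax, t_infomask, t_infomask2
                   FROM heap_page_items(get_raw_page('pg_toast.pg_toast_376621', <blkno>));"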

Could we somehow do this on one of our replicas (after detaching it), i.e. is the corrupted record propagated through the replication channel? That way we could fix the table on the master in the meantime.
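
(What I have in mind is roughly the following; a sketch, where /srv/pgsql/9.4/data stands for the standby's data directory and is just an example path:)

    # promote the standby so it leaves recovery and accepts writes
    pg_ctl -D /srv/pgsql/9.4/data promote

    # then try the failing vacuum there to see whether the corruption was replicated
    psql -d db -c "VACUUM pg_toast.pg_toast_376621;"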

Thanks!


Kouber Saparev
