Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Brian Ghidinelli <brian(at)vfive(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`
Date: 2016-02-19 16:15:48
Message-ID: 20160219161548.GA90757@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Brian Ghidinelli wrote:
>
> > On Feb 18, 2016, at 12:59, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:
> >
> > Okay, great. What would be most helpful is figuring out the pg_upgrade
> > history of this server; if you have copies of the cluster just before
> > the upgrade, to extract the "nextMultiXactId" value, that would be
> > useful.
>
> Unfortunately we removed the 9.3 data dir for space reasons… I may
> have backups from then so maybe I could spin up a docker container and
> restore it but would that tell us the same thing?

What kind of backups? If you mean pg_dump, then it's of no use. If you
have physical backups, then yes, it could be useful, assuming they are
not much older than the upgrade (one from immediately before the upgrade
would be best.)

> > How large is this table? We could try to scan it to look for the values
> > that are causing the problem, and set oldestMxact to one that would not
> > cause a problem.
>
> Database size is 34gb. This particular table is only 105MB. If you
> account for all of the relations and indices it’s 239MB. There is one
> table in the system which is 17GB that stores email campaigns and
> deliveries.

Email campaigns! I'm always happy to receive those.

> Everything else is all pretty small-ish at 2.5gb or under.
>
> How do you query for oldestMxact?

What I was thinking is scanning the table using pageinspect
(get_raw_page + heap_page_items), grab the xmax value of the tuples that
have the HEAP_XMAX_IS_MULTI bit set, looking for the informask
combination that indicate's prior-to-pg_upgrade.

The combination to find is that HEAP_XMAX_LOCK_ONLY (0x0080) must be
set, and both HEAP_XMAX_EXCL_LOCK (0x0040) and HEAP_XMAX_KEYSHR_LOCK
(0x0010) must be cleared. (See for instance the comments inside
MultiXactIdGetUpdateXid in src/backend/access/heap/heapam.c, where this
particular bit pattern is tested.)

This might help:
https://www.commandprompt.com/blogs/alvaro_herrera/2011/11/decoding_infomasks/

Add one to the newest multixact value you find with that combination of
bits, then use that as the new cluster's "oldestMultiXact" to
pg_resetxlog (make sure to shut down cleanly!) as in the previous post
you found.

In retrospect, it sounds like we should have made GetMultiXactIdMembers
receive the infomask too and check for the pre-upgrade combination of
bits. It would have saved some headaches. But I'm not sure it's worth
doing now.

> > How large is the cluster? For experimentation, it would be very useful
> > if you could take a copy of it, on a server where you could recompile
> > with debugging symbols.
>
> Is there a Docker container by chance that has symbols enabled? That
> would make standing up a test environment a lot easier. Our production
> infrastructure is not yet inside Docker but we run it in dev and it’s
> easy to spin up and throw away.

There may be one, but then I wouldn't know where to find it. I bet you
can modify one --- it's just a "--enable-debug" option to Postgres'
configure.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Alvaro Herrera 2016-02-19 16:26:10 Re: BUG #13974: temp_file_limit effects vacuum
Previous Message Amit Kapila 2016-02-19 13:33:34 Re: Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby