Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619
Date: 2021-05-15 22:21:13
Message-ID: 2472976.1621117273@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> After my commit c532d15ddd to split up copy.c, buildfarm animal "prion"
> failed in pg_upgrade
> (https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-11-23%2009%3A23%3A16)

prion's continued to fail, rarely, in this same way:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-15%2004%3A08%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-05-12%2018%3A43%3A14

The failures are remarkably identical, and they also look a lot like
field reports we've been seeing off and on for years. I do not know
why it always seems to be pg_toast_2619 (i.e. pg_statistic) that's
affected, but the pattern is pretty undeniable by now.

What I do have that's new is that *I can reproduce it*, at long last.
For me, installing the attached patch and running pg_upgrade's
"make check" fails, pretty much every time, with symptoms identical
to prion's.

The patch consists of

(1) 100ms delay just before detoasting, when loading a pg_statistic
catcache entry that has toasted datums

(2) provisions to mark such catcache entries dead immediately
(effectively, CATCACHE_FORCE_RELEASE for only these tuples);
this is to force us through (1) as often as possible

(3) some quick-hack debugging aids added to HeapTupleSatisfiesToast,
plus converting the missing-chunk error to PANIC to get a stack
trace.
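
For anyone who doesn't want to open the attachment, here is a rough
sketch of roughly where (1) and (3) go.  This is illustrative only,
not the patch itself; the function and field names are per the
current sources.

    /* (1) In CatalogCacheCreateEntry() (utils/cache/catcache.c):
     * sleep just before detoasting a pg_statistic entry, giving a
     * concurrent autovacuum a window to remove the toast chunks
     * out from under us. */
    if (HeapTupleHasExternal(ntp))
    {
        if (cache->cc_reloid == StatisticRelationId)
            pg_usleep(100 * 1000L);    /* 100ms */
        dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
    }

    /* (3) In HeapTupleSatisfiesToast() (access/heap/heapam_visibility.c):
     * log the raw xmin and infomask of each toast tuple examined. */
    elog(LOG, "HeapTupleSatisfiesToast: xmin %u t_infomask 0x%04x",
         HeapTupleHeaderGetRawXmin(tuple), tuple->t_infomask);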

If it doesn't reproduce for you, try adjusting the delay. 100ms
was the first value I tried, though, so I think it's probably
not too sensitive.

The trace I'm getting shows pretty positively that autovacuum
has fired on pg_statistic, and removed the needed toast entries,
just before the failure. So something is wrong with our logic
about when toast entries can be removed.

I do not have much idea why yet, but I see something that is
probably a honkin' big clue:

2021-05-15 17:28:05.965 EDT [2469444] LOG: HeapTupleSatisfiesToast: xmin 2 t_infomask 0x0b02

That is, the toast tuples in question are not just frozen, but
actually have xmin = FrozenTransactionId.

I do not think that is normal --- at least, it's not the state
immediately after initdb, and I can't make it happen with
"vacuum freeze pg_statistic". A plausible theory is that pg_upgrade
caused this to happen (but how?), and that vacuuming of toast rows
then goes off the rails because of it.
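
For reference, decoding that infomask word by hand (a standalone
sketch; the flag values are copied from
src/include/access/htup_details.h, and this is not part of the
attached patch):

    #include <stdio.h>

    #define HEAP_HASVARWIDTH    0x0002  /* has variable-width attribute(s) */
    #define HEAP_XMIN_COMMITTED 0x0100  /* t_xmin committed */
    #define HEAP_XMIN_INVALID   0x0200  /* t_xmin invalid/aborted */
    #define HEAP_XMIN_FROZEN    (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)
    #define HEAP_XMAX_INVALID   0x0800  /* t_xmax invalid/aborted */

    int
    main(void)
    {
        unsigned int infomask = 0x0b02; /* from the LOG line above */

        /* 0x0b02 = HASVARWIDTH | XMIN_COMMITTED | XMIN_INVALID | XMAX_INVALID */
        printf("HEAP_XMIN_FROZEN:  %s\n",
               (infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN ? "set" : "not set");
        printf("HEAP_XMAX_INVALID: %s\n",
               (infomask & HEAP_XMAX_INVALID) ? "set" : "not set");
        printf("HEAP_HASVARWIDTH:  %s\n",
               (infomask & HEAP_HASVARWIDTH) ? "set" : "not set");
        return 0;
    }

All three come out "set" for 0x0b02, i.e. the tuples carry the
HEAP_XMIN_FROZEN infomask bits *and* have had xmin physically
overwritten with FrozenTransactionId (2).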

Anyway, I have no more time to poke at this right now, so I'm
posting the reproducer in case someone else wants to look at it.

regards, tom lane

Attachment Content-Type Size
trigger-missing-chunk-number.patch text/x-diff 3.5 KB
