From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619
Date: 2021-05-15 22:21:13
Message-ID: 2472976.1621117273@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <hlinnaka(at)iki(dot)fi> writes:
> After my commit c532d15ddd to split up copy.c, buildfarm animal "prion"
> failed in pg_upgrade
> (https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-11-23%2009%3A23%3A16)

prion's continued to fail, rarely, in this same way:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-15%2004%3A08%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-05-12%2018%3A43%3A14

The failures are remarkably identical, and they also look a lot like
field reports we've been seeing off and on for years. I do not know
why it always seems to be pg_toast_2619 (i.e. pg_statistic) that's
affected, but the pattern is pretty undeniable by now.

What I do have that's new is that *I can reproduce it*, at long last.
For me, installing the attached patch and running pg_upgrade's
"make check" fails, pretty much every time, with symptoms identical
to prion's.

The patch consists of
(1) 100ms delay just before detoasting, when loading a pg_statistic
catcache entry that has toasted datums
(2) provisions to mark such catcache entries dead immediately
(effectively, CATCACHE_FORCE_RELEASE for only these tuples);
this is to force us through (1) as often as possible
(3) some quick-hack debugging aids added to HeapTupleSatisfiesToast,
plus convert the missing-chunk error to PANIC to get a stack
trace.
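
To give a flavor of that without opening the attachment, here is a rough
sketch of point (1) -- not the actual diff; the placement (catcache.c's
tuple-building code) and variable names ("cache", "ntp") are assumptions
made purely for illustration:

    /*
     * Illustrative sketch only -- the real change is in the attached patch.
     * Point (1): just before detoasting a pg_statistic tuple that is being
     * loaded into the catcache, sleep 100ms so that a concurrent autovacuum
     * of pg_toast_2619 gets a wide window to remove the toast chunks.
     */
    if (cache->cc_reloid == StatisticRelationId &&
        HeapTupleHasExternal(ntp))
        pg_usleep(100000L);     /* 100ms; adjust if it doesn't reproduce */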

If it doesn't reproduce for you, try adjusting the delay. 100ms
was the first value I tried, though, so I think it's probably
not too sensitive.

The trace I'm getting shows pretty positively that autovacuum
has fired on pg_statistic, and removed the needed toast entries,
just before the failure. So something is wrong with our logic
about when toast entries can be removed.

I do not have a lot of idea why, but I see something that is
probably a honkin' big clue:

2021-05-15 17:28:05.965 EDT [2469444] LOG: HeapTupleSatisfiesToast: xmin 2 t_infomask 0x0b02

That is, the toast tuples in question are not just frozen, but
actually have xmin = FrozenTransactionId.
I do not think that is normal --- at least, it's not the state
immediately after initdb, and I can't make it happen with
"vacuum freeze pg_statistic". A plausible theory is that pg_upgrade
caused this to happen (but how?) and then vacuuming of toast rows
goes off the rails because of it.
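
For reference, the LOG line above could come from a quick hack along
these lines near the top of HeapTupleSatisfiesToast() -- again just a
sketch of the idea in (3), not necessarily what the attached patch does:

    /* Sketch of the debugging aid; htup is the function's argument. */
    HeapTupleHeader tuple = htup->t_data;

    elog(LOG, "HeapTupleSatisfiesToast: xmin %u t_infomask 0x%04x",
         HeapTupleHeaderGetRawXmin(tuple),  /* 2 == FrozenTransactionId */
         tuple->t_infomask);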

Anyway, I have no more time to poke at this right now, so I'm
posting the reproducer in case someone else wants to look at it.

regards, tom lane

Attachment: trigger-missing-chunk-number.patch (text/x-diff, 3.5 KB)