From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | daniel(at)citusdata(dot)com, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #15745: WAL References Invalid Pages...that eventually resolves |
Date: | 2019-04-28 03:28:04 |
Message-ID: | CAH2-Wzmrx1Je1=hfqpvz22s+nP2uvR9mqQKTQP5hPSxbok=B7w@mail.gmail.com |
Lists: | pgsql-bugs |
Hi Daniel,
On Tue, Apr 9, 2019 at 1:30 PM PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
> But, for serendipitous reasons, I let this one run for a while. As it turns
> out, with each crash, it would make *slightly* more progress than the time
> before....and then eventually, it suffered no more faults and caught up
> normally. Included is a log that shows how sparse these faults were,
> relative to all the traffic going on....: roughly two per segment on this
> workload, with large gaps between problematic segments, and not necessarily
> repetition in a problematic relation or filenode.
That sounds weird.
> The fact the standby eventually came up made me suspicious, so I ran amcheck
> with a heap re-check, and, no tuples were in violation.
>
> Included is a log, which shows how the system recovered over and over,
> making slight progress each time. This is the entire inventory after such
> crashes: after these, the system passed amcheck and appears to work
> normally.
Did you try bt_index_parent_check('rel', true)? You might want to make
sure that work_mem is set sufficiently high so that the
downlink-block-is-present check is definitely effective; work_mem
bounds the size of a Bloom filter used by the implementation (the heap
verification option has its own Bloom filter, bounded by
maintenance_work_mem). Suggest that you "set
client_min_messages=debug1" before running amcheck this way, just in
case that shows something interesting.
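For reference, the session I have in mind could be scripted roughly as follows. This is only a sketch: the helper name amcheck_session, the index name, and the work_mem value are placeholders made up for illustration. Only bt_index_parent_check(), work_mem, and client_min_messages come from the suggestion above.

```python
import re


def amcheck_session(index_name: str, work_mem: str = "1GB") -> list[str]:
    """Return the SQL statements for one amcheck run, in execution order.

    Both arguments are illustrative placeholders; pick values that suit
    the index being checked and the memory available on the server.
    """
    # Accept only plain, optionally schema-qualified identifiers so the
    # name can be embedded in a string literal without quoting issues.
    ident = r"[A-Za-z_][A-Za-z0-9_]*"
    if not re.fullmatch(rf"{ident}(\.{ident})?", index_name):
        raise ValueError(f"unexpected index name: {index_name!r}")
    # Accept only simple memory settings such as 65536, 512MB or 1GB.
    if not re.fullmatch(r"\d+(kB|MB|GB|TB)?", work_mem):
        raise ValueError(f"unexpected work_mem value: {work_mem!r}")
    return [
        # Surface amcheck's debug output in the client session.
        "SET client_min_messages = debug1;",
        # A larger work_mem allows a larger Bloom filter for the
        # downlink-block-is-present check.
        f"SET work_mem = '{work_mem}';",
        # Second argument true = also verify the heap against the index.
        f"SELECT bt_index_parent_check('{index_name}'::regclass, true);",
    ]


if __name__ == "__main__":
    print("\n".join(amcheck_session("my_index")))
```

The statements are meant to be pasted into a single psql session, since both SET commands only affect the session that runs them.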
> postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
> WARNING: page 162136064 of relation base/16385/21372 does not exist
These WARNING messages all reference block numbers that look like
32 bits of random garbage, but could be from a very large relation.
The relevant WAL record is from B-Tree's opportunistic LP_DEAD garbage
collection (not VACUUM). Note that Andres changed this mechanism for
v12, so that latestRemovedXid is calculated on the primary, rather
than on the standby. I think that this error comes from
btree_xlog_delete_get_latestRemovedXid(), which is in 11 but not
master/12.
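To put a number on "very large", here is a quick back-of-the-envelope sketch for the block number in the WARNING quoted above. It assumes the default 8 kB block size and 1 GB relation segment files (131072 blocks per segment); a build with different settings would give different figures. The function names are made up for illustration.

```python
BLCKSZ = 8192          # assumed default block size, in bytes
RELSEG_SIZE = 131072   # assumed blocks per segment file (1 GB / 8 kB)


def min_relation_bytes(block_number, block_size=BLCKSZ):
    """Smallest relation size, in bytes, that would contain block_number."""
    # Block numbers are zero-based, so block N needs N + 1 blocks on disk.
    return (block_number + 1) * block_size


def segment_location(block_number, blocks_per_segment=RELSEG_SIZE):
    """Return (segment file number, block offset within that segment)."""
    return divmod(block_number, blocks_per_segment)


if __name__ == "__main__":
    block = 162136064  # block number from the WARNING quoted above
    size = min_relation_bytes(block)
    segment, offset = segment_location(block)
    print(f"relation must be at least {size / 2**40:.2f} TiB")
    print(f"block would live in segment {segment}, block offset {offset}")
```

Under these assumptions the relation would need to be a little over 1.2 TiB for that block to exist, which is easy to compare against the actual size of the table or index on the primary.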
I wonder, is "base/16385/21351" the index or the table? Is it possible
to run pg_waldump? I think it's the table.
If the problem is in btree_xlog_delete_get_latestRemovedXid(), then it
is perhaps unsurprising that there isn't evidence of any lasting
corruption.
--
Peter Geoghegan