Re: relation OID in ReorderBufferToastReplace error message

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Drouvot, Bertrand" <bdrouvot(at)amazon(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: relation OID in ReorderBufferToastReplace error message
Date: 2021-07-15 00:44:34
Message-ID: ace5f8ea-4777-b045-ba02-6a042a071919@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/2/21 18:57, Jeremy Schneider wrote:
> On 7/1/21 20:44, Tom Lane wrote:
>> So I think the relation name is what to print here. That's generally
>> what we do, and there's not much argument for this case to be different.
>
> Works for me.  Anything in the error message is quickly and easily
> visible to users, without attaching debuggers or decoding WAL. A lot
> of PostgreSQL users won't know how to do the advanced troubleshooting,
> but knowing the table name might give a clue for better identifying
> SQL that might have been related, and could help produce better bug
> reports to the mailing lists in the future.
>
> Related to that, I do think it could be useful to backpatch this - we
> know that users are hitting this error on older versions.  Even though
> it's most likely that the speculative insert decoding memory leak bug
> will address the problem, this update still seems low-risk and useful
> to me just in the off-chance that someone hits the error again.

I have a few new thoughts, related to some new recent developments here.

Back in the first thread, Amit had asked if this was reproducible.  The
user we know of who recently ran into this error hasn't been able to
come up with a series of steps that reliably reproduces it, but they did
encounter the error again last week on a test system after they left a
test workload running for some time. (This apparently took a bit of
effort to do and isn't easily repeatable.)

We investigated that system and we verified specifically for the
incident last week that it definitely did *not* hit the speculative
insert abort bug. We attached a debugger and checked the
XLH_DELETE_IS_SUPER flag just before the decoding process error-ed out,
and the flag was not set.

We looked a bit further at that system and while I'm not sure everything
is solved, it did help Bertrand find one clear bug with an obvious
reproduction. We have proposed that fix for the next commitfest:

https://commitfest.postgresql.org/34/3241/

The process of trying to understand this recent incident has given me
some new insight about what information would be helpful up front in
this error message for faster resolution.

First off, and most importantly, the current WAL record we're processing
when the error is encountered. I wonder if it could easily print the LSN?

Secondly, the transaction ID. In the specific bug Bertrand found, the
problem is actually not with the actual WAL record that's being
processed - but rather because previous WAL records in the same
transaction left the decoder process in a state where the current WAL
record [a commit] generated an error.  So it's the entire transaction
that needs to be examined to reproduce the error.  (Andres actually
pointed this out on the original thread back in December 2019.)  I
realize that once you know the LSN you can easily get the XID with
pg_waldump, but personally I'd just as soon include the XID in the error
message since I think it will usually be a first step for debugging any
problems with WAL decoding. The I can go straight to filtering that XID
on my first pg_waldump run.

Thoughts?

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2021-07-15 01:21:51 Re: [PATCH] Use optimized single-datum tuplesort in ExecSort
Previous Message Ranier Vilela 2021-07-15 00:30:38 Re: [PATCH] Use optimized single-datum tuplesort in ExecSort