Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date: 2021-11-12 23:12:41
Message-ID: CAH2-WzmLkkc8W09SGxnv9pD_v=2waa1e3TmYQHBDZaVOdXEyfA@mail.gmail.com
Lists: pgsql-bugs

On Fri, Nov 12, 2021 at 2:57 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > What would it actually mean to rely on it, or to not rely on it?
>
> That we shouldn't throw an error / assert out if we find such a tuple.

I agree with you about making this an error -- let's not do that. But
I disagree about making this an assertion. I *strongly* disagree, in
fact.

I am quite willing to accept general uncertainty about what's really
possible when it comes to what HTSV might say in edge cases. But I
think we need to take a firm position on what we believe is possible,
by having an assertion that formally represents what we believe to be
true, including our reasoning. If that turns out to be wrong, great --
we then have the opportunity to be less wrong in the future. What has
it cost us? The minor annoyance of turning a buildfarm animal red for
a while?

> > When you talk about what HTSV thinks of the tuple, you're merely talking
> > about how to behave in the event of a specific form of HOT chain corruption
> > (a theoretical background risk for HOT chains that's nothing new).
>
> My point is that I don't think it necessarily signals corruption. But a very
> short term transient state under heavy concurrency.

I think that that's reasonable as a working assumption -- I really do.
I also think that you need pretty thorough assertions for this.

> > We need to be pragmatic here. There is some uncertainty about what
> > HTSV might say about a disconnected tuple in the absence of
> > corruption, or there is a risk of a new problem like that coming up in
> > the future -- let's work within those confines, then. What do you want
> > to do about that? There aren't that many choices, since, to repeat,
> > the tuple is "morally" DEAD no matter what. Even with corruption, even
> > without corruption in the presence of some unanticipated corner case
> > with HTSV -- this is fundamental.
>
> I think we can assert/error out if it's visible, that's clearly
> corruption. I'd personally not add assert/error checks for other states, given
> that it could plausibly happen without indicating a problem. Debugging
> transient errors that happen rarely, under high load, with nontrivial
> workloads isn't fun.

What if it's just an assertion failure (just for non-LIVE HTSV return
codes)? An assertion is a very different thing from a defensive "can't
happen" ERROR.

There is a decent chance that there is a bug if the return code is,
say, HEAPTUPLE_INSERT_IN_PROGRESS. I just cannot think of a scenario
where this specific code path sees a tuple as INSERT_IN_PROGRESS that
isn't also very broken.
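To make the distinction concrete, here is a minimal standalone sketch of how the checks discussed above might be laid out. Everything in it is hypothetical: the enum is a local mirror of PostgreSQL's HTSV_Result values so that it compiles on its own, and check_disconnected_tuple() and DisconnectedAction are made-up names rather than anything in the tree. A visible (LIVE) tuple is reported as corruption in every build. INSERT_IN_PROGRESS gets only an assertion, which is compiled out under NDEBUG, much as PostgreSQL's Assert() only exists in --enable-cassert builds. The remaining states are treated as "morally" DEAD.

```c
#include <assert.h>

/* Hypothetical local mirror of PostgreSQL's HTSV_Result values,
 * so that this sketch is self-contained. */
typedef enum
{
    HEAPTUPLE_DEAD,
    HEAPTUPLE_LIVE,
    HEAPTUPLE_RECENTLY_DEAD,
    HEAPTUPLE_INSERT_IN_PROGRESS,
    HEAPTUPLE_DELETE_IN_PROGRESS
} HTSV_Result;

/* Hypothetical outcome for a tuple no longer reachable from its
 * HOT chain (a "disconnected" tuple). */
typedef enum
{
    DISCONNECTED_PRUNE,         /* treat as DEAD and remove it */
    DISCONNECTED_CORRUPT        /* visible tuple: clearly corruption */
} DisconnectedAction;

static DisconnectedAction
check_disconnected_tuple(HTSV_Result res)
{
    switch (res)
    {
        case HEAPTUPLE_LIVE:
            /* Visible tuple: report it in every build, the way a
             * real server might raise an ERROR. */
            return DISCONNECTED_CORRUPT;

        case HEAPTUPLE_INSERT_IN_PROGRESS:
            /* Believed impossible on this code path. The assertion
             * records that belief for debug builds only; with NDEBUG
             * it disappears and the tuple is pruned as DEAD. */
            assert(!"INSERT_IN_PROGRESS for disconnected tuple");
            return DISCONNECTED_PRUNE;

        case HEAPTUPLE_DEAD:
        case HEAPTUPLE_RECENTLY_DEAD:
        case HEAPTUPLE_DELETE_IN_PROGRESS:
            /* Possibly a short-lived state under concurrency; the
             * tuple is "morally" DEAD either way. */
            return DISCONNECTED_PRUNE;
    }
    return DISCONNECTED_PRUNE;  /* unreachable; keeps compilers quiet */
}
```

The design point is the asymmetry: a production build never fails on the uncertain states, while an assertion-enabled build (such as a buildfarm animal) turns red as soon as the stated belief about INSERT_IN_PROGRESS proves wrong.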

--
Peter Geoghegan
