Re: BUG #17791: Assert on procarray.c

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robins Tharakan <tharakan(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17791: Assert on procarray.c
Date: 2023-02-15 05:06:12
Message-ID: 20230215050612.po5rjq6zd7oq7cu6@awork3.anarazel.de
Lists: pgsql-bugs

Hi,

On 2023-02-15 14:46:13 +1030, Robins Tharakan wrote:
> Thanks for taking a look and possibly you're correct with your
> assumption. I mean I see a ton of FATALs but let me know if I am
> mistaken in assuming them to be harmless (since they just convey that
> the client's gone away)?

Those are indeed not very interesting - although it'd be interesting to know
what caused the clients to go away.

> Nonetheless, I have provided error logs going back till Oct 22 just in
> case the engine can recover from some of those scenarios. Two things
> about the test scenario that may be relevant:
>
> 1. Since performance was the least of my worries, the postgres server
> and the client workload are on the same box. Add dblink / FDW to this
> mix, and it is easy to end up with a ton of loopback connections
> (think SELECT dblink_connect() FROM pg_catalog.pg_class) - IMO
> noteworthy, since there are a ton of "Broken pipe"s and one instance
> of 'too many file descriptors'.

I think the "too many file descriptors" bit might be the interesting part.

I suspect the reason you're not seeing this on newer versions is that 13+ has

commit 3d475515a15f70a4a3f36fbbba93db6877ff8346
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: 2020-02-24 17:28:33 -0500

Account explicitly for long-lived FDs that are allocated outside fd.c.

But I can't yet explain precisely why that causes the assertion failures. A
vague guess is that we fail to write 2PC state files due to the lack of FD
accounting, throw an error due to that, and then hit that assertion while
handling the error.

It might be worth trying to reproduce the issue with a much lower ulimit -S
-n, to reach the problematic state more quickly. A reproducer would be very
useful.
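
A minimal sketch of one way to apply a lower soft FD limit to a single command
without touching the calling shell's own limit. The helper name
run_with_fd_limit and the value 64 are illustrative assumptions, not anything
from the report. It is the programmatic equivalent of running "ulimit -S -n 64"
in a subshell before the command.

```python
import resource
import subprocess


def run_with_fd_limit(cmd, soft_limit):
    """Run cmd with its soft RLIMIT_NOFILE lowered to soft_limit.

    Hypothetical helper: only the child process (and whatever it
    spawns) sees the lower limit; the caller's limit is unchanged.
    """

    def lower_limit():
        # Runs in the child after fork() and before exec().
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        soft = soft_limit
        if hard != resource.RLIM_INFINITY:
            # The soft limit may never exceed the hard limit.
            soft = min(soft_limit, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    return subprocess.run(
        cmd, preexec_fn=lower_limit, capture_output=True, text=True
    )


if __name__ == "__main__":
    # Ask a shell started under the lowered limit to report it.
    result = run_with_fd_limit(["sh", "-c", "ulimit -S -n"], 64)
    print(result.stdout.strip())
```

To use this for a reproducer, the command would presumably be whatever starts
the server under test (for example pg_ctl -D <datadir> start), since the
postmaster and its backends inherit the limit of the process that launched
them.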

> 2. All versions are subjected to similar workload and it is possible
> that v13+ has generally improved in this area, and thus this possibly
> crashes less? Unsure.

What range of versions / commits are you testing this workload on?

Are you testing 11 as well? Because I don't see why we'd have the issue on 12,
but not 11.

Greetings,

Andres Freund
