Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Etsuro Fujita <etsuro(dot)fujita(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds
Date: 2024-02-08 23:50:32
Message-ID: 1273950.1707436232@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Andres Freund <andres(at)anarazel(dot)de> writes:
> I think we ought to understand *why* we are getting the "Too many open
> files". The AcquireExternalFD() in CreateWaitEventSet() should prevent
> that.

Actually, I think the AcquireExternalFD() in CreateWaitEventSet() is
*causing* that and needs to be removed.

What is happening in Alexander's new example is that we are doing
AcquireExternalFD() for each postgres_fdw connection
(cf. libpqsrv_connect in libpq-be-fe-helpers.h), and the example
is tuned to bring that exactly up to the limit of what
AcquireExternalFD() allows. Then the next WaitLatchOrSocket call
fails, because it does

WaitEventSet *set = CreateWaitEventSet(CurrentResourceOwner, 3);

Then when pgfdw_abort_cleanup tries to clean up the connections'
state, it needs to do WaitLatchOrSocket again, and that fails again,
and we PANIC because we're already in abort state.

Since WaitLatchOrSocket is going to free this WaitEventSet before it
returns, it's not apparent to me why we need to count it as a
long-lived FD: we could just as well assume that it can slide in under
the NUM_RESERVED_FDS limit. Or perhaps use ReserveExternalFD instead
of AcquireExternalFD. We'd need some API extension to tell latch.c to
do that, but that doesn't seem hard. (Unless we could consider that
all WaitEventSets should use ReserveExternalFD? Not sure I want to
argue for that though.)

I guess a third possibility is that WaitLatchOrSocket could just
permanently hang onto the WaitEventSet once it's got one.

> One annoying bit is that AcquireExternalFD() failing emits the same error as
> if epoll_create1() itself failing, including the same errno.

It's the former. I tend to agree now that maybe using the same error
text wasn't too smart.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andres Freund 2024-02-09 00:07:34 Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds
Previous Message Tom Lane 2024-02-08 22:06:46 Re: BUG #17828: postgres_fdw leaks file descriptors on error and aborts aborted transaction in lack of fds