Quick Links

Re: OOM in libpq and infinite loop with getCopyStart()

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: OOM in libpq and infinite loop with getCopyStart()
Date:	2016-03-22 07:37:07
Message-ID:	CAB7nPqQ6wHEE8U4Sw3-3WVj8mXbZN7VXYVdor7iXyQHGhe+tmw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Mar 22, 2016 at 2:19 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Mar 22, 2016 at 9:46 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>
>> Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> writes:
>> > On Mon, Mar 21, 2016 at 10:13 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
>> > wrote:
>> >> It is very difficult to believe that this is a good idea:
>> >>
>> >> --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> >> +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c
>> >> @@ -445,6 +445,7 @@ libpqrcv_PQexec(const char *query)
>> >> if (PQresultStatus(lastResult) == PGRES_COPY_IN ||
>> >> PQresultStatus(lastResult) == PGRES_COPY_OUT ||
>> >> PQresultStatus(lastResult) == PGRES_COPY_BOTH ||
>> >> + PQresultStatus(lastResult) == PGRES_FATAL_ERROR ||
>> >> PQstatus(streamConn) == CONNECTION_BAD)
>> >> break;
>> >>
>> >> I mean, why would it be a good idea to blindly skip over fatal errors?
>>
>> > I think it is not about skipping the FATAL error, rather to stop trying
>> > to
>> > get further results on FATAL error.
>>
>> If the code already includes "lost the connection" as a case to break on,
>> I'm not quite sure why "got a query error" is not.
>>
>
> This error check is exactly same as PQexecFinish() and there some
> explanation is given in comments which hints towards the reason for
> continuing on error, basically both the functions PQexecFinish() and
> libpqrcv_PQexec() returns the last result if there are many and
> PQexecFinish() concatenates the errors as well in some cases. Do we see any
> use in continuing to get result after getting PGRES_FATAL_ERROR error?

I spent quite a couple of hours looking at that, and I still fail to
see how it would be an advantage to stack errors. We reduce the error
message visibility for frontend clients, and this does not prevent
clients to actually see error messages generated by a backend, take a
WAL sender. And actually depending on the message type received we may
end up by simply ignoring those error messages and have libpq discard
them. So it seems like an oversight of 03a571a4.

One could always say that this change breaks the case where multiple
error messages are sent in a row from a client with COPY_* (IN, OUT
and BOTH), because we'd get back the last error message, while with
this change we let client know the first one, though that would mean
that client is missing something when using COPY_BOTH.

Also...
@@ -2023,6 +2030,11 @@ PQexecFinish(PGconn *conn)
result->resultStatus == PGRES_COPY_BOTH ||
conn->status == CONNECTION_BAD)
break;
+ else if ((conn->asyncStatus == PGASYNC_COPY_IN ||
+ conn->asyncStatus == PGASYNC_COPY_OUT ||
+ conn->asyncStatus == PGASYNC_COPY_BOTH) &&
+ result->resultStatus == PGRES_FATAL_ERROR)
+ break;
The reason behind this check in PQexecFinish is that we need to make
the difference between an error where there is not enough data, which
means that we want to retry again the message fetching through
parseInput3(), and the case where an actual OOM happened, where we
want to exit immediately and let the caller know that there has been a
frontend error. Now the other reason why we went on with
PGRES_FATAL_ERROR here is that it was discussed that it was an
overkill to assign a special status to PGASYNC to detect a
frontend-side failure, because we already treat other frontend-side
errors with PGRES_FATAL_ERROR and assign an error message to them (see
for example when an allocation fails for 'C', introduced in another
patch of the OOM failure series), and we still need to know that we
are in PGASYNC_COPY_* mode in this code path as well because it means
that the start message has been processed, but we had a failure in
deparsing it so we bypassed it, and need to fail immediately. So using
PGREC_FATAL_ERROR simplifies the code paths creating an error stack.

Hope this brings some light in.
--
Michael

In response to

Re: OOM in libpq and infinite loop with getCopyStart() at 2016-03-22 05:19:29 from Amit Kapila

Responses

Re: OOM in libpq and infinite loop with getCopyStart() at 2016-03-29 15:01:45 from David Steele

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Peter Geoghegan	2016-03-22 07:53:31	Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Previous Message	Craig Ringer	2016-03-22 07:10:21	Re: Applying logical replication changes by more than one process