Re: PQexec() hangs on OOM

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Oleksandr Shulgin <oleksandr(dot)shulgin(at)zalando(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: PQexec() hangs on OOM
Date: 2015-09-04 07:25:35
Message-ID: CAB7nPqRWbL15EFVQdEzcgR4aUrTCtkCJKvUoRkQysgbCjEezKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Sep 4, 2015 at 2:07 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:

>
> On Mon, Aug 31, 2015 at 12:55 PM, Michael Paquier <
> michael(dot)paquier(at)gmail(dot)com> wrote:
>>
>> On Sat, Aug 29, 2015 at 10:35 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>>>
>>>
>>> One thing that looks slightly non-elegant about this approach is that
>>> new async status (PGASYNC_FATAL) is used to deal with errors only
>>> in few paths in function pqParseInput3() and not-other paths which should
>>> be okay if there is no other better way.
>>>
>>
>> I considered using this logic in more code paths, but I kept focused on
>> having a not-too-invasive back-patchable patch as the refactoring that
>> would occur may be too heavy for what is usually pushed to the stable
>> branches. Do you think it would be better to get a cleaner refactoring
>> patch first and globalize the approach with PGASYNC_FATAL?
>>
>
> No wait, lets first try to see if this the best fix for Copy path.
>

Sure. Thanks. (We could remove some dead code of libpq though).

> The problem with the current approach is that even if we are able to
> receive error, it doesn't completely clear the previous command
> execution. As an example, if using debugger, I make getCopyStart()
> return OOM error, then the next command execution fails.
>
> postgres=# copy t1 from stdin;
> out of memory
> postgres=# copy t1 from stdin;
> another command is already in progress
>

Oops. Indeed. Using this new flag is obviously not a good idea because the
connection remains in a state that it considers as PGASYNC_FATAL and
actually it may not even finish to consume the remaining messages. Perhaps
I was too sleepy last time I tested that, well...

> It was actually a far cleaner approach to have a failure flag to decide if
>> based on the async state process should move on to a failure code path or
>> not.
>>
>
> I think you have already tried, but it seems to me that if we can handle
> it based on result status, that would be better, rather than introducing
> new state, but anyway lets first try to sort out above reported problem.
>

So, looking at that again with a fresh eye, I noticed that
getCopyDataMessage has a similar error handling when it fails on
pqCheckInBufferSpace. So looking at what I added, it seems that I am trying
to duplicate the protocol error handling even if everything is already
present, so that's really overdoing it. Hence, let's simply drop this idea
of new flag PGASYNC_FATAL and instead treat the OOM as a sync handling
error with the server, like in the patch attached.

When emulating an OOM with this patch, I am getting this error instead of
the infinite loop, which looks acceptable to me:
=# copy aa to stdout;
out of memory
lost synchronization with server: got message type "H", length 5
The connection to the server was lost. Attempting reset: Succeeded.

The extra message handling I have added at the end of getCopyStart and
getParamDescriptions still looks more adapted to me when a failure happens,
so I kept it.
Regards,
--
Michael

Attachment Content-Type Size
20150904_libpq_oom_v2.patch application/x-patch 7.7 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Teodor Sigaev 2015-09-04 08:13:47 Re: BUG #13440: unaccent does not remove all diacritics
Previous Message Amit Kapila 2015-09-04 05:07:07 Re: PQexec() hangs on OOM