Re: walsender termination error messages worse in v10

From: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject: Re: walsender termination error messages worse in v10
Date: 2017-06-09 01:16:59
Message-ID: 68260deb-c1bf-a450-8423-908a024a2a95@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 08/06/17 23:57, Andres Freund wrote:
> On 2017-06-03 00:55:22 +0200, Petr Jelinek wrote:
>> On 02/06/17 23:45, Andres Freund wrote:
>>> Hi Petr,
>>>
>>> On 2017-06-02 22:57:37 +0200, Petr Jelinek wrote:
>>>> On 02/06/17 20:51, Andres Freund wrote:
>>>>> I don't understand why the new block is there, nor does the commit
>>>>> message explain it.
>>>>>
>>>>
>>>> Hmm, that particular change can actually be reverted. It was needed for
>>>> one those custom replication commands which were replaced by normal
>>>> query support. I have missed it during the rewrite.
>>>
>>> Doesn't appear to be quite that simple, I get regression test failures
>>> in that case.
>>>
>>
>> Hmm, looks like we still use it for normal COPY handling. So basically
>> the problem is that if we run COPY TO STDOUT and then consume it using
>> the libpqrcv_receive it will end with normal PGRES_COMMAND_OK but we
>> need to call PQgetResult() in that case otherwise libpq thinks the
>> command is still active and any following command will fail, but if we
>> call PQgetResult on dead connection we get that error you complained about.
>
> Should this possibly handled at the caller level? This is a bit too
> much magic for my taste.

It is, I agree with that. I guess the main issue with handling it in
callers is that callers can't access libpq api so there would need to be
some additional abstraction (we did some of the necessary parts of that
for libpqrcv_exec).

>
> Looking at the callers, the new code isn't super-obvious either:
>
> len = walrcv_receive(wrconn, &buf, &fd);
>
> if (len != 0)
> {
> /* Process the data */
> for (;;)
> {
> CHECK_FOR_INTERRUPTS();
>
> if (len == 0)
> {
> break;
> }
> else if (len < 0)
> {
> ereport(LOG,
> (errmsg("data stream from publisher has ended")));
> endofstream = true;
> break;
> }
>
> The len < 0, hidden inside a len != 0, which in the loop again chcks if
> len == 0 (because it's decremented in loop)? And there's two different[5~
> len = walrcv_receive(wrconn, &buf, &fd);
> statements?
>

The logic there is that you have one main loop and one busy loop. When
there is flow of data, things happen as the data are processed and we
don't need to do the bookkeeping of timeouts, latching, other
maintenance tasks that need to be done (like checking table
synchronization in logical replication) because it all happens naturally
and that's what the inner loop is for. But when the upstream gets idle
and does not send anything (or there is network issue, etc) we need to
do these things (otherwise we'd miss timeouts, table synchronization
worker would wait forever on apply, etc) and that's what the outer loop
is for.

We only enter the inner loop if there were some data that's why there is
the firs len != 0, but then the inner loop bases decisions on the return
code of the walrcv_receive() so the if is repeated again (it also will
call walrcv_receive() as long as there are data so it needs to check for
those calls too).

It took me a while to understand why walreceiver does this originally,
but it did make sense to me once I understood the split between the main
(idle) loop and busy loop, so I did it the same way.

>> I guess it would make sense to do conditional exit on
>> (PQstatus(streamConn) == CONNECTION_BAD) like libpqrcv_PQexec does. It's
>> quite ugly code-wise though.
>
> I think that's fine for now. It'd imo be a good idea to improve matters
> here a bit, but for now I've just applied your patch.
>

Okay, thanks.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Petr Jelinek 2017-06-09 01:20:03 Re: logical replication busy-waiting on a lock
Previous Message Thomas Munro 2017-06-08 23:28:01 Re: PG10 transition tables, wCTEs and multiple operations on the same table