Re: pgsql: Make new crash restart test a bit more robust.

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-committers(at)postgresql(dot)org
Subject: Re: pgsql: Make new crash restart test a bit more robust.
Date: 2017-09-19 20:53:18
Message-ID: 20170919205318.sm3vcmsngd633kpd@alap3.anarazel.de
Lists: pgsql-committers pgsql-hackers

On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > So this is genuinely interesting. When the machine is really loaded (as
> > in 6 animals running on a vm at the same time, including valgrind), psql
> > sometimes doesn't get the WARNING message from a shutdown. Instead it
> > gets
> > # psql:<stdin>:3: server closed the connection unexpectedly
> > # This probably means the server terminated abnormally
> > # before or while processing the request.
> > # psql:<stdin>:3: connection to server was lost
>
> That seems pretty weird. Maybe it's not the same case, but in
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02
>
> you can see from the postmaster log that the backend *is* issuing
> the message, or at least it's getting to the server log:
>
> 2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG: connection received: host=[local]
> 2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG: connection authorized: user=andres database=postgres
> 2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG: statement: SELECT $$psql-connected$$;
> ...
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING: terminating connection because of crash of another server process
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT: In a moment you should be able to reconnect to the database and repeat your command.

I think it's likely the same - I've observed the same with the added
instrumentation.

> Have we forgotten an fflush() or something?
>
> Also, maybe the problem is on the client side. I vaguely recall a libpq bug
> wherein it would complain about socket EOF even though data remained
> to be processed. Maybe we reintroduced something like that?

That seems quite possible.

> > We can obviously easily make the test accept both - but are we ok with
> > the client sometimes not getting the message?
>
> I'm not ...

Same here.

I'll see if I can spot the bug in an hour or two. If not I'll make the
test temporarily accept both outputs while investigating?
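
For illustration only, a minimal sketch of what "accept both outputs" could
look like. This is not the actual change to t/013_crash_restart.pl (which is
a Perl TAP test); the function name and structure are hypothetical, and only
the two message strings come from the output quoted above.

```python
import re

# Message the client should normally see when the backend is told to exit
# after another process crashed (text taken from the log quoted above).
CRASH_WARNING = re.compile(
    r"WARNING:\s+terminating connection because of crash of another server process"
)

# Message psql prints when it only notices the closed socket
# (text taken from the psql output quoted above).
CONNECTION_LOST = re.compile(r"server closed the connection unexpectedly")


def is_acceptable_crash_output(stderr_text: str) -> bool:
    """Return True if psql's stderr shows either of the two failure modes."""
    return bool(
        CRASH_WARNING.search(stderr_text)
        or CONNECTION_LOST.search(stderr_text)
    )
```

Accepting the second pattern only hides the symptom; whether the client
should ever miss the WARNING is the open question in this thread.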

Greetings,

Andres Freund
