Re: BUG #14420: Parallel worker segfault

From: Rick Otten <rotten(at)windfish(dot)net>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14420: Parallel worker segfault
Date: 2016-11-14 14:18:42
Message-ID: 8029759f8564c4960af1d15a544a8826@www.windfish.net
Lists: pgsql-bugs

Sorry about forgetting to CC the bugs list when I replied.

I've enabled "-c", and made sure my PGDATA directory has enough space to
collect a full core image. If we get one, I'll let you know.

There were a lot of queries happening at the time of the segfault. The
only new or unusual one that I am aware of was a UNION ALL between two
nearly identical queries, where one side was using a parallel scan and
the other side wasn't. I had just refactored it from a single query
with an "OR" in the WHERE clause because the UNION ALL form was much
faster.
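
Roughly, the rewrite looked like the sketch below. The table and column
names are made up; this is only the shape of the query, not the real
one:

    -- original form: one scan with an OR in the WHERE clause
    SELECT id, payload
    FROM   events
    WHERE  account_id = 42
       OR  legacy_account_id = 42;

    -- refactored form: two nearly identical halves combined with
    -- UNION ALL, only one of which ended up using a parallel scan
    -- (assumes the two predicates are mutually exclusive; a row
    -- matching both would appear twice here but only once with OR)
    SELECT id, payload
    FROM   events
    WHERE  account_id = 42
    UNION ALL
    SELECT id, payload
    FROM   events
    WHERE  legacy_account_id = 42;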

Since Friday I've run another 1M or so of those queries, but it hasn't
segfaulted again.

On 2016-11-12 02:18, Amit Kapila wrote:

> On Sat, Nov 12, 2016 at 9:32 AM, Rick Otten <rotten(at)windfish(dot)net> wrote:
>
> Please keep pgsql-bugs in the loop. It is important to keep everyone
> in the loop, not only because that is how this community works, but
> also because others may see something that you or I can't.
>
>> PostgreSQL was not started with the "-c" option. I'll look into enabling that before this happens again.
>
> makes sense.
>
>> I'll read more from the other debugging article and see if there is
>> anything I can do there as well. Thanks. There were no files
>> generated and dropped in PGDATA this time, unfortunately. Sorry, I
>> know this isn't much to go on, but it is all I know at this time.
>> There wasn't much else that wasn't routine in the logs before or
>> after the two lines I pasted below, other than a bunch of warnings
>> for the 30 or 40 transactions that were in progress, followed by
>> this:
>
> Okay, I don't think we can get anything from these logs. Once a core
> is available we can try to find the reason, but it would be much
> better if we could put together an independent test case that
> reproduces the problem. One possible way is to find the culprit
> query. You might want to log long-running queries, since parallelism
> will generally be used for such queries.
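
Something like this should catch them; the 5-second threshold is just a
guess at what counts as "long-running" here:

    -- log every statement that runs longer than 5 seconds
    -- (threshold is a guess; tune to taste), then reload the config
    ALTER SYSTEM SET log_min_duration_statement = '5s';
    SELECT pg_reload_conf();
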
>
>> 2016-11-11 21:31:26.292 UTC WARNING: terminating connection because of crash of another server process
>> 2016-11-11 21:31:26.292 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> 2016-11-11 21:31:26.292 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
>> 2016-11-11 21:31:26.301 UTC WARNING: terminating connection because of crash of another server process
>> 2016-11-11 21:31:26.301 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>> 2016-11-11 21:31:26.301 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
>> 2016-11-11 21:31:30.762 UTC [unknown] x.x.x.x [unknown] LOG: connection received: host=x.x.x.x port=47692
>> 2016-11-11 21:31:30.762 UTC clarivoy x.x.x.x some_user FATAL: the database system is in recovery mode
>> 2016-11-11 21:31:31.766 UTC LOG: all server processes terminated; reinitializing
>> 2016-11-11 21:31:33.526 UTC LOG: database system was interrupted; last known up at 2016-11-11 21:29:28 UTC
>> 2016-11-11 21:31:33.660 UTC LOG: database system was not properly shut down; automatic recovery in progress
>> 2016-11-11 21:31:33.674 UTC LOG: redo starts at 1DD/4F5A0320
>> 2016-11-11 21:31:33.957 UTC LOG: unexpected pageaddr 1DC/16AEC000 in log segment 00000001000001DD00000056, offset 11452416
>> 2016-11-11 21:31:33.958 UTC LOG: redo done at 1DD/56AEB7F8
>> 2016-11-11 21:31:33.958 UTC LOG: last completed transaction was at log time 2016-11-11 21:31:26.07448+00
>> 2016-11-11 21:31:34.705 UTC LOG: MultiXact member wraparound protections are now enabled
>> 2016-11-11 21:31:34.724 UTC LOG: autovacuum launcher started
>> 2016-11-11 21:31:34.725 UTC LOG: database system is ready to accept connections
>>
>> After that the database was pretty much back to normal. Because
>> everything connects from various pgbouncer instances running
>> elsewhere, they quickly reconnected and started working again without
>> having to restart any applications or services.
