Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers

From: Chris Snook <csnook(at)cloudflare(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Occasional 9.6.10 PMChildFlags fatal error, possibly due to >2 parallel gathers
Date: 2019-02-13 00:51:22
Message-ID: CAONUJSM5X259vAnnwSpqu=VnRECfGSJ-CgRHyS4P5YyRVwkXsQ@mail.gmail.com
Lists: pgsql-bugs

I'll start by saying that I don't have as much information as I'd like to
make a proper bug report, but I want to throw this out there in case anyone
else has seen something similar and can point me in the right direction for
reproducing it.

I have a Debian Stretch system running the 9.6.10 build from the Debian APT
repos. It has been running flawlessly, lightly loaded, for about a year and
a half, only being restarted for security updates. It recently crashed
twice, about an hour apart, due to the postmaster receiving a signal from a
worker task, both times with this complaint in the log:

FATAL: no free slots in PMChildFlags array

For more context, these messages appeared tightly packed around the first
crash, with the first and last messages each repeated hundreds of times:

FATAL: sorry, too many clients already
FATAL: sorry, too many clients already
FATAL: sorry, too many clients already
FATAL: no free slots in PMChildFlags array
WARNING: could not remove shared memory segment "/PostgreSQL.1407760088":
No such file or directory
FATAL: semop(id=2293786) failed: Invalid argument
FATAL: semop(id=2293786) failed: Invalid argument
FATAL: semctl(2064403, 7, SETVAL, 0) failed: Invalid argument
FATAL: semop(id=2621476) failed: Invalid argument
FATAL: semop(id=2621476) failed: Invalid argument
FATAL: semctl(2293786, 1, SETVAL, 0) failed: Invalid argument
FATAL: semctl(2621476, 10, SETVAL, 0) failed: Invalid argument
WARNING: could not remove shared memory segment "/PostgreSQL.1621779631":
No such file or directory
LOG: database system is shut down
FATAL: terminating connection due to unexpected postmaster exit
FATAL: terminating connection due to unexpected postmaster exit
FATAL: terminating connection due to unexpected postmaster exit

The second time, I got this:

FATAL: no free slots in PMChildFlags array
FATAL: semctl(3604480, 14, SETVAL, 0) failed: Invalid argument
FATAL: semctl(3604480, 13, SETVAL, 0) failed: Invalid argument
FATAL: semctl(3604480, 12, SETVAL, 0) failed: Invalid argument
FATAL: semctl(3604480, 11, SETVAL, 0) failed: Invalid argument
FATAL: semctl(3604480, 10, SETVAL, 0) failed: Invalid argument
LOG: database system is shut down
FATAL: terminating connection due to unexpected postmaster exit
FATAL: terminating connection due to unexpected postmaster exit
FATAL: terminating connection due to unexpected postmaster exit

followed, hundreds of "terminating connection" lines later, by this:

FATAL: terminating connection due to unexpected postmaster exit
FATAL: terminating connection due to unexpected postmaster exit
FATAL: terminating connection due to unexpected postmaster exit
FATAL: terminating connection due to administrator command
STATEMENT: SELECT column1 FROM table1 WHERE (column2='value1') AND
(column3->>'key1'='value2') ORDER BY created ASC LIMIT 1
WARNING: could not remove shared memory segment "/PostgreSQL.1032869518":
No such file or directory
FATAL: postmaster exited during a parallel transaction
LOG: database system was interrupted; last known up at 2019-02-12 20:13:55
GMT
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
LOG: incomplete startup packet
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up

and then, hundreds of "database system is starting up" lines later, by this:

FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
LOG: redo starts at E/D1504EA0
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
LOG: invalid record length at E/D15A21B0: wanted 24, got 0
LOG: redo done at E/D15A2188
LOG: last completed transaction was at log time 2019-02-12
20:15:58.259264+00
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up

Possibly of interest: the last completed transaction was about two minutes
after the last known up timestamp.

I asked the application developers using the cluster whether they had
recently made any changes, and they said they had not. A Google search for
the PMChildFlags error message turned up nothing recent. Since the failure
seemed to be related to parallel queries, I looked up that behavior and
found a post indicating that the planner will use 3 workers if the table is
larger than 72 MB and the configuration allows it. We had
max_parallel_workers_per_gather set to 10 for performance testing (the
workload in no way needed that), so that was a possibility. I checked the
sizes of the tables in the cluster and found that the table the crashing
query was accessing is 80 MB, and no other table in the cluster is that
large. A rough way to sanity-check both numbers is sketched below.

My suspicion is that there's some sort of race condition that requires at
least 3 parallel gather workers to hit, that most people aren't hitting it
because they don't set max_parallel_workers_per_gather that high, and that
something specific about my workload triggers it. I think we started
hitting it shortly after an autovacuum updated the planner statistics to
reflect that the table had grown past 72 MB, which would explain why the
crashes didn't start the moment the table itself crossed 72 MB.
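For what it's worth, the most recent autovacuum/autoanalyze times are
visible in pg_stat_user_tables (table1 is again a stand-in name), though by
now they may have been overwritten by later runs:

-- Last autovacuum/autoanalyze times and current size for the suspect table.
SELECT relname,
       last_autovacuum,
       last_autoanalyze,
       pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE relname = 'table1';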

Unfortunately, we didn't have more detailed timestamped logs at the time to
confirm this theory, and it hasn't happened a third time. We've lowered
max_parallel_workers_per_gather to 2, since this is a production system,
but we may be able to rig up a reproducer with some guidance; a rough
starting point is sketched below.
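Purely as a strawman, with entirely hypothetical object names, I'm picturing
a table just over the 72 MB threshold plus enough concurrent sessions to
push the server toward max_connections while parallel workers are being
launched:

-- Build a table a bit over the 72 MB parallel threshold (hypothetical name).
CREATE TABLE repro_t (id bigserial PRIMARY KEY, payload text);
INSERT INTO repro_t (payload)
SELECT repeat('x', 100) FROM generate_series(1, 700000);
ANALYZE repro_t;

-- Match the aggressive setting we had in production; max_worker_processes
-- also has to be high enough for extra workers to actually start.
SET max_parallel_workers_per_gather = 10;

-- Run this from many concurrent clients (e.g. pgbench -c set close to
-- max_connections) so that incoming connection attempts, including the
-- ones rejected with "too many clients already", and parallel workers all
-- compete for postmaster child slots at the same time.
SELECT count(*) FROM repro_t WHERE payload LIKE '%y%';

Whether that actually exercises the same slot exhaustion is just a guess on
my part.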

- Chris
