Re: [sqlsmith] Unpinning error in parallel worker

From: Jonathan Rudenberg <jonathan(at)titanous(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Seltenreich <seltenreich(at)gmx(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [sqlsmith] Unpinning error in parallel worker
Date: 2018-04-24 20:15:31
Message-ID: 1524600931.3892926.1349473696.17D84D5D@webmail.messagingengine.com
Lists: pgsql-hackers

On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg
> <jonathan(at)titanous(dot)com> wrote:
> > This issue happened again in production. Here are the stack traces from three backends that we grabbed before nuking the >400 hanging backends.
> >
> > [...]
> > #4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
>
> Thank you. These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock. My theory is that they can't because
> one backend -- the one that emitted the error "FATAL: cannot unpin a
> segment that is not pinned" -- is deadlocked against itself. After
> emitting that error, you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(),
> which tries to acquire DynamicSharedMemoryControlLock again, even
> though we already hold it at that point.
>
> I'll write a patch to fix that unpleasant symptom. While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again. That's a horrible failure mode as you have
> discovered.
>
> But that isn't the root problem: we shouldn't be raising that error,
> and I'd love to see the stack of the one process that did that and
> then self-deadlocked. I will have another go at trying to reproduce
> it here today.
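
The self-deadlock described in the quoted text can be illustrated with a toy model. This is only a sketch under stated assumptions, not the proposed patch and not PostgreSQL's LWLock API: ToyLock, toy_acquire(), toy_release(), toy_exit_cleanup() and the two unpin_* functions are hypothetical names invented here. A re-acquire that would block forever in a real backend is reported as a false return value so the example terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a non-reentrant lock such as
 * DynamicSharedMemoryControlLock. Not the real LWLock API. */
typedef struct { bool held_by_me; } ToyLock;

/* Returns false where a real non-reentrant lock would wait on itself
 * forever; returns true once the lock has been taken. */
static bool toy_acquire(ToyLock *lock)
{
    if (lock->held_by_me)
        return false;           /* self-deadlock in a real backend */
    lock->held_by_me = true;
    return true;
}

static void toy_release(ToyLock *lock)
{
    assert(lock->held_by_me);   /* releasing an unheld lock is a bug */
    lock->held_by_me = false;
}

/* Stand-in for the exit-time cleanup (the dsm_detach() step in the quoted
 * stack), which needs the same lock. Returns true if cleanup completes. */
static bool toy_exit_cleanup(ToyLock *lock)
{
    if (!toy_acquire(lock))
        return false;
    toy_release(lock);
    return true;
}

/* Problem pattern: the error path starts while the lock is still held.
 * Returns true if the process finishes without deadlocking on itself. */
static bool unpin_raise_while_locked(ToyLock *lock, bool pinned)
{
    toy_acquire(lock);
    if (!pinned)
        return toy_exit_cleanup(lock);  /* "FATAL" raised under the lock */
    toy_release(lock);
    return true;
}

/* Safer pattern: release the lock first, then take the error path. */
static bool unpin_release_then_raise(ToyLock *lock, bool pinned)
{
    toy_acquire(lock);
    if (!pinned) {
        toy_release(lock);
        return toy_exit_cleanup(lock);
    }
    toy_release(lock);
    return true;
}
```

With a fresh ToyLock and pinned set to false, unpin_raise_while_locked() returns false (the cleanup cannot get the lock), while unpin_release_then_raise() returns true and leaves the lock free for other backends.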

Thanks for the update!

We have turned off parallel queries (using max_parallel_workers_per_gather = 0) for now, as the production impact of this bug is unfortunately quite problematic.

What will this failure look like with the patch you've proposed?

Thanks again,

Jonathan
