| From: | Jonathan Rudenberg <jonathan(at)titanous(dot)com> |
|---|---|
| To: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
| Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Seltenreich <seltenreich(at)gmx(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: [sqlsmith] Unpinning error in parallel worker |
| Date: | 2018-04-24 20:15:31 |
| Message-ID: | 1524600931.3892926.1349473696.17D84D5D@webmail.messagingengine.com |
| Lists: | pgsql-hackers |

On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg
> <jonathan(at)titanous(dot)com> wrote:
> > This issue happened again in production, here are the stack traces from three we grabbed before nuking the >400 hanging backends.
> >
> > [...]
> > #4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
>
> Thank you. These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock. My theory is that they can't because
> one backend -- the one that emitted the error "FATAL: cannot unpin a
> segment that is not pinned" -- is deadlocked against itself. After
> emitting that error you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown() which runs dsm_detach()
> which tries to acquire DynamicSharedMemoryControlLock again, even
> though we already hold it at that point.
>
> I'll write a patch to fix that unpleasant symptom. While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again. That's a horrible failure mode as you have
> discovered.
>
> But that isn't the root problem: we shouldn't be raising that error,
> and I'd love to see the stack of the one process that did that and
> then self-deadlocked. I will have another go at trying to reproduce
> it here today.

Thanks for the update!

We have turned off parallel queries (using max_parallel_workers_per_gather = 0) for now, as the production impact of this bug is unfortunately quite problematic.

What will this failure look like with the patch you've proposed?

Thanks again,
Jonathan
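
For anyone following the thread, the self-deadlock described in the quoted text can be sketched with a toy model. This is only an illustration under stated assumptions, not PostgreSQL code and not the proposed patch: `ToyLock`, `toy_acquire`, `toy_fatal`, `unpin_buggy` and `unpin_fixed` are hypothetical names, and the "lock" is a plain flag that counts a second acquisition by the holder instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a non-reentrant lock; not a real LWLock. */
typedef struct { bool held; } ToyLock;

ToyLock control_lock = { false };
int self_deadlocks = 0;   /* how often a real lock would have hung */

/* Returns false if the lock is already held. With a real non-reentrant
 * lock the caller would block forever here instead of returning. */
bool toy_acquire(ToyLock *lock)
{
    if (lock->held) {
        self_deadlocks++;
        return false;
    }
    lock->held = true;
    return true;
}

void toy_release(ToyLock *lock)
{
    assert(lock->held);
    lock->held = false;
}

/* Stand-in for exit-time cleanup that itself needs the lock. */
void toy_cleanup(void)
{
    if (toy_acquire(&control_lock))
        toy_release(&control_lock);
}

/* Stand-in for raising a fatal error: cleanup runs on the way out. */
void toy_fatal(const char *msg)
{
    fprintf(stderr, "FATAL: %s\n", msg);
    toy_cleanup();
}

/* Buggy shape: the error is raised while the lock is still held. */
void unpin_buggy(bool pinned)
{
    toy_acquire(&control_lock);
    if (!pinned) {
        toy_fatal("cannot unpin a segment that is not pinned");
        toy_release(&control_lock);   /* reachable only in this toy model */
        return;
    }
    toy_release(&control_lock);
}

/* Safer shape: release the lock first, then raise the error. */
void unpin_fixed(bool pinned)
{
    toy_acquire(&control_lock);
    if (!pinned) {
        toy_release(&control_lock);
        toy_fatal("cannot unpin a segment that is not pinned");
        return;
    }
    toy_release(&control_lock);
}
```

Calling `unpin_buggy(false)` increments `self_deadlocks`, because the cleanup path tries to take a lock the caller still holds. `unpin_fixed(false)` reports the same error but leaves the counter unchanged, since the lock is released before the error path runs.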