Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)

From: Antonin Houska <ah(at)cybertec(dot)at>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: walsender "wakeup storm" on PG16, likely because of bc971f4025c (Optimize walsender wake up logic using condition variables)
Date: 2023-08-17 12:25:40
Message-ID: 12156.1692275140@antos
Lists: pgsql-hackers

Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:

> On Wed, Aug 16, 2023 at 11:18 PM Antonin Houska <ah(at)cybertec(dot)at> wrote:
> > I'm trying to understand this patch (commit 5ffb7c7750) because I use a
> > condition variable in an extension. One particular problem occurred to
> > me; please consider:
> >
> > ConditionVariableSleep() gets interrupted, so AbortTransaction() calls
> > ConditionVariableCancelSleep(), but the signal was sent in between. Shouldn't
> > at least AbortTransaction() and AbortSubTransaction() check the return value
> > of ConditionVariableCancelSleep(), and re-send the signal if needed?
>
> I wondered about that in the context of our only in-tree user of
> ConditionVariableSignal(), in parallel btree index creation, but since
> it's using the parallel executor infrastructure, any error would be
> propagated everywhere so all waits would be aborted.

I see, ConditionVariableSignal() is currently used only to signal other
workers running in the same transaction. The other callers use
ConditionVariableBroadcast(), so no consumer should miss its signal.
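
To illustrate the scenario I'm worried about (the shared->cv below belongs
to a hypothetical extension, it's not actual pg_squeeze code), the
error-cleanup path could forward the signal when
ConditionVariableCancelSleep() reports that one was consumed:

    /*
     * Error-cleanup sketch.  In PG 16, ConditionVariableCancelSleep()
     * returns true if this process was signaled after the sleep got
     * interrupted; forwarding the signal keeps another waiter from
     * losing its wakeup.
     */
    if (ConditionVariableCancelSleep())
        ConditionVariableSignal(&shared->cv);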

> > Note that I'm just thinking about such a problem, did not try to reproduce it.
>
> Hmm. I looked for users of ConditionVariableSignal() in the usual web
> tools and didn't find anything, so I guess your extension is not
> released yet or not open source. I'm curious: what does it actually
> do if there is an error in a CV-wakeup-consuming backend? I guess it
> might be some kind of work-queue processing system... it seems
> inevitable that if backends are failing with errors, and you don't
> respond by retrying/respawning, you'll lose or significantly delay
> jobs/events/something anyway (imagine only slightly different timing:
> you consume the signal and start working on a job and then ereport,
> which amounts to the same thing in the end now that your transaction
> is rolled back?), and when you retry you'll see whatever condition was
> waited for anyway. But that's just me imagining what some
> hypothetical strawman system might look like... what does it really
> do?

If you're interested, the extension is pg_squeeze [1]. I think the use case is
rather special. All the work is done by a background worker, but a user
function can be called to submit a "task" for the worker and to wait for its
completion. So the function sleeps on a CV and the worker uses that CV to wake
it up. If the function ends with an ERROR, the user is supposed to find a log
message in the worker's output sooner or later. It may sound weird, but the
function exists primarily for regression tests, so an ERROR is a problem anyway.
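
Schematically it follows the usual wait-loop pattern from
condition_variable.h; the task structure and its "done" flag below are
simplified placeholders (locking around the flag omitted for brevity),
not the actual pg_squeeze code:

    /* User function: wait until the worker has completed the task. */
    ConditionVariablePrepareToSleep(&task->cv);
    while (!task->done)
        ConditionVariableSleep(&task->cv, PG_WAIT_EXTENSION);
    ConditionVariableCancelSleep();

    /* Background worker: mark the task completed and wake the waiter. */
    task->done = true;
    ConditionVariableSignal(&task->cv);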

> (FWIW when I worked on a couple of different work queue-like systems
> and tried to use ConditionVariableSignal() I eventually concluded that
> it was the wrong tool for the job because its wakeup order is
> undefined. It's actually FIFO, but I wanted LIFO so that workers have
> a chance to become idle and reduce the pool size; I started to
> think that once you want that level of control you really want to
> build a bespoke wait-list system, so I never got around to proposing
> that we consider changing that.)
>
> Our condition variables are weird. They're not associated with a
> lock, so we made start-of-wait non-atomic: prepare first, then return
> control and let the caller check its condition, then sleep. Typical
> user space condition variable APIs force you to acquire some kind of
> lock that protects the condition first, then check the condition, then
> atomically release-associated-lock-and-start-sleeping, so there is no
> data race but also no time where control is returned to the caller but
> the thread is on the wait list consuming signals. That choice has
> some pros (you can choose whatever type of lock you want to protect
> your condition, or none at all if you can get away with memory
> barriers and magic) and cons. However, as I think Andres was getting
> at, having a non-atomic start-of-wait doesn't seem to require us to
> have a non-atomic end-of-wait and associated extra contention. So
> maybe we should figure out how to fix that in 17.

Thanks for sharing your point of view. I'm fine with this low-level approach:
it's well documented and there are examples in the PG code showing how it
should be used :-)
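
For the record, this is the contrast as I understand it (the condition
flag and its protection are placeholders; the pthread part is just the
textbook pattern, nothing from the PG tree):

    /*
     * Typical user-space API: releasing the mutex and starting to sleep
     * happen atomically, so control never returns to the caller while
     * it is on the wait list.
     */
    pthread_mutex_lock(&mutex);
    while (!condition)
        pthread_cond_wait(&cond, &mutex);
    pthread_mutex_unlock(&mutex);

    /*
     * PostgreSQL: the prepare step puts the process on the wait list,
     * then control returns so the caller can check the condition itself;
     * in that window it may already consume a signal.
     */
    ConditionVariablePrepareToSleep(&cv);
    while (!condition)
        ConditionVariableSleep(&cv, PG_WAIT_EXTENSION);
    ConditionVariableCancelSleep();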

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

[1] https://github.com/cybertec-postgresql/pg_squeeze/
