Re: Condition variable live lock

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Condition variable live lock
Date: 2018-01-05 19:38:37
Message-ID: 9342.1515181117@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> As I feared, the existing regression tests are not really adequate for
> this: gcov testing shows that the sentinel-inserting code path is
> never entered, meaning ConditionVariableBroadcast never sees more
> than one waiter.

Hmm ... adding tracing log printouts in BarrierArriveAndWait disproves
that: the regression tests do, repeatedly, call ConditionVariableBroadcast
when there are two waiters to be woken (and you can get it to be more if
you raise the number of parallel workers specified in the PHJ tests).
It just doesn't happen with --enable-coverage. Heisenberg wins again!

I am not sure why coverage tracking changes the behavior so much, but
it's clear from my results that if you change all the PHJ test cases in
join.sql to use 4 parallel workers, then you get plenty of barrier release
events with 4 or 5 barrier participants --- but running the identical test
under --enable-coverage results in only a very small number of releases
with even as many as 2 participants, let alone more. Perhaps the PHJ test
cases don't run long enough to let slow-starting workers join in?

Anyway, that may or may not indicate something we should tune at a higher
level, but I'm now satisfied that the patch as presented works and does
get tested by our existing tests. So barring objection I'll commit that
shortly.

There are some other things I don't like about condition_variable.c:

* I think the Asserts in ConditionVariablePrepareToSleep and
ConditionVariableSleep ought to be replaced by full-fledged test and
elog(ERROR), so that they are enforced even in non-assert builds.
I don't have a lot of confidence that corner cases that could violate
those usage restrictions would get caught during developer testing.
Nor do I see an argument that we can't afford the cycles to check.

* ConditionVariablePrepareToSleep needs to be rearranged so that failure
to create the WaitEventSet doesn't leave us in an invalid state.

* A lot of the comments could be improved, IMHO.

Barring objection I'll go deal with those things, too.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2018-01-05 19:46:46 Re: pgsql: pg_upgrade: simplify code layout in a few places
Previous Message Robert Haas 2018-01-05 19:37:58 Re: pgsql: pg_upgrade: simplify code layout in a few places