From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-committers <pgsql-committers(at)postgresql(dot)org>
Subject: Re: pgsql: Add parallel-aware hash joins.
Date: 2017-12-28 04:15:00
Message-ID: CAEepm=1OyZcCAxAEWYhmeg0qmWyYF2O2yW7s7A4uVTt-dgRRyQ@mail.gmail.com
Lists: pgsql-committers pgsql-hackers
On Thu, Dec 28, 2017 at 3:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>> I'll address the instability of the regression test output separately.
>
> If you're still looking for data on that --- prairiedog is able to
> reproduce the "multibatch = f" variant about one time in thirty.
> I modified the test case to print out the full EXPLAIN ANALYZE output
> rather than a heavily filtered version. Here's a typical successful run:
>
> ! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
> ! Execution time: 424.395 ms
>
> and here's a failing run:
>
> ! Buckets: 1024 (originally 2048) Batches: 1 (originally 1) Memory Usage: 0kB
> ! Execution time: 243.120 ms
>
> I don't have enough insight to be totally sure what this means, but the
> "Memory Usage: 0kB" bit is obviously bogus, so I'd venture that at least
> part of the issue is failure to return stats from a worker.
Hmm. Yeah, that seems quite likely -- thanks. Investigating now.
> I also find
> it most curious that the "success" case is a lot slower than the "not
> success" case. Perhaps this is related to your livelock issue? Doing
> another run, I get something even slower:
>
> ! Buckets: 1024 (originally 2048) Batches: 8 (originally 1) Memory Usage: 321kB
> ! Execution time: 509.607 ms
Yes, that looks a lot like the ConditionVariableBroadcast() livelock
problem. That query with the same 8-batch 2-worker plan runs in ~12ms
and ~19ms for me on two different machines here without the run-to-run
variation you see.
> Aside from the instability problems, I'm pretty unhappy about how much
> the PHJ patch has added to the runtime of "make check". I do not think
> any one feature can justify adding 20% to that. Can't you cut down the
> amount of data processed by these new test cases?
Isn't that mostly because of the CV livelock problem? Here are the
real times I got with "time make check" on my slow 2 core Intel(R)
Celeron(R) CPU G1610T @ 2.30GHz running FreeBSD, which seems to be
prone to that problem on certain queries:
84940644 = 34.83s (before hash join regression tests)
fa330f9a = 35.81s (hash join regression tests)
18042840 = 44.92s (parallel-aware hash joins + new tests)
So the PHJ patch made it take 25% longer, similar to what you
reported. But then if I apply the following band-aid kludge to
condition_variable.c to limit the looping, I get:
- while (ConditionVariableSignal(cv))
+ while (nwoken < max_parallel_workers && ConditionVariableSignal(cv))
++nwoken;
18042840 + kludge = 36.70s
So without the effects of that bug it's only taking 2.4% longer than
commit fa330f9a. Is that acceptable for a feature of this size and
complexity? I will also look into making the data sets smaller.
--
Thomas Munro
http://www.enterprisedb.com