From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: bad estimation together with large work_mem generates terrible slow hash joins
Date: 2014-09-09 22:49:37
Message-ID: 540F8401.6040808@fuzzy.cz
Lists: pgsql-hackers
On 9.9.2014 16:09, Robert Haas wrote:
> On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> So I only posted the separate patch for those who want to do a review,
>> and then a "complete patch" with both parts combined. But it sure may be
>> a bit confusing.
>
> Let's do this: post each new version of the patches only on this
> thread, as a series of patches that can be applied one after another.
OK, attached. Apply in this order:
1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch
>>> In ExecChooseHashTableSize(), if buckets_bytes is independent of
>>> nbatch, could we just compute it before working out dbatch, and just
>>> deduct it from hash_table_bytes? That seems like it'd do the same
>>> thing for less work.
>>
>> Well, we can do that. But I really don't think that's going to make a
>> measurable difference. And I think the code is clearer this way.
>
> Really? It seems to me that you're doing, a second time, more or less
> the same calculation that's already been done. It took me 20
> minutes of staring to figure out what it was really doing. Now
> admittedly I was a little low on caffeine, but...
I reworked this part a bit to make it easier to understand, mostly
following your recommendations.
>
>>> Either way, what happens if buckets_bytes is already bigger than
>>> hash_table_bytes?
>>
>> Hm, I don't see how that could happen.
>
> How about an Assert() and a comment, then?
Done. According to the comment, we should never really get over 25%,
except maybe for strange work_mem values, because we keep nbuckets as a
power of two. But even then we should not reach 50%, so I added this
assert:

    Assert(buckets_bytes <= hash_table_bytes/2);

We then subtract buckets_bytes and continue with the loop.
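For illustration, a minimal sketch of that accounting (the variable
names mirror ExecChooseHashTableSize(), but this is a simplification,
not the actual patch):

    /*
     * Sketch only: reserve space for the bucket array up front, then
     * size the batches from whatever remains.
     */
    long    hash_table_bytes = work_mem * 1024L;
    long    buckets_bytes = nbuckets * sizeof(HashJoinTuple);

    /*
     * nbuckets is a power of two, so the bucket array should stay
     * around 25% of hash_table_bytes and can never reach 50%.
     */
    Assert(buckets_bytes <= hash_table_bytes / 2);

    /* deduct the bucket array, leaving the rest for the tuples */
    hash_table_bytes -= buckets_bytes;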
>
>>> A few more stylistic issues that I see:
>>>
>>> + if ((hashtable->nbatch == 1) &&
>>> + if (hashtable->nbuckets_optimal <= (INT_MAX/2))
>>> + if (size > (HASH_CHUNK_SIZE/8))
>>>
>>> While I'm all in favor of using parentheses generously where it's
>>> needed to add clarity, these and similar usages seem over the top to
>>> me.
>>
>> It seemed more readable to me at the time of coding, and it still
>> seems better this way, but ...
>>
>> https://www.youtube.com/watch?v=CxK_nA2iVXw
>
> Heh. Well, at any rate, I think PostgreSQL style is to not use
> parentheses in that situation.
FWIW, I added a HASH_CHUNK_THRESHOLD macro specifying the threshold. I
also simplified the conditions a bit and removed some of the parentheses.
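As a sketch (the surrounding code is assumed, and the macro value
matches the chunk sizing described further down):

    /* tuples above this size get their own, exactly-sized chunk */
    #define HASH_CHUNK_THRESHOLD    (HASH_CHUNK_SIZE / 4)

    /* before: if (size > (HASH_CHUNK_SIZE/8)) */
    if (size > HASH_CHUNK_THRESHOLD)
    {
        /* allocate a dedicated chunk for this oversized tuple */
    }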
>
>>> +char * chunk_alloc(HashJoinTable hashtable, int size) {
>>>
>>> Eh... no.
>>>
>>> + /* XXX maybe we should use MAXALIGN(size) here ... */
>>>
>>> +1.
>>
>> I'm not sure what the 'no' is pointing at. Maybe that the parenthesis
>> should be on the next line? And is the +1 about doing MAXALIGN?
>
> The "no" is about the fact that you have the return type, the function
> name, and the opening brace on one line instead of three lines as is
> project style. The +1 is for applying MAXALIGN to the size.
OK, fixed.
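The signature now looks roughly like this (a hedged sketch: the body is
trimmed to the parts discussed here, and the chunk bookkeeping fields
are assumptions):

    static char *
    dense_alloc(HashJoinTable hashtable, int size)
    {
        MemoryChunk chunk = hashtable->chunks;  /* assumed field name */
        char       *ptr;

        /* keep all allocations MAXALIGNed, per the review */
        size = MAXALIGN(size);

        /* (oversized-tuple and new-chunk handling elided) */

        ptr = chunk->data + chunk->used;
        chunk->used += size;
        return ptr;
    }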
I also did a few 'minor' changes to the dense allocation patch, most
notably:
* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData
The original naming seemed a bit awkward.
* renamed the function to 'dense_alloc' (instead of 'chunk_alloc')
Seems like a better fit to what the function does.
* the chunk size is 32kB (instead of 16kB), and we use a 1/4 threshold
for 'oversized' items (a sketch of the arithmetic follows this list)
We need the threshold to be >=8kB, to trigger the special case
within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.
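To spell the arithmetic out (the AllocSet constants are from aset.c;
the hash join macros just restate the values above):

    /* hash join side: 32kB chunks, oversized threshold at 1/4 */
    #define HASH_CHUNK_SIZE         (32 * 1024L)
    #define HASH_CHUNK_THRESHOLD    (HASH_CHUNK_SIZE / 4)   /* = 8kB */

    /*
     * aset.c: requests exceeding allocChunkLimit get a dedicated
     * block, which is returned to malloc at context reset -- the
     * special case we want oversized tuples to hit.
     */
    #define ALLOC_CHUNK_LIMIT       8192    /* 8kB */
    #define ALLOC_CHUNK_FRACTION    4       /* limit <= maxBlockSize/4 */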
I also considered Heikki's suggestion that while rebatching, we can
reuse chunks containing a single large tuple instead of copying it
needlessly. My attempt however made the code much uglier (I almost used
a GOTO, but my hands refused to type it ...). I'll look into that.
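For reference, inside the rebatching loop the idea would look roughly
like this (purely hypothetical sketch; a per-chunk ntuples counter is
an assumption):

    /* walking the old chunk list in ExecHashIncreaseNumBatches() */
    if (oldchunk->ntuples == 1 && batchno == curbatch)
    {
        /*
         * The chunk holds a single tuple that stays in memory, so
         * relink the whole chunk instead of copying the tuple.
         */
        oldchunk->next = hashtable->chunks;
        hashtable->chunks = oldchunk;
        continue;
    }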
Tomas
Attachments:
  dense-alloc-v5.patch          text/x-diff     8.1 KB
  hashjoin-nbuckets-v12.patch   text/x-diff    14.9 KB