From: | James Coleman <jtc331(at)gmail(dot)com> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
Cc: | David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays |
Date: | 2020-04-26 18:46:19 |
Message-ID: | CAAaqYe9J1fn2jUYF395XZiEg3Bn5ADCcXx8c9+1H+nxm0q6wHA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowleyml(at)gmail(dot)com> wrote:
> >>
> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >> > This reminds me our attempts to add bloom filters to hash joins, which I
> >> > think ran into mostly the same challenge of deciding when the bloom
> >> > filter can be useful and is worth the extra work.
> >>
> >> Speaking of that, it would be interesting to see how a test where you
> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> be interesting to know if the planner is able to make a more suitable
> >> choice and also to see how all the work over the years to improve Hash
> >> Joins compares to the bsearch with and without the bloom filter.
> >
> >It would be interesting.
> >
> >It also makes one wonder about optimizing these into to hash
> >joins...which I'd thought about over at [1]. I think it'd be a very
> >significant effort though.
> >
>
> I modified the script to also do the join version of the query. I can
> only run it on my laptop at the moment, so the results may be a bit
> different from those I shared before, but it's interesting I think.
>
> In most cases it's comparable to the binsearch/bloom approach, and in
> some cases it actually beats them quite significantly. It seems to
> depend on how expensive the comparison is - for "int" the comparison is
> very cheap and there's almost no difference. For "text" the comparison
> is much more expensive, and there are significant speedups.
>
> For example the test with 100k lookups in array of 10k elements and 10%
> match probability, the timings are these
>
> master: 62362 ms
> binsearch: 201 ms
> bloom: 65 ms
> hashjoin: 36 ms
>
> I do think the explanation is fairly simple - the bloom filter
> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> some overhead to build and check the bits. The hash join probably
> eliminates a lot of the remaining comparisons, because the hash table
> is sized to have one tuple per bucket.
>
> Note: I also don't claim the PoC has the most efficient bloom filter
> implementation possible. I'm sure it could be made faster.
>
> Anyway, I'm not sure transforming this to a hash join is worth the
> effort - I agree that seems quite complex. But perhaps this suggest we
> should not be doing binary search and instead just build a simple hash
> table - that seems much simpler, and it'll probably give us about the
> same benefits.
That's actually what I originally thought about doing, but I chose
binary search since it seemed a lot easier to get off the ground.
If we instead build a hash is there anything else we need to be
concerned about? For example, work mem? I suppose for the binary
search we already have to expand the array, so perhaps it's not all
that meaningful relative to that...
I was looking earlier at what our standard hash implementation was,
and it seemed less obvious what was needed to set that up (so binary
search seemed a faster proof of concept). If you happen to have any
pointers to similar usages I should look at, please let me know.
James
From | Date | Subject | |
---|---|---|---|
Next Message | Jonathan S. Katz | 2020-04-26 19:21:38 | Re: Poll: are people okay with function/operator table redesign? |
Previous Message | Justin Pryzby | 2020-04-26 17:56:14 | Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly |