From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Alexandr Nikulin <alexandr(dot)s(dot)nikulin(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improve the performance of nested loop join in the case of partitioned inner table
Date: 2023-07-04 12:02:06
Message-ID: CAApHDvpFKU6C3LNJZMF1cH6+-Ataf3oPBX3Dy7JH-q7kHb5EDA@mail.gmail.com
Lists: pgsql-hackers
On Thu, 13 Apr 2023 at 03:00, Alexandr Nikulin
<alexandr(dot)s(dot)nikulin(at)gmail(dot)com> wrote:
> explain analyze select * from ids join test_part on ids.id=test_part.id where ascii(ids.name)=ascii('best case');
> explain analyze select * from ids join test_part on ids.id=test_part.id where ascii(ids.name)=ascii('worst case');
>
> The average results on my machine are as follows:
>
>            | vanilla postgres | patched postgres
> best case  | 2286 ms          | 1924 ms
> worst case | 2278 ms          | 2360 ms
>
> So far I haven't refactored the patch as per David's advice. I first want to understand whether we need such an optimization.
My thoughts are that the worst-case numbers are not exactly great. In
the average case, I'd expect it to be much more likely that the
parameter value from the nested loop's next outer row differs from the
previous one than that it matches.
Roughly, your numbers show a 20% speedup in the best case and a 4%
slowdown in the worst case, so for us to break even with the patch as
it is, the parameter value would have to repeat around 1 time in 5.
That does not seem like a good bet given we're likely working with
data types that allow billions of distinct values.
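Just to spell out that break-even arithmetic (back-of-the-envelope,
using the rounded 20%/4% figures above): if p is the fraction of outer
rows whose parameter value repeats, the patch pays off once the
expected saving outweighs the expected cost:

    0.20 * p > 0.04 * (1 - p)  =>  p > 0.04 / 0.24 ~= 1 in 6

so the repeat rate has to get up into that 1-in-5-or-6 region before
the patch is a win.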
I think if you really wanted to make this work, then you'd need to get
the planner on board with deciding whether to apply this based on the
n_distinct estimates from the outer side of the join. Either that or
some heuristic in the executor that tries for a while and gives up if
the parameter value changes too often (see the sketch below).
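To show what I mean by that executor heuristic, here's a minimal,
self-contained sketch. None of this is existing executor code: the
struct, the threshold and the int-typed parameter are all made up for
illustration.

    /*
     * Minimal sketch of the "try for a while, give up" heuristic.  All
     * names here are hypothetical rather than existing executor fields,
     * and the threshold is made up for illustration.
     */
    #include <stdbool.h>

    #define PRUNE_GIVEUP_CHANGES 8		/* param changes before giving up */

    typedef struct PruneHeuristic
    {
    	int		last_param;		/* last parameter value seen */
    	bool	have_last;		/* has last_param been set yet? */
    	int		change_count;	/* times the value has changed so far */
    	bool	gave_up;		/* stop trying to reuse pruning results? */
    } PruneHeuristic;

    /*
     * Return true if the pruning result cached for the previous outer
     * row can be reused for 'param'; false means pruning must be redone.
     */
    static bool
    prune_cache_usable(PruneHeuristic *h, int param)
    {
    	if (h->gave_up)
    		return false;		/* permanently fall back to re-pruning */

    	if (h->have_last && h->last_param == param)
    		return true;		/* same value as last outer row: reuse */

    	/* Value changed (or first call): count changes, maybe give up. */
    	if (h->have_last && ++h->change_count >= PRUNE_GIVEUP_CHANGES)
    		h->gave_up = true;

    	h->last_param = param;
    	h->have_last = true;
    	return false;
    }

The point is just that once the value has changed often enough, each
call degenerates to a single branch on gave_up, so the worst case pays
almost nothing.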
Some code was added in 3592e0ff9 that takes a heuristic approach to
this problem: the optimisation is only enabled once we hit the same
partition at least 16 times, and it's switched off again as soon as
the datum no longer matches the cached partition. I'm not quite sure
how the same idea could be made to work here, though. In the 3592e0ff9
case, a tuple belongs to exactly one partition, so we can very cheaply
check whether that partition is the same as the last one by comparing
the partition indexes. Here, since we're running a query, many
partitions can remain after partition pruning runs, and checking that
one large set of partitions matches another large set is never going
to be as cheap as checking that two partitions match. Bitmapsets can
help, but they'll just never be as fast as checking that two ints
match.
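To make that cost difference concrete, here's roughly what the two
checks look like side by side. bms_equal() is the real Bitmapset
comparison from nodes/bitmapset.h; the function and variable names are
just illustrative.

    #include "postgres.h"
    #include "nodes/bitmapset.h"

    /* 3592e0ff9-style check: a tuple maps to exactly one partition, so
     * testing "same as last time?" is a single int comparison. */
    static bool
    same_partition(int partidx, int last_partidx)
    {
    	return partidx == last_partidx;
    }

    /* Pruning-result check: a parameter value maps to a whole set of
     * surviving partitions, so the test has to compare two sets.
     * bms_equal() walks the words of both Bitmapsets -- cheap, but a
     * loop plus memory accesses rather than one register comparison. */
    static bool
    same_live_partitions(const Bitmapset *live_parts,
    					 const Bitmapset *last_live_parts)
    {
    	return bms_equal(live_parts, last_live_parts);
    }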
In short, I think you're going to have to come up with something very
crafty here to reduce the worst-case overhead. Whatever it is will
need to be neat and self-contained, perhaps in execPartition.c. It
just does not seem good to have logic related to partition pruning
inside nodeNestloop.c.
I'm going to mark this as waiting on author in the CF app. It might be
better if you withdraw it and resubmit when you have a patch that
addresses the worst-case regression issue.
David