Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2021-02-23 01:21:52
Message-ID: CAKU4AWrQHoeia2Y6T8x3xc5z8o9tU9ot2B9KKOZxva6LRC1=nQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 22, 2021 at 9:21 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:

> On Tue, Feb 16, 2021 at 11:15:51PM +1300, David Rowley wrote:
> > To summarise here, the planner performance gets a fair bit worse with
> > the patched code. With master, summing the average planning time over
> > each of the queries resulted in a total planning time of 765.7 ms.
> > After patching, that went up to 1097.5 ms. I was pretty disappointed
> > about that.
>
> I have a couple ideas;
>
> - default enable_resultcache=off seems okay. In plenty of cases, planning
> time is unimportant. This is the "low bar" - if we can do better and
> enable
> it by default, that's great.
>
> - Maybe this should be integrated into nestloop rather than being a
> separate
> plan node. That means that it could be dynamically enabled during
> execution, maybe after a few loops or after checking that there's at
> least
> some minimal number of repeated keys and cache hits. cost_nestloop
> would
> consider whether to use a result cache or not, and explain would show
> the
> cache stats as a part of nested loop.

+1 for this idea now.. I am always confused why there is no such node in
Oracle
even if it is so aggressive to do performance improvement and this function
looks very promising. After realizing the costs in planner, I think
planning time
might be an answer (BTW, I am still not sure Oracle did this).

In this case, I propose there'd still
> be a GUC to disable it.
>
> - Maybe cost_resultcache() can be split into initial_cost and final_cost
> parts, same as for nestloop ? I'm not sure how it'd work, since
> initial_cost is supposed to return a lower bound, and resultcache tries
> to
> make things cheaper. initial_cost would just add some operator/tuple
> costs
> to make sure that resultcache of a unique scan is more expensive than
> nestloop alone. estimate_num_groups is at least O(n) WRT
> rcpath->param_exprs, so maybe you charge 100*list_length(param_exprs) *
> cpu_operator_cost in initial_cost and then call estimate_num_groups in
> final_cost. We'd be estimating the cost of estimating the cost...
>
> - Maybe an initial implementation of this would only add a result cache
> if the
> best plan was already going to use a nested loop, even though a cached
> nested loop might be cheaper than other plans. This would avoid most
> planner costs, and give improved performance at execution time, but
> leaves
> something "on the table" for the future.
>
> > +cost_resultcache_rescan(PlannerInfo *root, ResultCachePath *rcpath,
> > + Cost *rescan_startup_cost, Cost *rescan_total_cost)
> > +{
> > + double tuples = rcpath->subpath->rows;
> > + double calls = rcpath->calls;
> ...
> > + /* estimate on the distinct number of parameter values */
> > + ndistinct = estimate_num_groups(root, rcpath->param_exprs, calls,
> NULL,
> > + &estinfo);
>
> Shouldn't this pass "tuples" and not "calls" ?
>
> --
> Justin
>

--
Best Regards
Andy Fan (https://www.aliyun.com/)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2021-02-23 01:23:53 Re: 64-bit XIDs in deleted nbtree pages
Previous Message Andres Freund 2021-02-23 01:19:37 Re: Fallback table AM for relkinds without storage