Re: Hybrid Hash/Nested Loop joins and caching results from subplans

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Hybrid Hash/Nested Loop joins and caching results from subplans
Date: 2020-11-27 00:10:43
Message-ID: CAApHDvrSUUSGk_bdKsnLn34OTn=-WseySazeMK7i26zikrsoog@mail.gmail.com
Lists: pgsql-hackers

Thanks for having another look at this.

> On Sun, Nov 22, 2020 at 9:21 PM Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com> wrote:
> add 2 more comments.
>
> 1. I'd suggest adding Assert(false); in RC_END_OF_SCAN case to make the error clearer.
>
> case RC_END_OF_SCAN:
> /*
> * We've already returned NULL for this scan, but just in case
> * something calls us again by mistake.
> */
> return NULL;

This just took some inspiration from nodeMaterial.c where it says:

/*
* If necessary, try to fetch another row from the subplan.
*
* Note: the eof_underlying state variable exists to short-circuit further
* subplan calls. It's not optional, unfortunately, because some plan
* node types are not robust about being called again when they've already
* returned NULL.
*/

I'm not feeling a pressing need to put an Assert(false); in there, as
it's not what nodeMaterial.c does. nodeMaterial is nodeResultCache's
sister node, which can also be seen below Nested Loops.
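
For illustration, the pattern both nodes rely on amounts to something
like this (a simplified sketch using nodeMaterial's eof_underlying
field and ignoring the tuplestore handling, so not the exact code):

if (node->eof_underlying)
    return NULL;        /* subplan is exhausted; don't call it again */

outerslot = ExecProcNode(outerPlanState(node));
if (TupIsNull(outerslot))
{
    /* Remember EOF so repeat calls stay harmless */
    node->eof_underlying = true;
    return NULL;
}

return outerslot;

RC_END_OF_SCAN serves the same purpose on the Result Cache side: a
repeat call just gets NULL back rather than tripping an Assert.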

> 2. Currently we handle the (!cache_store_tuple(node, outerslot)) case by setting it
> to RC_CACHE_BYPASS_MODE. The only reason for the cache_store_tuple failure is
> that we can't cache_reduce_memory. I guess if cache_reduce_memory
> failed once, it would not succeed later (no more tuples can be stored,
> nothing has changed). So I think we can record this state and avoid any new
> cache_reduce_memory calls.
>
> /*
> * If we failed to create the entry or failed to store the
> * tuple in the entry, then go into bypass mode.
> */
> if (unlikely(entry == NULL ||
> !cache_store_tuple(node, outerslot)))
>
> to
>
> if (unlikely(entry == NULL || node->memory_cant_be_reduced ||
> !cache_store_tuple(node, outerslot)))

The reason for RC_CACHE_BYPASS_MODE is to handle a single set of
parameters that has so many results that they, alone, don't fit in
the cache. We call cache_reduce_memory() whenever we go over our
memory budget. That function returns false if it was unable to free
enough memory without removing the "specialkey", which in this case is
the current cache entry that's being populated. Later, when we're
caching some entry that isn't quite so large, we still want to be able
to cache that. In that case, we'll have removed the remnants of the
overly large entry that didn't fit, to make way for newer and,
hopefully, smaller entries. No problem. I'm not sure why there's a
need for another flag here.
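
To make that concrete, here's a simplified sketch of what
cache_reduce_memory() does. The field and helper names are
illustrative, not necessarily what's in the patch:

static bool
cache_reduce_memory(ResultCacheState *rcstate, ResultCacheEntry *specialkey)
{
    bool        specialkey_intact = true;
    dlist_mutable_iter iter;

    /* Walk the list starting at the least recently used end */
    dlist_foreach_modify(iter, &rcstate->lru_list)
    {
        ResultCacheEntry *entry = dlist_container(ResultCacheEntry,
                                                  lru_node, iter.cur);

        /*
         * If we end up having to evict the entry we're currently
         * populating, remember that.  The caller treats it as a cache
         * overflow and goes into bypass mode for this scan.
         */
        if (entry == specialkey)
            specialkey_intact = false;

        remove_cache_entry(rcstate, entry);     /* evict this one */

        if (rcstate->mem_used <= rcstate->mem_limit)
            break;                              /* back under budget */
    }

    return specialkey_intact;
}

The only part that matters for this discussion is the specialkey
check: once the oversized entry is gone, later entries that do fit can
be cached as normal, so no extra "can't reduce memory" flag is needed.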

A bit more background.

When caching a new entry, or finding an existing entry, we move that
entry to the top of the MRU dlist. When adding entries, or adding
tuples to existing entries, if we've gone over the memory budget, then
we remove cache entries from the MRU list starting at the tail (least
recently used). If we begin caching tuples for an entry and need to
free some space, then since we've put the current entry at the top of
the MRU list, it'll be the last one to be removed. However, it's still
possible that we run through the entire MRU list and end up at the
most recently used item. So the entry we're populating can also be
removed if freeing everything else was still not enough to give us
enough free memory. The code refers to this as a cache overflow. It
causes the state machine to move into RC_CACHE_BYPASS_MODE. We'll just
read tuples directly from the subnode in that case; there's no need to
attempt to cache them, as they're not going to fit. We come out of
RC_CACHE_BYPASS_MODE when doing the next rescan with a different set
of parameters. That's our chance to try caching things again, and the
code does that.
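
Sketching that with made-up names (I'm only illustrating the idea, not
quoting the patch), the rescan path just flips the state machine back
out of bypass mode:

void
ExecReScanResultCache(ResultCacheState *node)
{
    /* New parameters: leave RC_CACHE_BYPASS_MODE and probe the cache again */
    node->rc_status = RC_CACHE_LOOKUP;      /* state name is illustrative */

    /* ... other per-rescan resets ... */
}
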
There might be far fewer tuples for the next parameter we're scanning
for, or those tuples might be narrower, so it makes sense to give
caching them another try. Perhaps there's some point where we should
give up doing that, but given good statistics, it's unlikely the
planner would have thought a result cache was worth the trouble; it
would likely have picked some other way to execute the plan. The
planner does estimate the average size of a cache entry and calculates
how many of those fit into hash_mem. If that number is too low then
Result Caching the inner side won't be too appealing. Of course,
calculating the average does not mean there are no outliers. We'll
deal with the large side of the outliers with the bypass code.
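
In other words, the costing does roughly the following
back-of-the-envelope arithmetic (variable names are just for
illustration, not the actual costing code):

/* Illustrative only -- not the actual costing code */
est_entry_bytes   = est_tuples_per_param * est_tuple_width;  /* avg size of one cache entry */
est_cache_entries = floor(hash_mem_bytes / est_entry_bytes); /* entries that fit in hash_mem */

if (est_cache_entries < est_distinct_params)
{
    /*
     * We expect evictions.  The fewer entries that fit, the lower the
     * expected hit ratio and the less appealing a Result Cache becomes.
     */
}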

I currently don't really see what needs to be changed about that.

David
