Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Ilia Evdokimov <ilya(dot)evdokimov(at)tantorlabs(dot)com>, Lukas Fittl <lukas(at)fittl(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
Date: 2025-03-21 09:02:36
Message-ID: ff747c1d-1925-45ca-9c47-2763be45cba0@gmail.com
Lists: pgsql-hackers

On 21/3/2025 03:50, David Rowley wrote:
> On Fri, 21 Mar 2025 at 07:54, Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
>> I have some doubts here.
>> The number of distinct values says something only when taken together
>> with the number of calls.
>
> Couldn't the reader just look at the Nested Loop's outer side row
> estimate for that?
In my cases, the key sources are usually spread across dozens of joins,
and it is visually hard (especially with EXPLAIN ANALYSE VERBOSE output)
to locate the join operator and extract the number of calls from it. The
hit ratio, meanwhile, can be analysed locally at the Memoize node. For
example, 80% (0.8) is evidently good, 40% is questionable, and 5% is too
low, so Memoize should be avoided there.
Could that be addressed by just printing the "calls" number in the
Memoize output?
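Just to illustrate the idea (a sketch only; this syntax does not exist,
and the table, column and ratio values are invented):

    EXPLAIN (COSTS OFF)
    SELECT * FROM t1 JOIN t2 ON t2.b = t1.a;
    -- Hypothetical plan if the estimated ratios were printed:
    --  Nested Loop
    --    ->  Seq Scan on t1
    --    ->  Memoize
    --          Cache Key: t1.a
    --          Estimated Hit Ratio: 0.80  Estimated Evict Ratio: 0.00
    --          ->  Index Scan using t2_b_idx on t2
    --                Index Cond: (b = t1.a)

Here a reader only needs the Memoize node itself to judge whether the
cache is worthwhile, without hunting for the outer side's row estimate.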
>
>> Frequently, one of the caching keys is from outer table A (10 tuples),
>> and another is from outer table B (100 tuples). With such a composite
>> cache key, it may not be obvious how to derive the number of successful
>> cache fetches the planner predicted.
>>
>> What I may propose here is:
>> 1. Use fraction of calls, for example - 50% duplicated key values.
>> 2. Show the calculated hit and eviction ratio.
>
> I think the primary factors in how useful Memoize is are: 1) How many
> items do we expect to be able to store in the cache concurrently, and;
> 2) How many unique lookup keys do we expect to be looked up, and; 3)
> The total number of expected lookups. #1 is quite difficult to
> figure out (maybe by looking at row width and row estimates) and
> there's just no information about #2. #3 is already shown, in the
> Nested Loop's outer side.
It depends on the task. If you are asking how precise the group
estimation was (to check the statistics), I agree. In the cases I have
seen, the main question is how effective a Memoize node was (or may
be), i.e. how often the incoming key hits the cache. For that, the hit
ratio as a fraction is more understandable for a broad audience.
In my experience, given a good cache reusability factor, users are
usually okay with increasing the cache size to the necessary numbers
and avoiding evictions at all costs. So the predicted evict_ratio also
tells us when incrementing work_mem would enhance the chances of
memoisation.
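For comparison, EXPLAIN ANALYZE already reports the actual cache
behaviour of a Memoize node, and a non-zero Evictions counter there is
exactly the signal for raising work_mem (the query and numbers below
are invented for illustration):

    EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
    SELECT * FROM t1 JOIN t2 ON t2.b = t1.a;
    --  ->  Memoize (actual rows=10 loops=1000)
    --        Cache Key: t1.a
    --        Hits: 600  Misses: 400  Evictions: 300  Overflows: 0  Memory Usage: 4097kB
    SET work_mem = '64MB';  -- enlarge the cache so fewer entries get evicted

Showing the predicted evict_ratio would let the user make that call
before running the query, not after.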
Having written that last sentence, I come back to the question of why
work_mem is so universal and is used at each node as the criterion for
the memory allocation size. But that is a different story, I think.

--
regards, Andrei Lepikhov
