Re: PoC: using sampling to estimate joins / complex conditions

From:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: PoC: using sampling to estimate joins / complex conditions
Date:	2022-03-22 00:17:26
Message-ID:	025bbe78-0492-1a5a-76d8-e5d06581ac16@enterprisedb.com
Lists:	pgsql-hackers

On 3/22/22 00:35, Andres Freund wrote:
> Hi,
>
> On 2022-01-21 01:06:37 +0100, Tomas Vondra wrote:
>> Yeah, I haven't updated some of the test output because some of those
>> changes are a bit wrong (and I think that's fine for a PoC patch). I
>> should have mentioned that in the message, though. Sorry about that.
>
> Given that the patch hasn't been updated since January and that it's a PoC in
> the final CF, it seems like it should at least be moved to the next CF? Or
> perhaps returned?
>
> I've just marked it as waiting-on-author for now - iirc that leads to fewer
> reruns by cfbot once it's failing...
>

Either option works for me.

>
>> 2) The correlated samples are currently built using a query, executed
>> through SPI in a loop. So given a "driving" sample of 30k rows, we do
>> 30k lookups - that'll take time, even if we do that just once and cache
>> the results.
>
> Ugh, yea, that's going to increase overhead by at least a few factors.
>
>
>> I'm sure there's room for some improvement, though - for example
>> we don't need to fetch all columns included in the statistics object,
>> but just stuff referenced by the clauses we're estimating. That could
>> improve chance of using IOS etc.
>
> Yea. Even just avoiding SPI / planner + executor seems likely to be a
> big win.
>
>
> It seems one more of the cases where we really need logic to recognize "cheap"
> vs "expensive" plans, so that we only do sampling when useful. I don't think
> that's solved just by having a declarative syntax.
>

Right.

I was thinking about walking the first table, collecting all the values,
and then doing a single IN () query for the second table - a bit like a
custom join (which seems a bit terrifying, TBH).
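
To make the difference concrete, here is a rough sketch of the two access
patterns, written in Python against an in-memory SQLite database rather than
as backend code. It is not what the patch does; the tables (t1, t2), the
columns (id, a_id, val), the function names and the batch size are all made up
for illustration. The first function does one lookup per driving-sample row,
the second collects the distinct values, fetches the matches with IN () lists,
and then joins them back to the sample by hand.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    """Create a tiny hypothetical schema: t1 is the driving table, t2 the joined one."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE t2 (a_id INTEGER, val INTEGER)")
    conn.execute("CREATE INDEX t2_a_id ON t2 (a_id)")
    conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(1000)])
    conn.executemany("INSERT INTO t2 VALUES (?, ?)",
                     [(i % 1000, i) for i in range(5000)])
    return conn


def sample_per_row(conn, driving_keys):
    """One query per sampled row: N sampled rows means N separate lookups."""
    rows = []
    for key in driving_keys:
        rows.extend(conn.execute(
            "SELECT a_id, val FROM t2 WHERE a_id = ?", (key,)).fetchall())
    return rows


def sample_batched(conn, driving_keys, batch_size=500):
    """Collect distinct keys, fetch matches with IN (...) lists, join by hand."""
    distinct = sorted(set(driving_keys))
    matches = defaultdict(list)
    for start in range(0, len(distinct), batch_size):
        chunk = distinct[start:start + batch_size]
        placeholders = ",".join("?" * len(chunk))
        query = f"SELECT a_id, val FROM t2 WHERE a_id IN ({placeholders})"
        for a_id, val in conn.execute(query, chunk):
            matches[a_id].append((a_id, val))
    # The "custom join" part: re-expand per driving row, so a key that was
    # sampled twice contributes its matching rows twice, as in the loop above.
    rows = []
    for key in driving_keys:
        rows.extend(matches.get(key, []))
    return rows
```

Both functions should return the same multiset of rows; the batched one just
issues far fewer queries. The chunking is only there because most systems cap
the number of parameters in a statement, and the limit of 500 is arbitrary.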

But even if we manage to make this much cheaper, there will still be
simple queries where it's going to be prohibitively expensive.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: PoC: using sampling to estimate joins / complex conditions at 2022-03-21 23:35:41 from Andres Freund
