Quick Links

Re: PoC/WIP: Extended statistics on expressions

From:	Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc:	Justin Pryzby <pryzby(at)telsasoft(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC/WIP: Extended statistics on expressions
Date:	2021-03-17 20:48:51
Message-ID:	CAEZATCXz2WO9ZctRL68_HrsPbpre0GRmip83TgznoeLoWm4=Dg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, 17 Mar 2021 at 19:07, Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 3/17/21 7:54 PM, Dean Rasheed wrote:
> >
> > it might have been better to estimate the first case as
> >
> > ndistinct((a+b)) * ndistinct(c) * ndistinct(d)
> >
> > and the second case as
> >
> > ndistinct((a+b)) * ndistinct((c+d))
>
> OK. I might be confused, but isn't that what the algorithm currently
> does? Or am I just confused about what the first/second case refers to?
>

No, it currently estimates the first case as ndistinct((a+b),c) *
ndistinct(d). Having said that, maybe that's OK after all. It at least
makes an effort to account for any correlation between (a+b) and
(c+d), using the known correlation between (a+b) and c. For reference,
here is the test case I was using (which isn't really very good for
catching dependence between columns):

DROP TABLE IF EXISTS foo;
CREATE TABLE foo (a int, b int, c int, d int);
INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,100000) x;
SELECT COUNT(DISTINCT a) FROM foo; -- 10
SELECT COUNT(DISTINCT b) FROM foo; -- 11
SELECT COUNT(DISTINCT c) FROM foo; -- 12
SELECT COUNT(DISTINCT d) FROM foo; -- 13
SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20
SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24
SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228
SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478

-- First case: stats on [(a+b),c]
CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
-- Estimate = 2964, Actual = 478
-- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13

-- Second case: stats on (c+d) as well
CREATE STATISTICS s2 ON (c+d) FROM foo;
ANALYSE foo;
EXPLAIN ANALYSE
SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d);
-- Estimate = 480, Actual = 478
-- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24

I think that's probably pretty reasonable behaviour, given incomplete
stats (the estimate with no extended stats is capped at 10000).

Regards,
Dean

In response to

Re: PoC/WIP: Extended statistics on expressions at 2021-03-17 19:07:22 from Tomas Vondra

Responses

Re: PoC/WIP: Extended statistics on expressions at 2021-03-17 20:58:22 from Dean Rasheed

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Simon Riggs	2021-03-17 20:50:27	Re: VACUUM (DISABLE_PAGE_SKIPPING on)
Previous Message	Robert Haas	2021-03-17 20:04:44	Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?