From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, John Naylor <jcnaylor(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MCV lists for highly skewed distributions
Date: 2018-03-16 15:26:51
Message-ID: 8a524773-3299-fe7b-de8b-bc977de97bec@2ndquadrant.com
Lists: pgsql-hackers
On 03/11/2018 10:23 AM, Dean Rasheed wrote:
> ...
>
> I'm moving this back to a status of "Needs review" partly because the
> code has changed significantly, but also because I want to do more
> testing, particularly with larger datasets.
>
Thanks for working on this. The code seems fine to me, although it might
be a good idea to add comments explaining why analyze_mcv_list() starts
with the full MCV list and then removes items (essentially an explanation
of why Jeff's original idea would not work so well).
Actually, one question - when deciding whether to keep an item in the MCV
list, analyze_mcv_list only compares its frequency with the average of the
rest. But as we remove items from the MCV list, the average frequency of
the non-MCV items grows (we're removing items with higher and higher
frequencies). That means the estimates for the least common items will get
higher and higher - essentially overestimates. So, couldn't/shouldn't
analyze_mcv_list take this into account too?
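To illustrate that with a toy example (the frequencies below are made up,
they don't come from the patch or from any real data set): say a column has
100 distinct values and the five most common ones have frequencies 0.20,
0.10, 0.05, 0.02 and 0.01. This query shows how the average frequency
assumed for the non-MCV values grows as the MCV list gets trimmed from the
tail:

    -- toy example, made-up frequencies: each removed item adds more than the
    -- current average to the non-MCV mass, so the per-value average used for
    -- the least common items keeps increasing
    WITH mcv(rn, freq) AS (
        VALUES (1, 0.20), (2, 0.10), (3, 0.05), (4, 0.02), (5, 0.01)
    )
    SELECT k AS mcv_items_kept,
           round((1 - coalesce((SELECT sum(freq) FROM mcv WHERE rn <= k), 0))
                 / (100 - k), 5) AS avg_non_mcv_frequency
    FROM generate_series(5, 0, -1) AS k;

With all five items kept the average is about 0.0065, with the list trimmed
to nothing it's 0.01 - so the least common values end up with noticeably
higher estimates.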
I've also done a bunch of testing regarding join cardinality estimates,
because eqjoinsel_inner() is one of the places using MCV lists too. So
I've generated tables with different sizes and data distributions, and
observed how the patch affects the join estimates.
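For reference, the comparison is essentially about this kind of thing (t1/t2
and the column name "a" are just placeholders here, not the actual schema
used by the scripts):

    -- hypothetical example, placeholder table/column names
    EXPLAIN (ANALYZE, TIMING OFF)
    SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;
    -- "rows=..." on the join node is the planner's estimate (which is where
    -- eqjoinsel_inner() and the MCV lists come in), "actual ... rows=..." is
    -- the real cardinality - the benchmark compares the two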
The scripts and results are available here:
https://github.com/tvondra/join-estimates-tests
The tables were not particularly huge at this point (1000 to 100k rows).
I've also varied the number of distinct values (100 - 10k), the statistics
target (10 - 10k) and the data distribution, independently for each table;
a quick query to check the resulting skew follows the list:
1) uniform
insert into t1 select random() * $ndistinct1, i
from generate_series(1,$nrows1) s(i)
2) skewed
insert into t2 select pow(random(),2) * $ndistinct2, i
from generate_series(1,$nrows2) s(i)
3) skewed-inverse
insert into t2 select (1 - pow(random(),2)) * $ndistinct2, i
from generate_series(1,$nrows2) s(i)
4) skewed-strong
insert into t2 select pow(random(),4) * $ndistinct2, i
from generate_series(1,$nrows2) s(i)
5) skewed-strong-inverse
insert into t2 select (1 - pow(random(),4)) * $ndistinct2, i
from generate_series(1,$nrows2) s(i)
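To eyeball the skew these queries actually produce, something like this can
be used (the column name "a" is an assumption - the snippets above don't
show the actual column names):

    -- quick sanity check of the generated distribution (column name assumed)
    SELECT a, count(*) AS cnt
      FROM t2
     GROUP BY a
     ORDER BY cnt DESC
     LIMIT 10;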
The results are a bit too large to attach here - see for example
https://github.com/tvondra/join-estimates-tests/blob/master/bench/summary.ods.
Looking at the Mean Relative Error, i.e. essentially
MRE = AVERAGE(MAX(estimate/actual, actual/estimate))
over the 20 runs for each combination of parameters, the "average" sheet
shows
(MRE for patched) / (MRE for master)
and the patch seems to clearly improve accuracy in this case. There are
a few cases where the estimate gets slightly worse (say, by 10%)
compared to current master. So I think that's nice.
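For completeness, a sketch of how such an MRE could be computed directly in
SQL, assuming a hypothetical "results" table with one row per run and
columns for the run parameters and the estimated/actual join cardinalities
(this is not the actual schema used by the scripts in the repo):

    -- hypothetical results table, column names assumed
    SELECT nrows1, nrows2, ndistinct1, ndistinct2, stats_target, distribution,
           avg(greatest(estimated::numeric / actual,
                        actual::numeric / estimated)) AS mre
      FROM results
     WHERE estimated > 0 AND actual > 0   -- avoid division by zero
     GROUP BY nrows1, nrows2, ndistinct1, ndistinct2, stats_target, distribution;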
I'm open to running additional tests with other distributions, table
sizes etc. if needed.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services