From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Merging statistics from children instead of re-sampling everything
Date: 2021-03-29 15:54:25
Message-ID: c3162511-0e80-039e-765c-1d02d0fd0bab@enterprisedb.com
Lists: pgsql-hackers
Hi,
While reviewing the thread about issues with auto-analyze on partitioned
tables [1] I remembered that the way we build statistics on the parent
is somewhat expensive, because it samples rows from the children again.
It's true we sample a much smaller number of rows from each partition
(proportional to its size), so it's not as expensive as simply running
ANALYZE on all partitions individually. But it's not free either, and in
some cases (e.g. with partitioning by time) it may touch old data that
is not touched by regular queries, and thus is likely not cached. That's
a bit against the idea of using partitioning to "segregate" old data.
One reason why the ANALYZE costs are not a huge issue in practice is
that we're not actually triggering that - changes_since_analyze is not
updated for the parent, so autoanalyze does not realize it needs to do
anything, and we never rebuild the statistics :-( But as shown in [2],
that may lead to poor estimates (and bad plans) in cases when we rely on
the parent's stats (probably due to not expanding the list of children
early enough).
(Note: I wonder if we might simply not rely on the parent stats at all,
and always consult directly the children stats - but with many children
that's likely problematic/expensive, and it does have the same issues as
the merging.)
The other thread [1] attempts to fix that by incrementing the counters
for the parent too, but that makes this much worse. Firstly, with
multi-level partitioning we'd have to sample the children repeatedly,
essentially once for each level. Secondly, it's tricky to control when
exactly the counters are propagated to the parent, making the parent
analyzes more frequent. Not great.
Even if we do a great job in [1] and come up with smart heuristics,
there will always be cases where it does not work too well and we either
analyze too late or too often.
Note: Propagating changes_since_analyze is only part of the story, as it
does not do anything about stats after attach/detach of a partition.
This attempts to approach the problem from the other end - instead of
tightly controlling when to analyze the parent, it makes the analyze
much cheaper. That means we don't need to worry too much about doing the
analyze too often, and we can update the stats more aggressively.
So, how does it work? Well, it simply fetches the statistics for all
children and merges them together, which is fairly simple for most
statistics types.
1) stawidth, stanullfrac - Those are trivial to merge.
2) stacorrelation - Not sure about this one; I've used a weighted
average for now. Perhaps we could store the "internal" counters (sumX,
sumX2), which would allow us to calculate the regular estimate for the
parent.
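To make points 1 and 2 concrete, here is a rough Python sketch of merging stawidth, stanullfrac and stacorrelation as averages weighted by each child's row count (the patch itself is C; the function name and the exact reltuples weighting are illustrative assumptions, not taken from the patch):

```python
def merge_scalar_stats(children):
    """children: list of (reltuples, stawidth, stanullfrac, stacorrelation).

    Merge the scalar per-column statistics as averages weighted by each
    child's row count. For stacorrelation this is the "weighted average"
    approach, not an exact recomputation from sumX/sumX2.
    """
    total = sum(c[0] for c in children)
    width = sum(n * w for n, w, _, _ in children) / total
    nullfrac = sum(n * f for n, _, f, _ in children) / total
    corr = sum(n * c for n, _, _, c in children) / total
    return width, nullfrac, corr

# child A: 1000 rows, width 4, 50% NULLs, correlation 1.0
# child B: 3000 rows, width 8, no NULLs,  correlation 0.5
print(merge_scalar_stats([(1000, 4, 0.5, 1.0), (3000, 8, 0.0, 0.5)]))
# -> (7.0, 0.125, 0.625)
```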
3) stadistinct - This is quite problematic. We only have the per-child
estimates, and it's not clear if there's any overlap. For now I've just
summed them up, because that's safer / similar to what we do for gather
merge paths etc. Maybe we could improve this by estimating the overlap
somehow (e.g. from MCV lists / histograms). But honestly, I doubt
estimates based on a tiny sample of each child would be any better. I
suppose we could introduce a column option determining how to combine
ndistinct (similar to how we can override n_distinct itself).
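The "just sum it up" approach for stadistinct might look like this sketch (illustrative Python, not the patch's code; note that negative stadistinct values are PostgreSQL's convention for "minus the fraction of rows that are distinct" and have to be converted first):

```python
def child_ndistinct(stadistinct, reltuples):
    # negative stadistinct encodes ndistinct as -(fraction of row count)
    return -stadistinct * reltuples if stadistinct < 0 else stadistinct

def merge_ndistinct(children):
    """children: list of (reltuples, stadistinct) tuples.

    Sum the per-child estimates (assumes no overlap between children,
    so this is an upper bound), clamped to the total row count.
    """
    total_rows = sum(n for n, _ in children)
    total = sum(child_ndistinct(d, n) for n, d in children)
    return min(total, total_rows)

# child A: 1000 rows, 200 distinct; child B: 4000 rows, all unique (-1)
print(merge_ndistinct([(1000, 200), (4000, -1.0)]))
# -> 4200.0
```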
4) MCV - It's trivial to build a new "parent" MCV list, although it may
be too large (in which case we cut it at the statistics target and copy
the remaining entries to the histogram).
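Building the parent MCV list can be sketched as follows: rescale each child's frequencies back to (estimated) row counts, add them up per value, renormalize, and keep the top entries (Python sketch with illustrative names; the patch does this in C, and folding the cut-off remainder into the histogram is omitted here):

```python
from collections import defaultdict

def merge_mcv(children, target):
    """children: list of (reltuples, mcv), where mcv maps value ->
    frequency as a fraction of that child's rows. Returns the `target`
    most common values with frequencies relative to the combined rows.
    """
    total = sum(n for n, _ in children)
    counts = defaultdict(float)
    for n, mcv in children:
        for value, freq in mcv.items():
            counts[value] += freq * n       # back to estimated row counts
    merged = {v: c / total for v, c in counts.items()}
    # keep the `target` most common values; the remainder would be
    # copied into the histogram, as described above
    top = sorted(merged.items(), key=lambda kv: -kv[1])[:target]
    return dict(top)

a = (1000, {"x": 0.5, "y": 0.3})
b = (1000, {"y": 0.6, "z": 0.2})
print(merge_mcv([a, b], target=2))
# "y" ends up most common (0.45), "x" second (0.25); "z" is cut off
```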
5) histograms - A bit more complicated, because it requires dealing with
overlapping bins, so we may have to "cut" and combine them in some way.
If we assume that cutting a bin in K parts means each part has 1/K
tuples (no matter where exactly we cut it), then it's trivial and it
works just fine in practice. That's because with N children, each bin
actually represents 1.0/(target*N) of the tuples, so the errors are
quite limited.
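The "cut a bin in K parts, assume 1/K of its tuples in each part" approximation above amounts to treating each child histogram as a piecewise-linear CDF weighted by the child's row count, and reading the parent's bounds off the combined CDF at each 1/target quantile. A Python sketch of that idea (illustrative names, numeric bisection instead of whatever the patch does in C):

```python
import bisect

def cdf(bounds, x):
    """Fraction of one child's rows <= x, assuming rows are spread
    uniformly within each histogram bin."""
    if x <= bounds[0]:
        return 0.0
    if x >= bounds[-1]:
        return 1.0
    i = bisect.bisect_right(bounds, x) - 1      # bin containing x
    nbins = len(bounds) - 1
    frac = (x - bounds[i]) / (bounds[i + 1] - bounds[i])
    return (i + frac) / nbins

def merge_histograms(children, target):
    """children: list of (reltuples, bounds). Returns target+1 bounds
    for the parent, one per 1/target quantile of the combined CDF."""
    total = sum(n for n, _ in children)
    def combined(x):
        return sum(n * cdf(b, x) for n, b in children) / total
    lo = min(b[0] for _, b in children)
    hi = max(b[-1] for _, b in children)
    out = [lo]
    for k in range(1, target):
        q, a, b = k / target, lo, hi
        for _ in range(60):                     # bisect the combined CDF
            mid = (a + b) / 2
            if combined(mid) < q:
                a = mid
            else:
                b = mid
        out.append((a + b) / 2)
    out.append(hi)
    return out

h1 = (1000, [0, 5, 10])     # 1000 rows spread over [0, 10]
h2 = (1000, [10, 15, 20])   # 1000 rows spread over [10, 20]
print(merge_histograms([h1, h2], target=4))
# bounds converge to roughly [0, 5, 10, 15, 20]
```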
The attached patch is a PoC - it should work, but there's plenty of room
for improvement. It only deals with "regular" per-column statistics, not
with extended stats (but I don't see why it wouldn't work for them too).
It adds a new analyze option "MERGE" which does not sample the children
but instead just combines the statistics. So the example from [2] looks
like this:
======================================================================
create table p (i integer, j integer) partition by list (i);
create table p0 partition of p for values in (0);
create table p1 partition of p for values in (1);
insert into p select 0,generate_series(1,1000);
insert into p select 1,generate_series(1,1000);
analyze p0;
analyze p1;
create table q (i integer);
insert into q values (0);
analyze q;
test=# explain select * from q join p using (i) where j
between 1 and 500;
                              QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1.02..49.82 rows=5 width=8)
   Hash Cond: (p.i = q.i)
   ->  Append  (cost=0.00..45.00 rows=1000 width=8)
         ->  Seq Scan on p0 p_1  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
         ->  Seq Scan on p1 p_2  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
   ->  Hash  (cost=1.01..1.01 rows=1 width=4)
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4)
(9 rows)
test=# analyze (merge) p;
ANALYZE
test=# explain select * from q join p using (i) where j
between 1 and 500;
                              QUERY PLAN
---------------------------------------------------------------------
 Hash Join  (cost=1.02..54.77 rows=500 width=8)
   Hash Cond: (p.i = q.i)
   ->  Append  (cost=0.00..45.00 rows=1000 width=8)
         ->  Seq Scan on p0 p_1  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
         ->  Seq Scan on p1 p_2  (cost=0.00..20.00 rows=500 width=8)
               Filter: ((j >= 1) AND (j <= 500))
   ->  Hash  (cost=1.01..1.01 rows=1 width=4)
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4)
(9 rows)
======================================================================
FWIW I'm not sure we need the separate MERGE mode, but it's an easy way
to allow both the regular and "merge" approach, so that it's possible to
experiment and compare the behavior.
One issue is that this would require some coordination between the
parent/child analyzes. Essentially, any time a child is analyzed, the
parent should rebuild the stats (to merge from the new child stats).
This is similar to the issue of analyzing the parent too often because
we don't know when exactly the counters get updated, but it's much
cheaper to merge the stats, so it's much less problematic.
regards
[1] https://commitfest.postgresql.org/32/2492/
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: 0001-Combine-statistics-from-child-relations.patch (text/x-patch, 26.1 KB)