From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: WIP: multivariate statistics / proof of concept |
Date: | 2015-01-24 20:21:39 |
Message-ID: | 54C3FED3.1060600@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
attached is an updated version of the multivariate stats patch. This is
going to be a bit longer mail, so I'll put here a small ToC ;-)
1) patch split into 4 parts
2) where to start / documentation
3) state of the code
4) main changes/improvements
5) remaining limitations
The motivation and design ideas, explained in the first message of this
thread are still valid. It might be a good idea to read it first:
http://www.postgresql.org/message-id/flat/543AFA15(dot)4080608(at)fuzzy(dot)cz
BTW if you happen to go to FOSDEM [PGDay], I'll gladly give you an intro
into the patch in person, or discuss the patch in general.
1) Patch split into 4 parts
---------------------------
Firstly, the patch got broken into the following four pieces, to make
the reviews somewhat easier:
1) 0001-shared-infrastructure-and-functional-dependencies.patch
- infrastructure, shared by all the kinds of stats added
in the following patches (catalog, ALTER TABLE, ANALYZE ...)
- implementation of a simple statistics, tracking functional
dependencies between columns (previously called "associative
rules", but that's incorrect for several reasons)
- this does not modify the optimizer in any way
2) 0002-clause-reduction-using-functional-dependencies.patch
- applies the functional dependencies to optimizer (i.e. considers
the rules in clauselist_selectivity())
3) 0003-multivariate-MCV-lists.patch
- multivariate MCV lists (both ANALYZE and optimizer parts)
4) 0004-multivariate-histograms.patch
- multivariate histograms (both ANALYZE and optimizer parts)
You may look at the patches at github here:
https://github.com/tvondra/postgres/tree/multivariate-stats-squashed
The branch is not stable, i.e. I'll rebase / squash / force-push changes
in the future. (There's also multivariate-stats development branch with
unsquashed changes, but you don't want to look at that, trust me.)
The patches are not exactly small (being in the 50-100 kB range), but
that's mostly because of the amount of comments explaining the goals and
implementation details.
2) Where to start / documentation
---------------------------------
I strived to document all the pieces properly, mostly in the form of
comments. There's no sgml documentation at this point, which should
obviously change in the future.
Anyway, I'd suggest reading the first e-mail in this thread, explaining
the ideas, and then these comments:
1) functional dependencies (patch 0001)
- src/backend/utils/mvstats/dependencies.c
2) MCV lists (patch 0003)
- src/backend/utils/mvstats/mcv.c
3) histograms (patch 0004)
- src/backend/utils/mvstats/mcv.c
- also see clauselist_mv_selectivity_mcvlist() in clausesel.c
- also see clauselist_mv_selectivity_histogram() in clausesel.c
4) selectivity estimation (patches 0002-0004)
- all in src/backend/optimizer/path/clausesel.c
- clauselist_selectivity() - overview of how the stats are applied
- clauselist_apply_dependencies() - functional dependencies reduction
- clauselist_mv_selectivity_mcvlist() - MCV list estimation
- clauselist_mv_selectivity_histogram() - histogram estimation
3) State of the code
--------------------
I've spent a fair amount of time testing the patches, and while I
believe there are no segfaults or so, I know parts of the code need a
bit more love.
The part most in need of improvements / comments is probably the code in
clausesel.c - that seems a bit quirky. Reviews / comments regarding this
part of the code are very welcome - I'm sure there are many ways to
improve this part.
There are a few FIXMEs elsewhere (e.g. about memory allocation in the
(de)serialization code), but those are mostly well-defined issues that I
know how to address (at least I believe so).
4) Main changes/improvements
----------------------------
There are many significant improvements. The previous patch version was
in the 'proof of concept' category (missing pieces, knowingly broken in
some areas), the current patch should 'mostly work'.
The patch fixes two most annoying limitations of the first version:
(a) support for all data types (not just those passed by value)
(b) handles NULL values properly
(c) adds support for IS [NOT] NULL clauses
Aside from that the code was significantly improved, there are proper
regression tests and plenty of comments explaining the details.
5) Remaining limitations
------------------------
(a) limited to stats on 8 columns
This is mostly just a 'safeguard' restriction.
(b) only data types with '<' operator
I don't think this will change anytime soon, because all the
algorithms for building the stats rely on this. I don't see
this as a serious limitation though.
(c) not handling DROP COLUMN or DROP TABLE and so on
Currently this is not handled at all (so the regression tests
do an explicit DELETE from the pg_mv_statistic catalog).
Handling the DROP TABLE won't be difficult, it's similar to the
current stats. Handling ALTER TABLE ... DROP COLUMN will be much
more tricky I guess - should we drop all the stats referencing
that column, or should we just remove it from the stats? Or
should we keep it and treat it as NULL? Not sure what's the best
solution.
(d) limited list of compatible WHERE clauses
The initial patch handled only simple operator clauses
(Var op Constant)
where operator is one of ('<', '<=', '=', '>=', '>'). Now it also
handles IS [NOT] NULL clauses. Adding more clause types should
not be overly difficult - starting with more traditional
'BooleanTest' conditions, or even multi-column conditions
(Var op Var)
which are difficult to estimate using simple-column stats.
(e) optimizer uses single stats per table
This is still true and I don't think this will change soon. i do
have some ideas on how to merge multiple stats etc. but it's
certainly complex stuff, unlikely to happen within this CF. The
patch makes a lot of sense even without this particular feature,
because you can create multiple stats, each suitable for different
queries.
(f) no JOIN conditions
Similarly to the previous point, it's on the TODO but it's not
going to happen in this CF.
kind regards
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment | Content-Type | Size |
---|---|---|
0001-shared-infrastructure-and-functional-dependencies.patch | text/x-diff | 59.8 KB |
0002-clause-reduction-using-functional-dependencies.patch | text/x-diff | 53.4 KB |
0003-multivariate-MCV-lists.patch | text/x-diff | 100.6 KB |
0004-multivariate-histograms.patch | text/x-diff | 114.6 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2015-01-24 21:31:14 | Shortcoming in CLOBBER_FREED_MEMORY coverage: disk buffer pointers |
Previous Message | Tom Lane | 2015-01-24 16:59:37 | Re: numeric access out of bounds |