Re: Multidimensional Histograms

From: Alexander Cheshev <alex(dot)cheshev(at)gmail(dot)com>
To: Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Teodor Sigaev <teodor(at)postgrespro(dot)ru>
Subject: Re: Multidimensional Histograms
Date: 2024-01-08 09:21:34
Message-ID: CAN_hQms_KKKGqyhUopk6t=cDgti5p3UWx9ruKXHsTERoXRs2-w@mail.gmail.com
Lists: pgsql-hackers

Hi Andrei,

> Maybe my wording needed to be more precise. I haven't implemented
> multidimensional histograms before, so I don't know how expensive they
> are. I meant that for dependency statistics over about six columns, we
> have a lot of combinations to compute.

An equi-depth histogram in the 6-dimensional case requires only 6 times
more iterations to build than a one-dimensional one. Postgres already uses
equi-depth histograms for its per-column statistics. Even if you increase
the number of buckets from 100 to 1000, there is no extra overhead, because
the time complexity of building an equi-depth histogram does not depend on
the number of buckets. So, no overhead at all!
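
To make the cost argument concrete, here is a rough Python sketch (Python
rather than C, in the spirit of the experiments Tomas suggested upthread)
of one possible construction, a k-d-style recursive median split.
Everything in it is made up for illustration; it is a sketch, not the
implementation itself:

    # Multidimensional equi-depth histogram via recursive median
    # splits, cycling through the dimensions (illustrative only).
    def build_histogram(rows, depth, dim=0):
        # Split a non-empty list of equal-length tuples into
        # 2**depth equi-depth buckets. Each recursion level splits
        # on the next dimension, so a 6-dimensional histogram needs
        # 6 times more levels than a 1-dimensional one of the same
        # per-dimension resolution.
        if depth == 0 or len(rows) < 2:
            # A bucket keeps its bounding box and its row count.
            ncols = len(rows[0])
            lo = tuple(min(r[i] for r in rows) for i in range(ncols))
            hi = tuple(max(r[i] for r in rows) for i in range(ncols))
            return [(lo, hi, len(rows))]
        rows = sorted(rows, key=lambda r: r[dim])
        mid = len(rows) // 2            # median split => equal depth
        nxt = (dim + 1) % len(rows[0])  # cycle through dimensions
        return (build_histogram(rows[:mid], depth - 1, nxt) +
                build_histogram(rows[mid:], depth - 1, nxt))

    # 2**6 = 64 buckets over a 3-column sample:
    sample = [(i % 7, (i * 3) % 11, i % 5) for i in range(1000)]
    buckets = build_histogram(sample, depth=6)

The work at each level is dominated by sorting the sample, so the final
bucket count contributes very little to the build cost.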

Regards,
Alexander Cheshev

On Mon, 8 Jan 2024 at 04:12, Andrei Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
>
> On 8/1/2024 01:36, Tomas Vondra wrote:
> > On 1/7/24 18:26, Andrei Lepikhov wrote:
> >> On 7/1/2024 17:51, Tomas Vondra wrote:
> >>> On 1/7/24 11:22, Andrei Lepikhov wrote:
> >>>> On 7/1/2024 06:54, Tomas Vondra wrote:
> >>>>> It's an interesting area for experiments, no doubt about it. And if you
> >>>>> choose to explore it, that's fine. But it's better to be aware it may
> >>>>> not end with a commit.
> >>>>> For the multi-dimensional case, I propose we first try to experiment
> >>>>> with the various algorithms, and figure out what works etc. Maybe
> >>>>> implementing them in Python or something would be easier than C.
> >>>>
> >>>> Incidentally, while trying to utilize extended statistics for some
> >>>> problematic cases, I am experimenting with auto-generating such
> >>>> statistics from index definitions [1]. Doing that, I wanted to add some
> >>>> hand-made statistics, like a multidimensional histogram or just a
> >>>> histogram, which could help to perform estimation over a set of
> >>>> columns/expressions.
> >>>> I realized that current hooks get_relation_stats_hook and
> >>>> get_index_stats_hook are insufficient if I want to perform an estimation
> >>>> over a set of ANDed quals on different columns.
> >>>> In your opinion, is it possible to add a hook into the extended
> >>>> statistics to allow for an extension to propose alternative estimation?
> >>>>
> >>>> [1] https://github.com/danolivo/pg_index_stats
> >>>>
> >>>
> >>> No idea, I haven't thought about that very much. Presumably the existing
> >>> hooks are insufficient because they're per-attnum? I guess it would make
> >>> sense to have a hook for all the attnums of the relation, but I'm not
> >>> sure it'd be enough to introduce a new extended statistics kind ...
> >>
> >> I got stuck on the same problem Alexander mentioned: we usually have
> >> large tables with many uniformly distributed values. In this case, MCV
> >> doesn't help a lot.
> >>
> >> Usually, I face problems scanning a table with a filter containing 3-6
> >> ANDed quals. Here, Postgres multiplies the selectivities and ends up
> >> with a selectivity of less than one tuple. But in reality such scans
> >> mostly have some physical meaning and return a bunch of tuples. It looks
> >> like the set of columns represents some value of a composite type.
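
To put made-up numbers on the underestimate described above: under the
planner's attribute-independence assumption, the per-qual selectivities
simply multiply.

    rows = 1_000_000_000    # hypothetical table size
    sel = 0.01              # hypothetical per-qual selectivity
    print(rows * sel ** 6)  # -> roughly 0.001, i.e. "less than 1 tuple"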
> >
> > Understood. That's a fairly common scenario, and it can easily end up
> > with a rather terrible plan due to the underestimate.
> >
> >> Sometimes extended statistics on dependencies help, but they are
> >> expensive for multiple columns.
> >
> > Expensive in what sense? Also, how would a multidimensional histogram be
> > any cheaper?
> Maybe my wording needed to be more precise. I haven't implemented
> multidimensional histograms before, so I don't know how expensive they
> are. I meant that for dependency statistics over about six columns, we
> have a lot of combinations to compute.
> >
> >> And sometimes I see that even a trivial histogram
> >> on a ROW(x1,x2,...) could predict a much more adequate value (a kind
> >> of conservative upper estimate) for a clause like "x1=N1 AND x2=N2 AND
> >> ..." if somewhere in an extension we'd transform it to ROW(x1,x2,...) =
> >> ROW(N1, N2,...).
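
As a sketch of that upper-bound behaviour, reusing the hypothetical
(lo, hi, count) bucket layout from the sketch above: the ANDed clause
becomes a point lookup, and the containing bucket cannot match more rows
than it holds.

    def point_upper_bound(buckets, point, total_rows):
        # Upper-bound selectivity for ROW(x1,...) = ROW(N1,...):
        # a bucket cannot match more rows than it contains.
        for lo, hi, n in buckets:
            if all(l <= p <= h for l, p, h in zip(lo, point, hi)):
                return n / total_rows
        return 0.0  # the point falls outside every bucket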
> >
> > Are you guessing the histogram would help, or do you know it'd help? I
> > have no problem believing that for range queries, but I think it's far
> > less clear for simple equalities. I'm not sure a histogram would work
> > for that ...
>
> I added Teodor Sigaev to the CC of this email; he has much more user
> feedback on this technique. As I see it, having a histogram over a set
> of columns gives us an upper-bound selectivity estimation for the filter.
> I'm not sure how good a simple histogram is in that case either, but in
> my practice it works, preventing the use of too-optimistic query plans.
>
> > Maybe it'd be possible to track more stuff for each bucket, not just the
> > frequency, but also ndistinct for combinations of columns, and then use
> > that to do better equality estimates. Or maybe we could see how the
> > "expected" and "actual" bucket frequency compare, and use that to
> > correct the equality estimate? Not sure.
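
Spelled out as a formula over a hypothetical extended bucket layout
(lo, hi, nrows, ndistinct), that idea might look like:

    def equality_estimate(bucket, total_rows):
        # Assume matching rows are spread uniformly across the
        # distinct value combinations observed in the bucket.
        lo, hi, nrows, ndistinct = bucket
        return (nrows / ndistinct) / total_rows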
> Yes, that is on my mind, too. Having such experimental stuff as
> extensions on GitHub, we could get some community feedback.
> >
> > But perhaps you have some data to demonstrate it'd actually help?
> That question should be redirected to Teodor, but I will consider
> translating messy real-life reports into a clear example.
> >
> >> For such cases we don't have an in-core solution, and introducing a hook
> >> on clause list estimation (maybe paired with a hook on statistics
> >> generation) could enable an extension that deals with that
> >> problem. Also, it would open a way for experiments with different types
> >> of extended statistics ...
> > I really don't know how to do that, or what it would take. AFAICS the
> > statistics system is not very extensible from external code. Even if we
> > could add new types of attribute/extended stats, I'm not sure the code
> > calculating the estimates could use that.
> I imagine we can add an analysis routine and directly store statistics
> in an extension for demonstration and experimental purposes, no problem.
>
> --
> regards,
> Andrei Lepikhov
> Postgres Professional
>
