Re: Collect frequency statistics for arrays

From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Collect frequency statistics for arrays
Date: 2012-03-08 19:22:46
Message-ID: 20120308192246.GB20680@tornado.leadboat.com
Lists: pgsql-hackers

On Thu, Mar 08, 2012 at 11:30:52AM -0500, Tom Lane wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
> > On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote:
> >> On reflection my idea above is wrong; for example assume that we have a
> >> column with 900 arrays of length 1 and 100 arrays of length 2. Going by
> >> what I said, we'd reduce the histogram to {1,2}, which might accurately
> >> capture the set of lengths present but fails to show that 1 is much more
> >> common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> >> would capture the situation perfectly in one-tenth the space that the
> >> current logic does.
>
> > Granted. When the next sample finds 899/101 instead, though, the optimization
> > vanishes.
>
> No, you missed my next point. That example shows that sometimes a
> smaller histogram can represent the situation with zero error, but in
> all cases a smaller histogram can represent the situation with perhaps
> more error than a larger one. Since we already have a defined error
> tolerance, we should try to generate a histogram that is as small as
> possible while still not exceeding the error tolerance.
>
> Now, it might be that doing that is computationally impractical, or
> too complicated to be reasonable. But it seems to me to be worth
> looking into.

Yes, I did miss your point.

One characteristic favoring this approach is its equal applicability to both
STATISTIC_KIND_HISTOGRAM and STATISTIC_KIND_DECHIST.
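To make that concrete, here's a rough, untested sketch of one possible shape
for "smallest histogram that still stays within the error tolerance", fed with
your 900/100 example. The function names and the CDF-based error metric are
inventions for this message, not anything analyze.c or compute_array_stats()
actually does; the point is only that a greedy search over histogram sizes is
simple:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Fraction of sample values <= v; "sample" must be sorted ascending. */
static double
empirical_cdf(const int *sample, int n, int v)
{
    int count = 0;

    while (count < n && sample[count] <= v)
        count++;
    return (double) count / n;
}

/* Equal-frequency bounds: entry i is the sample value at position i*(n-1)/(k-1). */
static void
build_bounds(const int *sample, int n, int *bounds, int k)
{
    int i;

    for (i = 0; i < k; i++)
        bounds[i] = sample[(int) rint((double) i * (n - 1) / (k - 1))];
}

/* CDF implied by the histogram, assuming each of the k-1 bins holds 1/(k-1) of the rows. */
static double
histogram_cdf(const int *bounds, int k, int v)
{
    int below = 0;

    while (below < k && bounds[below] <= v)
        below++;
    if (below == k)
        return 1.0;
    return below > 0 ? (double) (below - 1) / (k - 1) : 0.0;
}

/* Smallest bound count (>= 2) whose maximum CDF error stays within tol, capped at max_k. */
static int
smallest_adequate_histogram(const int *sample, int n, double tol, int max_k)
{
    int *bounds = malloc(sizeof(int) * max_k);
    int k;

    for (k = 2; k < max_k; k++)
    {
        double maxerr = 0.0;
        int i;

        build_bounds(sample, n, bounds, k);
        for (i = 0; i < n; i++)
        {
            double err = fabs(histogram_cdf(bounds, k, sample[i]) -
                              empirical_cdf(sample, n, sample[i]));

            if (err > maxerr)
                maxerr = err;
        }
        if (maxerr <= tol)
            break;
    }
    free(bounds);
    return k;
}

int
main(void)
{
    int sample[1000];
    int i;

    /* your example: 900 arrays of length 1, 100 arrays of length 2 */
    for (i = 0; i < 1000; i++)
        sample[i] = (i < 900) ? 1 : 2;

    printf("bounds needed at 5%% tolerance: %d\n",
           smallest_adequate_histogram(sample, 1000, 0.05, 100));
    return 0;
}

The error check only really needs to visit distinct lengths rather than every
sample entry, and array columns rarely have many distinct lengths, so I'd
expect the extra cost during ANALYZE to be negligible. And, as noted above,
nothing in the loop is specific to either statistics kind.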

> > If you want to materially narrow typical statistics, Alexander's
> > proposal looks like the way to go. I'd guess array columns always
> > having DEC <= default_statistics_target are common enough to make that
> > representation the dominant representation, if not the only necessary
> > representation.
>
> Well, I don't want to have two representations; I don't think it's worth
> the complexity. But certainly we could consider switching to a
> different representation if it seems likely to usually be smaller.

Perhaps some heavy array users could provide input: what are some typical
length ranges among arrays in your applications on which you use "arr &&
const", "arr @> const" or "arr <@ const" searches?
