From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Noah Misch <noah(at)leadboat(dot)com>, Nathan Boley <npboley(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Collect frequency statistics for arrays |
Date: | 2012-03-12 08:06:44 |
Message-ID: | CAPpHfdsoBcKokndg76RTKwo-Az8eoxdzVyLSKK5oGdhuZWopkQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Mar 8, 2012 at 4:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > True. If (max count - min count + 1) is small, enumerating of frequencies
> > is both more compact and more precise representation. Simultaneously,
> > if (max count - min count + 1) is large, we can run out of
> > statistics_target with such representation. We can use same
> representation
> > of count distribution as for scalar column value: MCV and HISTOGRAM, but
> it
> > would require additional statkind and statistics slot. Probably, you've
> > better ideas?
>
> I wasn't thinking of introducing two different representations,
> but just trimming the histogram length when it's larger than necessary.
>
> On reflection my idea above is wrong; for example assume that we have a
> column with 900 arrays of length 1 and 100 arrays of length 2. Going by
> what I said, we'd reduce the histogram to {1,2}, which might accurately
> capture the set of lengths present but fails to show that 1 is much more
> common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
> would capture the situation perfectly in one-tenth the space that the
> current logic does.
>
> More generally, by limiting the histogram to statistics_target entries,
> we are already accepting errors of up to 1/(2*statistics_target) in the
> accuracy of the bin-boundary representation. What the above example
> shows is that sometimes we could meet the same accuracy requirement with
> fewer entries. I'm not sure how this could be mechanized but it seems
> worth thinking about.
>
I can propose following representation of histogram.
If (max_count - min_count + 1) <= statistics_target then
1) store max_count and min_count in stavalues
2) store frequencies from min_count ot max_count in numvalues
If (max_count - min_count + 1) > statistics_target then
store histogram in current manner. I think in this case it's unlikely to be
many repeating values.
I can propose patch which change histogram representation to this.
Comments?
------
With best regards,
Alexander Korotkov.
From | Date | Subject | |
---|---|---|---|
Next Message | Etsuro Fujita | 2012-03-12 08:07:30 | Corrected: Re: pgsql_fdw, FDW for PostgreSQL server |
Previous Message | Pavel Stehule | 2012-03-12 04:43:01 | Re: poll: CHECK TRIGGER? |