Quick Links

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Rod Taylor <rbt(at)sitesell(dot)com>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Gurmeet Manku <manku(at)cs(dot)stanford(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Date:	2005-04-26 23:28:21
Message-ID:	87oec15d9m.fsf@stark.xeocode.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers pgsql-performance

Rod Taylor <rbt(at)sitesell(dot)com> writes:

> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good.
> >
> > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> >
> > It does require a single full table scan
>
> Ack.. Not by default please.
>
> I have a few large append-only tables (vacuum isn't necessary) which do
> need stats rebuilt periodically.

The algorithm can also naturally be implemented incrementally. Which would be
nice for your append-only tables. But that's not Postgres's current philosophy
with statistics. Perhaps some trigger function that you could install yourself
to update statistics for a newly inserted record would be useful.

The paper is pretty straightforward and easy to read, but here's an executive
summary:

The goal is to gather a uniform sample of *distinct values* in the table as
opposed to a sample of records.

Instead of using a fixed percentage sampling rate for each record, use a hash
of the value to determine whether to include it. At first include everything,
but if the sample space overflows throw out half the values based on their
hash value. Repeat until finished.

In the end you'll have a sample of 1/2^n of your distinct values from your
entire data set where n is large enough for you sample to fit in your
predetermined constant sample space.

--
greg

In response to

Re: [PERFORM] Bad n_distinct estimation; hacks suggested? at 2005-04-26 23:10:11 from Rod Taylor

Responses

Re: [PERFORM] Bad n_distinct estimation; hacks suggested? at 2005-04-27 00:16:32 from Rod Taylor

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	John DeSoi	2005-04-27 00:02:30	Re: [HACKERS] Continue transactions after errors in psql
Previous Message	Rod Taylor	2005-04-26 23:10:11	Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Josh Berkus	2005-04-27 00:00:13	Re: Table Partitioning: Will it be supported in Future?
Previous Message	Rod Taylor	2005-04-26 23:10:11	Re: [PERFORM] Bad n_distinct estimation; hacks suggested?