From: Martijn van Oosterhout <kleptog(at)cupid(dot)suninternet(dot)com>
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
Cc: Joseph Shraibman <jks(at)selectacast(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Weird indices
Date: 2001-02-21 12:55:48
Message-ID: 20010221235548.A21046@kleptog.nth
Lists: pgsql-general
On Tue, Feb 20, 2001 at 11:44:15PM -0800, Stephan Szabo wrote:
> On Wed, 21 Feb 2001, Martijn van Oosterhout wrote:
>
> > On Tue, Feb 20, 2001 at 05:02:22PM -0800, Stephan Szabo wrote:
> > >
> > > IIRC, there's something which is effectively:
> > > estimated rows = <most common value's frequency> * <fraction>
> > > I think fraction defaults to (is always?) 1/10 for the standard
> > > index type. That's where the 50 comes from. And the frequency is
> > > probably from the last vacuum analyze.
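To make that concrete, here is a toy version of that calculation. Both
numbers are assumptions for illustration, not values read from the planner
source:

    # Toy version of the estimate described above; the frequency and the
    # 1/10 fraction are assumed for illustration, not taken from the source.
    most_common_freq = 500   # rows holding the most common value, per VACUUM ANALYZE
    fraction = 1.0 / 10.0    # the default fraction mentioned above
    estimated_rows = most_common_freq * fraction
    print(estimated_rows)    # 50.0 -- one way the "50" in EXPLAIN could arise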
> >
> > Is there a way to change this fraction?
> >
> > We have a table with over 1 million rows, and the statistics Postgres
> > gathers for it are not particularly useful. No single (non-null) value
> > occurs significantly more often than the others, but the distribution
> > looks a lot like a 1/x curve, I guess. The most common value occurs
> > 5249 times but the average is only 95, so Postgres almost always
> > chooses a seq scan. We actually now set enable_seqscan=off in many
> > areas of our code to speed it up to a useful rate. (This table also
> > happens to have an (accidental) clustering on this column.)
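For the record, the workaround looks roughly like this from application
code. This is only a sketch: "big_table", "skewed_col", and the function
itself are placeholders, and any DB-API-style driver would do:

    # Sketch of the enable_seqscan workaround; conn is an open connection
    # from whatever driver is in use, and the table/column names are made up.
    def fetch_matches(conn, value):
        cur = conn.cursor()
        cur.execute("SET enable_seqscan = off")  # steer the planner toward the index
        cur.execute("SELECT * FROM big_table WHERE skewed_col = %s", (value,))
        rows = cur.fetchall()
        cur.execute("SET enable_seqscan = on")   # restore the default afterwards
        cur.close()
        return rows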
> >
> > What is the reasoning behind estimating like that? Why not just the average
> > or the average + 1 SD?
> >
> > Another idea: is there a use for making a "cohesiveness" index, i.e.
> > if you're looking up X via the index, on average how many other
> > matching tuples will be in the next 8k (or whatever size)? Since these
> > are likely to be in the cache, the cost of retrieval would be much
> > lower. This would mean that an index on a clustered column would have
> > a much lower estimated cost than an index on other columns, which
> > would make clustering more useful.
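A back-of-the-envelope version of what I have in mind, simplified to "same
page" rather than "next 8k" (the data and page numbers are invented; a real
implementation would look at the heap page each index entry points to):

    # Toy "cohesiveness" measure: for each tuple matching a value, how many
    # other matching tuples sit on the same page? Clustered data scores high,
    # which could justify a lower estimated cost for an index scan.
    from collections import Counter, defaultdict

    # (value, page) pairs standing in for (indexed column value, heap page)
    tuples = [("a", 1), ("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 7)]

    pages_per_value = defaultdict(Counter)
    for value, page in tuples:
        pages_per_value[value][page] += 1

    for value, pages in pages_per_value.items():
        n = sum(pages.values())
        # average number of same-page companions per matching tuple
        avg_neighbours = sum(c * (c - 1) for c in pages.values()) / n
        print(value, avg_neighbours)   # "a" is clustered (1.5), "b" is not (0.0)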
>
> Well, there's been talk about keeping better statistics in the future
> (see the -hackers archives; I can't remember the thread though, it was a
> while ago). Keeping the most common frequency and some kind of frequency
> graph or standard deviation would probably be useful. As for
> cohesiveness, that gets kind of difficult to keep track of as changes
> are made, but it could probably be of some use to the estimator.
Well, who said anything about keeping track of changes? If the table is
large, chances are this value wouldn't change very quickly; if the table is
small, it doesn't matter. It just seems to me the best way to make
clustering work better.
> As far as I know the only way to change the fraction is through
> recompiling, but Tom would probably know better about that;
> unfortunately, that's a really big stick to hit the problem with.
I realize that keeping better statistics is the best solution. However, not
all types of data can have a standard deviation, since you need some form
of order, and that is not obvious in many cases...
Oh well,
--
Martijn van Oosterhout <kleptog(at)cupid(dot)suninternet(dot)com>
http://cupid.suninternet.com/~kleptog/