From: | Alban Hertroys <dalroi(at)solfertje(dot)student(dot)utwente(dot)nl> |
---|---|
To: | Glenn Maynard <glenn(at)zewt(dot)org> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Why count(*) doest use index? |
Date: | 2011-03-08 11:16:00 |
Message-ID: | F63B4236-3859-44D7-AE3D-46679DB62FD4@solfertje.student.utwente.nl |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On 7 Mar 2011, at 22:16, Glenn Maynard wrote:
> The stats system only helps for the most basic case--counting the number of rows in a table. In my experience that's not very common; most of the time it's counting total results from some more interesting query, eg. for pagination. In my particular case, I'm caching results for SELECT COUNT(*), expr2 FROM table WHERE expr GROUP BY expr2 (for a very limited set of expressions).
It's not uncommon to track your own statistics on your data. Often this doesn't go beyond tracking COUNT(*) on various combinations of conditions.
If your data approaches a normal distribution though, I think you can go a step further by tracking the distribution of values in one column for a given value in another.
I'm not a mathematician, but I'm pretty sure you could do something like this (with the example given down-thread) to describe the distributions of values in your main table:
CREATE TABLE user_event_time (
user integer UNIQUE REFERENCES events (user),
count integer,
min date,
max date,
avg date,
stddev date
);
CREATE TABLE event_time_user (
event_time date UNIQUE REFERENCES events (event_time),
count integer,
min integer,
max integer,
avg integer,
stddev integer
);
Now, given a user ID, the first table gives you the chance of a specific event_time occurring - which with a normal distribution should be very close to the percentage of the total number of rows that match the set. Say you have 1000 rows and there's 23% chance that there's an event involving user 50 at '2011-01-01', then that means 230 rows match those conditions.
You can do the same query the other way around base on the event time and the distribution of users at that date.
Combining both will give you better accuracy.
Whether this is practical to do is another question entirely, I just thought of this while reading this thread ;)
Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
!DSPAM:737,4d760ff7235881825915661!
From | Date | Subject | |
---|---|---|---|
Next Message | Francisco Figueiredo Jr. | 2011-03-08 14:53:07 | Re: unexpected EOF on client connection vs 9.0.3 |
Previous Message | rsmogura | 2011-03-08 08:50:36 | Re: unexpected EOF on client connection vs 9.0.3 |