From: | Henk Ernst Blok <h(dot)e(dot)blok(at)utwente(dot)nl> |
---|---|
To: | Tino Wildenhain <tino(at)wildenhain(dot)de> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Strange count(*) implementation? |
Date: | 2004-10-26 13:25:05 |
Message-ID: | 417E5031.8040409@utwente.nl |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hi Tino,
Tino Wildenhain wrote:
>>I assume(d) the more expensive statistics (e.g., value distribution
>>info) are updated only when outdated too much or on request (manual
>>vacuum). Usually, other/cheap statistics can easily be maintained
>>incrementally and thus reflect actual table state after each update.
>>
>>
>I remember some discussion about this. But also here the MVCC and
>the general call for performance leads to the current solution
>where statistics are only updated on vacuum. With PG8.0 you have
>vacuum strategies with better performance which can run more often
>as I understand - so while not giving you exact figures your
>count() could be estimated at least.
>
>
I need exact figures at the moment due to the type of some scientific
experiments I'm running. Approximations are in the pipeline but I need a
base-line run without any tricks that affect the accuracy of the numbers
involved.
>> Of course, the MVCC principle seems to make things a bit more
>>complicated I understand now. But tracking whether statistics are
>>dirty has to be in the system anyway. How does it otherwise decide
>>when to do its own statistics updates? So if explain can get the most
>>recent count, why not use it in the count as well if you know the
>>statistics are still acurate?
>>
>>
>The point is: you dont know it. There is curently no: mark statistics
>dirty if table has new tuples or tuples removed.
>
>
OK. Didn't know that, but got the impression by now it worked that way
in Postgres.
>>By the way, a count(*) without any where does occur very frequently if
>>you are dealing with an OLAP load, which is the case in my setting.
>>So, I indeed already 'fixed' the performance problem by precomputing
>>all counts I can predict to be of any use.
>>
>>
>I'm not familar with OLAP specifics, so what is the meaning of the
>count() here? What is done with this information?
>
>
OLAP stands for OnLine Analystical Processing, typically meaning heavy
queries with a lot of data (for typical examples, see: http://www.tpc.org/
the TPC-H query set in particular). So decision support and datamining
are in that area for instance. My topic of interest is IR (information
retrieval) in a database context. My experiments behave like an OLAP
load at the moment. My current experiments involve a lot of counting and
expensive joins as I have to compute certain estimators in a
mathematical model I'm working on, hence the importance of the count... ;)
On MySQL each of the 30 queries I have to run took on average about 24
h. As my queries are getting even complexer I'm now trying to find out
whether Postgres can do a better job.
Regards,
Henk Ernst
--
address: DB group, Computer Science, EEMCS Dept., University of Twente,
PO Box 217, 7500 AE, ENSCHEDE, THE NETHERLANDS
phone: ++31 (0)53 489 3754 (if no response: 3690)
email: h(dot)e(dot)blok(at)utwente(dot)nl
WWW: http://www.cs.utwente.nl/~blokh
From | Date | Subject | |
---|---|---|---|
Next Message | Valentin Militaru | 2004-10-26 13:27:49 | Re: Any plans on allowing user-defined triggers to |
Previous Message | Mike Mascari | 2004-10-26 13:18:34 | Re: Any plans on allowing user-defined triggers to be deferrable? |