Re: Strange count(*) implementation?

From: Henk Ernst Blok <h(dot)e(dot)blok(at)utwente(dot)nl>
To: Tino Wildenhain <tino(at)wildenhain(dot)de>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Strange count(*) implementation?
Date: 2004-10-26 13:25:05
Message-ID: 417E5031.8040409@utwente.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Tino,

Tino Wildenhain wrote:

>>I assume(d) the more expensive statistics (e.g., value distribution
>>info) are updated only when outdated too much or on request (manual
>>vacuum). Usually, other/cheap statistics can easily be maintained
>>incrementally and thus reflect actual table state after each update.
>>
>>
>I remember some discussion about this. But also here the MVCC and
>the general call for performance leads to the current solution
>where statistics are only updated on vacuum. With PG8.0 you have
>vacuum strategies with better performance which can run more often
>as I understand - so while not giving you exact figures your
>count() could be estimated at least.
>
>
I need exact figures at the moment due to the type of some scientific
experiments I'm running. Approximations are in the pipeline but I need a
base-line run without any tricks that affect the accuracy of the numbers
involved.

>> Of course, the MVCC principle seems to make things a bit more
>>complicated I understand now. But tracking whether statistics are
>>dirty has to be in the system anyway. How does it otherwise decide
>>when to do its own statistics updates? So if explain can get the most
>>recent count, why not use it in the count as well if you know the
>>statistics are still acurate?
>>
>>
>The point is: you dont know it. There is curently no: mark statistics
>dirty if table has new tuples or tuples removed.
>
>
OK. Didn't know that, but got the impression by now it worked that way
in Postgres.

>>By the way, a count(*) without any where does occur very frequently if
>>you are dealing with an OLAP load, which is the case in my setting.
>>So, I indeed already 'fixed' the performance problem by precomputing
>>all counts I can predict to be of any use.
>>
>>
>I'm not familar with OLAP specifics, so what is the meaning of the
>count() here? What is done with this information?
>
>
OLAP stands for OnLine Analystical Processing, typically meaning heavy
queries with a lot of data (for typical examples, see: http://www.tpc.org/
the TPC-H query set in particular). So decision support and datamining
are in that area for instance. My topic of interest is IR (information
retrieval) in a database context. My experiments behave like an OLAP
load at the moment. My current experiments involve a lot of counting and
expensive joins as I have to compute certain estimators in a
mathematical model I'm working on, hence the importance of the count... ;)
On MySQL each of the 30 queries I have to run took on average about 24
h. As my queries are getting even complexer I'm now trying to find out
whether Postgres can do a better job.

Regards,

Henk Ernst

--
address: DB group, Computer Science, EEMCS Dept., University of Twente,
PO Box 217, 7500 AE, ENSCHEDE, THE NETHERLANDS
phone: ++31 (0)53 489 3754 (if no response: 3690)
email: h(dot)e(dot)blok(at)utwente(dot)nl
WWW: http://www.cs.utwente.nl/~blokh

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Valentin Militaru 2004-10-26 13:27:49 Re: Any plans on allowing user-defined triggers to
Previous Message Mike Mascari 2004-10-26 13:18:34 Re: Any plans on allowing user-defined triggers to be deferrable?