Re: Yet Another COUNT(*)...WHERE...question

From: Bill Moran <wmoran(at)potentialtech(dot)com>
To: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:57:38
Message-ID: 20070815115738.352012cb.wmoran@potentialtech.com
Lists: pgsql-general

In response to "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>:

> > Yes, optimization. :) You don't need an exact count to tell someone
> > that there's more data and they can go to it.
>
>
> In general, I agree. But my example of Amazon was only to illustrate
> the point about two queries and why they may be needed. I seem to see
> many more pages than you do, but in any case, Google and Amazon can
> afford to be less precise.
>
> Thanks for the suggestion of using EXPLAIN and parsing an
> approximation, but when you need to show a trader how many trades he
> has made, for instance, then approximation is not a possibility at
> all. Especially not if the numbers sway so wildly --
>
>
> FIRSTDB=# explain select * from trades where t_id = 'kXjha';
> QUERY PLAN
> -----------------------------------------------------------------------------------
> Bitmap Heap Scan on trades (cost=15.77..1447.12 rows=374 width=224)
> Recheck Cond: ((t_id)::text = 'kXjha'::text)
> -> Bitmap Index Scan on trades_tid_date (cost=0.00..15.67 rows=374 width=0)
> Index Cond: ((t_id)::text = 'kXjha'::text)
> (4 rows)
>
> FIRSTDB=# select count(*) from trades where t_id = 'kXjha';
> count
> -------
> 3891
> (1 row)
>
>
>
> Could I do something so that the EXPLAIN showed up with slightly more
> close-to-accurate stats? The above query is just after a "vacuum
> analyze"!

In the above case, you could probably materialize the data with a trigger
that updates a counter in a separate table every time a new trade is
added. This will give you 100% accurate results with _very_ fast
response time.
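A quick and dirty sketch of what I mean (trade_counts and the function/
trigger names are just placeholders I made up; adapt them to your schema):

  CREATE TABLE trade_counts (
      t_id  varchar PRIMARY KEY,      -- the same key you filter on
      n     bigint NOT NULL DEFAULT 0
  );

  CREATE OR REPLACE FUNCTION trades_count_trig() RETURNS trigger AS $$
  BEGIN
      IF TG_OP = 'INSERT' THEN
          UPDATE trade_counts SET n = n + 1 WHERE t_id = NEW.t_id;
          IF NOT FOUND THEN
              INSERT INTO trade_counts (t_id, n) VALUES (NEW.t_id, 1);
          END IF;
      ELSIF TG_OP = 'DELETE' THEN
          UPDATE trade_counts SET n = n - 1 WHERE t_id = OLD.t_id;
      END IF;
      RETURN NULL;   -- AFTER trigger, return value is ignored
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER trades_count
      AFTER INSERT OR DELETE ON trades
      FOR EACH ROW EXECUTE PROCEDURE trades_count_trig();

After that, SELECT n FROM trade_counts WHERE t_id = 'kXjha' is a single-row
index lookup. Note that the insert-if-missing branch isn't safe if two
sessions create the same t_id at exactly the same instant, and a hot counter
row can become a contention point under heavy write load -- details you'd
want to think through before relying on it.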

Part of the problem is that there's no one answer to your question; there
are multiple approaches to solving it, depending on the details of the
problem and the time/accuracy tradeoff you can accept. Some basic
approaches:

1) Materialize the data. MySQL actually does this automatically for you
with MyISAM tables, which is why count(*) is so fast. But if you
absolutely need fast, accurate counts, you can build your own
triggers in PG. This is unlikely to be practical with all queries.
2) Estimate. The accuracy of estimates can vary wildly depending on the
   query, how recently the table was analyzed, etc. For something like
   "show results 1 - 10 of about 50,000", estimates are great and fast,
   but for other cases they're not acceptable. The good news is you can
   get a fast estimate from any query with no up-front work (see the
   sketch after this list).
3) Accept that sometimes to get accurate answers it's going to take
time. Around here, we call it the "Orbitz" technique, because when
we discuss it, everyone thinks of the "please wait while I process
your query" page you get from orbitz.com. You'd be surprised how
willing your users are to wait, as long as they know they have to
wait.
4) Throw more hardware at it. If you absolutely _must_have_ super-
accurate results faster, then you may need to buy more RAM, faster
disks and faster CPUs to accomplish it.
5) Come up with something revolutionary that nobody's ever thought of
   before. Good luck with this one.
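To make #2 concrete: for a whole-table count you can read the planner's
estimate straight out of pg_class, and for an arbitrary WHERE clause you can
scrape the rows= figure from EXPLAIN. Something like this (count_estimate is
just a name I picked, not anything built in):

  -- whole-table estimate, kept current by ANALYZE / autovacuum
  SELECT reltuples::bigint AS approx_rows
    FROM pg_class
   WHERE relname = 'trades';

  -- rough estimate for an arbitrary query, by parsing EXPLAIN output
  CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
  DECLARE
      rec  record;
      cnt  bigint;
  BEGIN
      FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
          cnt := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
          EXIT WHEN cnt IS NOT NULL;
      END LOOP;
      RETURN cnt;
  END;
  $$ LANGUAGE plpgsql;

  SELECT count_estimate('SELECT * FROM trades WHERE t_id = ''kXjha''');

Against your example above this would return 374 rather than 3891, which is
exactly the accuracy problem you're describing -- it's only as good as the
statistics.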

Of course, all of these ideas are only practical if you've already
ensured that your system is properly tuned. Crappy values for
shared_buffers and other tuning parameters will lead you to waste time
trying to redesign something that should work just fine, so verify all
your configuration first. You may be able to get more acceptable
estimates by increasing your statistics targets, for example.
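For the t_id column in your example, that would look something like this
(the 1000 is just an illustration, not a recommendation -- experiment):

  ALTER TABLE trades ALTER COLUMN t_id SET STATISTICS 1000;
  ANALYZE trades;

You can also raise default_statistics_target in postgresql.conf if you want
it to apply across the board.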

--
Bill Moran
http://www.potentialtech.com
