Re: Caching of Queries

From: Pierre-Frédéric Caillaud <lists(at)boutiquenumerique(dot)com>
To: "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Caching of Queries
Date: 2004-10-03 13:20:49
Message-ID: opsfap0zbrcq72hf@musicbox
Lists: pgsql-performance


> pgpool (which makes some rather questionable claims IMO); any decent web
> application language/environment will support connection pooling.

That's why it should not be tied to something specific like pgpool.

If you want performance, which is the case here, you usually have a
webserver serving static files and an application server serving dynamic
pages.
This setup need not be heavyweight: it can be as simple as an Apache
instance serving static files, with a special path mod_proxy'ed to another
Apache instance acting as the application server.
IMHO this is a nice way to do it, because you get a lightweight static-file
server which can spawn many processes without using precious resources like
memory and Postgres connections, and a specialized server with far fewer
processes, each one being heavier and holding a db connection, etc. The
connections are permanent, of course, so there is no connection overhead.
The proxy has an extra advantage: it buffers the data from the "app server"
and sends it back slowly to the client, so the app server can very quickly
move on to the next request instead of hogging a db connection while the
html slowly trickles back to the client.
IMHO the standard PHP way of doing things (just one server) is wrong,
because every server process, even one serving static files, hogs a
connection and thus needs an extra layer for pooling.
Thus, I see query result caching as a way to push even further
architectures which are already optimized for performance, not as a
band-aid for poor designs like the single-Apache-server setup with pooling.
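The two-tier layout above can be sketched as an Apache configuration for
the front-end instance; the paths and port number here are made up for
illustration, not taken from any real deployment:

```apache
# Front-end Apache: serves static files itself, and proxies the
# dynamic part of the site to a second, smaller Apache instance
# acting as the application server (assumed here on port 8080).
DocumentRoot /var/www/static

# Requests under /app/ are forwarded to the app server; only its
# few heavy processes hold permanent Postgres connections.
ProxyPass        /app/ http://localhost:8080/
ProxyPassReverse /app/ http://localhost:8080/
```

The front-end processes stay small and numerous, while the proxied
instance keeps a handful of fat processes with persistent db connections.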

Now, a proposition :

Here is where we are now, with a typical slow query:
PREPARE myquery(text,integer)
EXECUTE myquery('john',2)

My proposition :
PREPARE myquery(text,integer)
PLANNED USING ('john',2)
CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
DEPENDS ON $1, $2
MAXIMUM CACHE TIME '5 minute'::interval
MINIMUM CACHE TIME '1 minute'::interval
MAXIMUM CACHE SIZE 2000000
AS SELECT count(*) as number FROM mytable WHERE myname=$1 AND myfield>=$2;

EXECUTE myquery('john',2)

Explanations:
-----------
PLANNED USING ('john',2)
Tells the planner to compute the stored query plan using the given
parameters. This is independent of caching, but could be a nice feature
as it would avoid the risk of storing a bad query plan.

-----------
CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
Specifies that the result is to be cached. There is an optional condition
(here, IF ...) telling Postgres when it should cache and when it should
not. It could be useful to avoid wasting cache space.
-----------
DEPENDS ON $1, $2
Defines the cache key. I don't know if this is useful, as the query
parameters already make a pretty obvious cache key, so why repeat them?
It could be used to add other data to the cache key, like:
DEPENDS ON (SELECT somefunction($1))
Also, a syntax for specifying which tables should be watched for updates,
and which should be ignored, could be interesting.
-----------
MAXIMUM CACHE TIME '5 minute'::interval
Pretty obvious.
-----------
MINIMUM CACHE TIME '1 minute'::interval
This query is a count and I want a fast but imprecise count. Thus, I
specify a minimum cache time of 1 minute, meaning that the result will
stay in the cache even if the tables change. This is dangerous, so I'd
suggest the following :

MINIMUM CACHE TIME CASE WHEN result.number > 10 THEN '1 minute'::interval
ELSE '5 second'::interval END

Thus the cache time is an expression; it is evaluated after the query has
been performed. There needs to be a way to access the 'count' result,
which I called 'result.number' because of the SELECT count(*) as number.
The result could also be used in the CACHED IF.

The idea here is that the count will vary over time, but we accept some
imprecision to gain speed. Who cares whether a forum thread counter says
225 or 227 messages anyway? However, if there are only 2 messages, caching
the query is less necessary because it's fast in the first place, and a
variation in the count is much easier to spot; thus we specify a shorter
cache duration for small counts and a longer one for large counts.

For queries returning result sets this is of course not usable, but a
special feature for speeding up count() queries would be welcome!
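The result-dependent minimum cache time can be modelled outside the
database; here is a minimal Python sketch (the class name, thresholds and
TTL values are mine, chosen to mirror the CASE expression above, not a
proposed API) where a cached count stays valid longer when the count is
large:

```python
import time

class CountCache:
    """Toy model of the MINIMUM CACHE TIME expression:
    big counts are cached longer than small ones."""

    def __init__(self):
        self._cache = {}  # key -> (count, expiry timestamp)

    def _ttl(self, count):
        # Mirrors: CASE WHEN result.number > 10
        #          THEN '1 minute' ELSE '5 second' END
        return 60.0 if count > 10 else 5.0

    def get(self, key, compute):
        now = time.time()
        hit = self._cache.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]              # entry still fresh: skip the real query
        count = compute()              # run the real (slow) count query
        self._cache[key] = (count, now + self._ttl(count))
        return count

cache = CountCache()
n = cache.get(("thread", 42), lambda: 225)  # first call computes the count
m = cache.get(("thread", 42), lambda: 227)  # served stale from cache: still 225
```

The TTL is computed from the result itself, after the query runs, exactly
as the CASE expression above intends.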

-----------
MAXIMUM CACHE SIZE 2000000
Pretty obvious. Size in bytes.

For queries returning several rows, a MIN/MAX on result rows could also
be useful:
MAXIMUM RESULT ROWS nnn
Or maybe :
CACHE IF (select count(*) from result) > nnn

Thinking about it, using prepared queries seems a bad idea; maybe the
cache should act on the results of functions instead. This would force
application programmers to put the queries they want to optimize in
functions, but as function code is shared between connections while
prepared statements are not, it might be easier to implement, and it
would shield against some subtle bugs, like PREPARing different queries
under the same name...

In that case the cache manager would also know if the function returns
SETOF or not, which would be interesting.

What do you think of these optimizations ?

By the way, right now a count() query cache could already be implemented
as a simple plpgsql function with a table as the cache.
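That table-as-cache idea can be sketched in Python with SQLite standing in
for Postgres (table names, the 60-second max age, and the key format are
all mine for illustration; in plpgsql the logic would be the same, just
written as a function body):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (myname TEXT, myfield INTEGER);
    INSERT INTO mytable VALUES ('john', 3), ('john', 5), ('jane', 1);
    -- the cache itself is just a table, as suggested above
    CREATE TABLE count_cache (key TEXT PRIMARY KEY, number INTEGER, expires REAL);
""")

def cached_count(myname, myfield, max_age=60.0):
    """Return a possibly-stale count, refreshing the cache entry
    when it is missing or older than max_age seconds."""
    key = f"{myname}:{myfield}"
    now = time.time()
    row = conn.execute(
        "SELECT number FROM count_cache WHERE key = ? AND expires > ?",
        (key, now)).fetchone()
    if row:
        return row[0]                  # fresh enough: skip the real count
    (n,) = conn.execute(
        "SELECT count(*) FROM mytable WHERE myname = ? AND myfield >= ?",
        (myname, myfield)).fetchone()
    conn.execute(
        "INSERT OR REPLACE INTO count_cache VALUES (?, ?, ?)",
        (key, n, now + max_age))
    return n
```

Within max_age, callers get the cached count even if mytable has changed
in the meantime, which is exactly the fast-but-imprecise trade-off
described above.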
