Re: Query results caching?

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Ben-Nes Yonatan" <da(at)canaan(dot)co(dot)il>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Sean Davis" <sdavis2(at)mail(dot)nih(dot)gov>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Query results caching?
Date: 2005-08-22 22:24:52
Message-ID: D425483C2C5C9F49B5B7A41F8944154757CFFB@postal.corporate.connx.com
Lists: pgsql-general

> -----Original Message-----
> From: Ben-Nes Yonatan [mailto:da(at)canaan(dot)co(dot)il]
> Sent: Monday, August 22, 2005 3:28 PM
> To: Jim C. Nasby; Sean Davis; Dann Corbit
> Cc: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] Query results caching?
>
> > On Mon, Aug 22, 2005 at 10:13:49PM +0200, Ben-Nes Yonatan wrote:
> >
> >>I think that I was misunderstood, I'll make an example:
> >>Let's say that I'm making the following query for the first time on the
> >>"motorcycles" table, which has an index on the "manufacturer" field:
> >>
> >>EXPLAIN ANALYZE SELECT manufacturer FROM motorcycles WHERE
> >>manufacturer='suzuki';
> >>... Total runtime: 3139.587 ms
> >>
> >>Now I'm doing the same query again and I get a much faster result
> >>(because of the "caching"): Total runtime: 332.53 ms
> >>
> >>After both of those queries I drop the index and query the table again
> >>with the exact same query as before, and now I receive: Total runtime:
> >>216834.871 ms
> >>
> >>And for my last check I run the exact same query again (without creating
> >>the INDEX back again) and I get quite a similar result to my third query:
> >>Total runtime: 209218.01 ms
> >>
> >>
> >>My problem is that (maybe I just don't understand something basic
> >>here...) the last 2 queries (also the second query, but I don't care
> >>about that) were using the "cache" that was created after the first
> >>query (which had an INDEX), so none of them actually showed me what will
> >>happen if a client does such a search (without an INDEX) for the first
> >>time.
> >>
> >>I want to delete that "caching" after I do the first 2 queries so my
> >>next queries will show me "real life results".
> >
> >
>
> Ok, I tried to handle both of your replies together since I got them in 2
> separate emails.
>
> Dann Corbit wrote:
> > These results are all as I would expect. When you delete the index,
> > the query will be forced to do a table scan (to examine every single
> > record in the table one by one). If the table is non-trivial, it is
> > unlikely that either the OS or the database will cache the whole thing
> > in memory. However, when you query a small record set, it is
> > likely to be retained in RAM, which is literally thousands of times
> > faster than disk.
>
> Didn't know that; good to know, though it doesn't reassure me...
> What if I drop the INDEX but create a different INDEX which also makes the
> process a lot faster than without an INDEX, but slower/faster than the one
> before, won't it use the former "caching"?

You can add several indexes to a single table.
If you gather some statistics on the query patterns, you can find out which
indexes are needed to make the queries as fast as possible.
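
For instance, building on your motorcycles example ("model" is an invented
column here, and all the names are only illustrative):

CREATE INDEX idx_moto_manufacturer ON motorcycles (manufacturer);
CREATE INDEX idx_moto_model        ON motorcycles (model);

-- EXPLAIN ANALYZE shows which index, if any, a given query actually uses:
EXPLAIN ANALYZE SELECT * FROM motorcycles WHERE model = 'GSX-R750';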

> > If a search is to be made on a frequent basis, you should create an
> > index.
> > The query results above show you why.
>
> Obvious :)
>
> > Think about this for a minute. The real life results you want are
> > very fast results. For that reason, you should try to model the
> > customer queries as closely as possible. If you have a canned
> > application like order entry, then the real parameterized query set
> > will probably be quite small in real life. If you are creating a
> > server for ad-hoc queries, then it will be far more difficult to model
> > in real life.
> >
> > What is the real purpose of the application that you are writing?
> >
> > Will users be using a pre-programmed front end, or will they be typing
> > in queries free-form for whatever their heart desires?
>
> Ok, I'll try to describe the system as briefly & precisely as possible
> (it's also past midnight here :)).
> Each day I receive about 4 million rows of data (products) which I
> insert into table1 (after I delete all of the previous data it had);
> along with it I receive, for every row, about another 15 keywords which I
> insert into table2 (where, as with table1, I delete all of the previous
> data it had too); this process is a fact that I can't change.

If the data arrives on a daily basis and is not updated until the next
day, I suggest creating a lot of indexes and clustering on the index used
most frequently.
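
A rough sketch of what the daily reload could then look like (the column
and index names are assumptions, not your real schema):

-- after the daily bulk load of table1:
CREATE INDEX idx_table1_name ON table1 (name);
CLUSTER idx_table1_name ON table1;  -- physically order the rows by the hot index
ANALYZE table1;                     -- refresh the planner statistics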

What exactly are the 15 keywords in the second table for?
Are they column names?
Are they categories for the first table?
Why is the second table necessary at all?

> Now the users of the site can search for data from table1 by typing
> whichever words they want (up to 4) into a text field (search input
> string), and the server should display the correct results by querying
> table1 & joining table2 for its keywords.

Can you give the exact table definitions for the two tables, and also
the most likely queries you are going to receive?

When the users type in keywords -- can these keywords be applied against
any column in the table, only against a single column, against a small set
of columns, or something else?

> I succeeded in doing it quite fast, but when I tried to ORDER BY my
> results the times jumped up drastically (2-3 seconds for a query... and
> that's after the caching..).

ORDER BY will complicate things quite a bit. I have not tried it on
PostgreSQL, but if you know the result set is small, one technique is to
select into a temp table and then ORDER BY on the temp table. It works
well on other database systems (caveat: it has been a while since I
worked as a DBA, and I have not worked as a DBA on PostgreSQL).
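
In PostgreSQL that technique might look something like this (table1,
table2, and the join/sort columns are assumptions, not your real schema):

-- Materialize the (hopefully small) result set first, then sort it:
CREATE TEMP TABLE matches AS
    SELECT t1.*
    FROM table1 t1
    JOIN table2 t2 ON t2.product_id = t1.id   -- assumed join key
    WHERE t2.keyword = 'suzuki';

SELECT * FROM matches ORDER BY price;          -- assumed sort column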

> I can't allow a situation where a user searches with a keyword which
> wasn't 'cached' before and because of that waits 15 seconds for a
> result.

You might try throwing hardware at it. A 4-CPU AMD64 machine with a
striped Ultra 320 SCSI disk array and a few gigabytes of RAM will
perform admirably.

You might want more than is possible. If you have 4 million rows, and
each row is 1 KB, then that is 4 GB. If your users run a query that has
not been performed yet and you have to do a table scan, then you cannot
expect sub-second response times, because that won't be physically
possible on any system.

If you know what most queries will look like, or if you only have a few
character columns (so that you can make an index on each of them), and if
you can put a unique clustered index on the most important (most
frequently used) item, then you can get the majority of your queries to
run very quickly; only on rare occasions will a query be slow.
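
As a sketch (the column names are invented, and note that PostgreSQL has
no clustered-index attribute as such -- you CLUSTER the table on an index
once, typically after each bulk load):

CREATE INDEX idx_table1_brand ON table1 (brand);
CREATE INDEX idx_table1_color ON table1 (color);
CREATE UNIQUE INDEX idx_table1_sku ON table1 (sku);
CLUSTER idx_table1_sku ON table1;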

If I have a 4 million row table with long rows and big varchar columns,
and I run a query on a column like this:

SELECT * FROM inventory WHERE product LIKE '%Table%'

It isn't going to be fast on any system with any database.
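
For contrast, a pattern anchored at the start of the string can use a
btree index in PostgreSQL (built with varchar_pattern_ops, or in the C
locale), while a leading '%' forces a sequential scan no matter what:

CREATE INDEX idx_inventory_product
    ON inventory (product varchar_pattern_ops);

SELECT * FROM inventory WHERE product LIKE 'Table%';  -- can use the index
SELECT * FROM inventory WHERE product LIKE '%Table%'; -- still a full scan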
[snip]
