From: PFC <lists(at)peufeu(dot)com>
To: "John Beaver" <john(dot)e(dot)beaver(at)gmail(dot)com>, Matthew <matthew(at)flymine(dot)org>
Cc: Pgsql-Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: large tables and simple "= constant" queries using indexes
Date: 2008-04-10 21:37:50
Message-ID: op.t9ezpch9cigqcu@apollo13.peufeu.com
Lists: pgsql-performance
> Thanks a lot, all of you - this is excellent advice. With the data
> clustered and statistics at a more reasonable value of 100, it now
> reproducibly takes even less time - 20-57 ms per query.
1000x speedup with proper tuning - always impressive, lol.
IO seeks are always your worst enemy.
> After reading the section on "Statistics Used By the Planner" in the
> manual, I was a little concerned that, while the statistics sped up the
> queries that I tried immeasurably, that the most_common_vals array was
> where the speedup was happening, and that the values which wouldn't fit
> in this array wouldn't be sped up. Though I couldn't offhand find an
> example where this occurred, the clustering approach seems intuitively
> like a much more complete and scalable solution, at least for a
> read-only table like this.
Actually, with the statistics target set to 100, up to 100 values will be stored in
most_common_vals. This means that any value not in most_common_vals has a
frequency below 1%, and probably much lower than that. The choice of plan for
these rare values is pretty simple.
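For illustration, here is roughly how you would raise the per-column target and
see what the planner actually stored (just a sketch; the column name gene_ref is
only guessed from the index name quoted above):

  ALTER TABLE gene_prediction_view ALTER COLUMN gene_ref SET STATISTICS 100;
  ANALYZE gene_prediction_view;

  -- what did ANALYZE collect for that column?
  SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
   WHERE tablename = 'gene_prediction_view' AND attname = 'gene_ref';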
With two columns, "interesting" stuff can happen: for instance, if you have col1
in [1...10] and col2 in [1...10] and use a condition on col1=const AND
col2=const, the selectivity of the result depends not only on the distributions
of col1 and col2 but also on their correlation.
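A small self-contained sketch of this (table and column names made up): the
planner simply multiplies the two selectivities, so with perfectly correlated
columns it underestimates the row count by a factor of 10 here:

  -- col2 always equals col1, both uniform over 1..10
  CREATE TABLE corr_demo AS
    SELECT (i % 10) + 1 AS col1, (i % 10) + 1 AS col2
      FROM generate_series(1, 100000) AS s(i);
  ANALYZE corr_demo;

  -- estimate is ~100000 * 1/10 * 1/10 = 1000 rows,
  -- but the query actually matches ~10000 rows
  EXPLAIN SELECT * FROM corr_demo WHERE col1 = 3 AND col2 = 3;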
As for the tests you did, it's hard to say without seeing the EXPLAIN ANALYZE
output. If you change the statistics but the plan choice (shown by EXPLAIN)
stays the same, and you use the same values in your query, then any difference
in timing comes from caching, since postgres is executing the same plan and
therefore doing the exact same thing. Caching (by PG and by the OS) can make
the timings vary a lot.
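An easy way to see this is to run the exact same statement twice with EXPLAIN
ANALYZE (the query and constant below are only an example): the plan is the
same both times, only the second run hits warm caches.

  EXPLAIN ANALYZE
  SELECT * FROM gene_prediction_view WHERE gene_ref = 12345;

  -- run it again: same plan, but the pages are now in shared_buffers /
  -- the OS cache, so the reported runtime can drop dramatically
  EXPLAIN ANALYZE
  SELECT * FROM gene_prediction_view WHERE gene_ref = 12345;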
> - Trying the same constant a second time gave an instantaneous result,
> I'm guessing because of query/result caching.
PG does not cache queries or results. It caches data & index pages in its
shared buffers, and then the OS adds another layer of the usual disk cache.
A simple query like selecting one row based on PK takes about 60
microseconds of CPU time, but if it needs one seek for the index and one
for the data it may take 20 ms waiting for the moving parts to move...
Hence, CLUSTER is a very useful tool.
Bitmap index scans love clustered tables because all the interesting rows
end up grouped together, so far fewer pages need to be visited.
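For reference, clustering on the index you mentioned would look something like
this (8.3 syntax; older versions spell it CLUSTER indexname ON tablename).
Keep in mind CLUSTER is a one-shot reordering and is not maintained as new
rows come in, so re-run it after large data loads:

  CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
  ANALYZE gene_prediction_view;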
> - I didn't try decreasing the statistics back to 10 before I ran the
> cluster command, so I can't show the search times going up because of
> that. But I tried killing the 500 meg process. The new process uses less
> than 5 megs of ram, and still reproducibly returns a result in less than
> 60 ms. Again, this is with a statistics value of 100 and the data
> clustered by gene_prediction_view_gene_ref_key.
Killing it, or just restarting postgres?
If you let postgres run (not idle) for a while, naturally it will fill
the RAM up to the shared_buffers setting that you specified in the
configuration file. This is good, since grabbing data from postgres' own
cache is faster than having to make a syscall to the OS to get it from the
OS disk cache (or disk). This isn't bloat.
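You can check how much the server is allowed to keep in its own cache with the
following (the 512MB value is just an example, not a recommendation):

  SHOW shared_buffers;

  -- set in postgresql.conf, e.g.:
  --   shared_buffers = 512MB
  -- (needs a server restart to take effect)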
But what those 500 MB versus 6 MB show is that before tuning, postgres had to
read a lot of data for your query, so that data stayed in the cache; after
tuning it needs to read much less data (thanks to CLUSTER), so the cache
stays small.