From: | John Beaver <john(dot)e(dot)beaver(at)gmail(dot)com> |
---|---|
To: | Matthew <matthew(at)flymine(dot)org> |
Cc: | Pgsql-Performance <pgsql-performance(at)postgresql(dot)org> |
Subject: | Re: large tables and simple "= constant" queries using indexes |
Date: | 2008-04-10 14:44:59 |
Message-ID: | 47FE27EB.8050804@gmail.com |
Lists: | pgsql-performance |
Thanks a lot, all of you - this is excellent advice. With the data
clustered and statistics at a more reasonable value of 100, it now
reproducibly takes even less time - 20-57 ms per query.
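For reference, the relevant commands were along these lines (paraphrasing; the column name gene_ref is taken from the index name, so adjust if yours differs):

    ALTER TABLE gene_prediction_view ALTER COLUMN gene_ref SET STATISTICS 100;
    CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
    ANALYZE gene_prediction_view;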
After reading the section on "Statistics Used by the Planner" in the
manual, I was a little concerned that, while the higher statistics target
sped up the queries I tried enormously, the speedup was coming from the
most_common_vals array, and that values which didn't fit in that array
wouldn't be sped up. Though I couldn't offhand find an example where this
occurred, the clustering approach seems intuitively like a much more
complete and scalable solution, at least for a read-only table like this.
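For anyone who wants to see what actually landed in that array, pg_stats shows it per column; a quick sketch, assuming the column is gene_ref:

    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'gene_prediction_view' AND attname = 'gene_ref';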
As to whether the entire index/table was getting into ram between my
statistics calls, I don't think this was the case. Here's the behavior
that I found:
- With statistics at 10, the query took 25 (or so) seconds no matter how
many different values I tried. The query plan was the same as with the
200 and 800 statistics settings below.
- Trying the same constant a second time gave an instantaneous result,
I'm guessing because of query/result caching.
- Immediately on increasing the statistics to 200, the query reproducibly
took less time. I tried about 10 different values.
- Immediately on increasing the statistics to 800, the query
reproducibly took less than a second every time. I tried about 30
different values.
- Decreasing the statistics to 100 and running the cluster command
brought it to 57 ms per query.
- The Activity Monitor (OSX) lists the relevant postgres process as
taking a little less than 500 megs.
- I didn't try decreasing the statistics back to 10 before I ran the
cluster command, so I can't show the search times going up because of
that. But I tried killing the 500 meg process. The new process uses less
than 5 megs of ram, and still reproducibly returns a result in less than
60 ms. Again, this is with a statistics value of 100 and the data
clustered by gene_prediction_view_gene_ref_key.
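For what it's worth, the timings above are for queries of roughly this shape, checked with EXPLAIN ANALYZE (12345 is just an arbitrary gene_ref value):

    EXPLAIN ANALYZE
    SELECT * FROM gene_prediction_view WHERE gene_ref = 12345;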
And I'll consider the idea of using triggers with an ancillary table for
other purposes; it seems like it could be a useful solution for something.
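If I do go that route, I'm picturing something like the following rough sketch, e.g. keeping a per-gene_ref row count up to date (the ancillary table, function, and trigger names here are made up):

    CREATE TABLE gene_ref_counts (
        gene_ref integer PRIMARY KEY,
        n        bigint NOT NULL DEFAULT 0
    );

    CREATE OR REPLACE FUNCTION maintain_gene_ref_counts() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE gene_ref_counts SET n = n + 1 WHERE gene_ref = NEW.gene_ref;
            IF NOT FOUND THEN
                INSERT INTO gene_ref_counts (gene_ref, n) VALUES (NEW.gene_ref, 1);
            END IF;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE gene_ref_counts SET n = n - 1 WHERE gene_ref = OLD.gene_ref;
        END IF;
        RETURN NULL;  -- AFTER trigger, so the return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER gene_ref_counts_trg
        AFTER INSERT OR DELETE ON gene_prediction_view
        FOR EACH ROW EXECUTE PROCEDURE maintain_gene_ref_counts();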
Matthew wrote:
> On Thu, 10 Apr 2008, PFC wrote:
>
> ... Lots of useful advice ...
>
>> - If you often query rows with the same gene_ref, consider using
>> CLUSTER to physically group those rows on disk. This way you can get
>> all rows with the same gene_ref in 1 seek instead of 2000. Clustered
>> tables also make Bitmap scan happy.
>
> In my opinion this is the one that will make the most difference. You
> will need to run:
>
> CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
>
> after you insert significant amounts of data into the table. This
> re-orders the table according to the index, but new data is always
> written out of order, so after adding lots more data the table will
> need to be re-clustered again.
>
>> - Switch to a RAID10 (4 times the IOs per second, however zero gain
>> if you're single-threaded, but massive gain when concurrent)
>
> Greg Stark has a patch in the pipeline that will change this, for
> bitmap index scans, by using fadvise(), so a single thread can utilise
> multiple discs in a RAID array.
>
> Matthew
>