Quick Links

Re: How does PG know if data is in memory?

From:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To:	Fabrício dos Anjos Silva <fabricio(dot)silva(at)linkcom(dot)com(dot)br>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: How does PG know if data is in memory?
Date:	2010-10-01 12:24:21
Message-ID:	4CA5D2F5.5070900@postnewspapers.com.au
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

On 1/10/2010 7:12 PM, Fabrício dos Anjos Silva wrote:
>
> Thank you all for the replies.
>
> If PG does not know whether needed data is in memory, how does it
> estimate cost? There is a huge difference between access time in memory
> and in secondary storage. Not taking this into account results in almost
> "useless" estimates.

It's generally configured with the conservative assumption that data
will have to come from disk.

Note that the query planner's job isn't to figure out how long the query
will take. It's to compare various possible query plans and decide which
will be fastest. There are certainly cases where knowing what's cached
would help with this - for example: if an index is cached but the table
data isn't, it's more likely to be worth using the index to reduce disk
reads. But I don't know just how much difference it really makes.

Because the query often only wants a small subset of the data, and whole
relations are rarely fully cached, it's not enough to know that "some of
relation X is cached", it has to know if the cached parts are the parts
that'll be required, or at least an approximation of that. It sounds
horrendously complicated to keep track of to me, and in the end it won't
make query execution any faster, it'll just potentially help the planner
pick a better plan. I wonder if that'd be worth the extra CPU time spent
managing the cache and cache content stats, and using those cache stats
when planning? It'd be an interesting experiment, but the outcome is
hardly obvious.

As you can see, I don't really agree that the planner's estimates are
useless just because it's not very aware of the cache's current
contents. It has a pretty good idea of the system's memory and how much
of that can be used for cache, and knows how big various indexes and
relations are. That seems to work pretty well.

If some kind of cache awareness was to be added, I'd be interested in
seeing a "hotness" measure that tracked how heavily a given
relation/index has been accessed and how much has been read from it
recently. A sort of age-scaled blocks-per-second measure that includes
both cached and uncached (disk) reads. This would let the planner know
how likely parts of a given index/relation are to be cached in memory
without imposing the cost of tracking the cache in detail. I'm still not
sure it'd be all that useful, though...

> I am not saying that PG does a pour job, but I've
> been using it for 4 years and from time to time I notice very pour
> estimates.

Most of the issues reported here, at least, are statistics issues,
rather than lack of knowledge about cache status. The planner thinks
it'll find (say) 2 tuples maching a filter, and instead finds 100,000,
so it chooses a much less efficient join type. That sort of thing is
really independent of the cache state.

> Recently, I faced poor performance again, but this time because we
> started to work with larger tables (10M rows). This encourage me to
> study PG tuning again, trying to understand how the planner works and
> trying to get the best of it. Unfortunately, it does not seem to be an
> easy task.

No argument there! Like any database there's a fair bit of black magic
involved, and a whole lot of benchmarking. The key thing is to have
appropriate statistics (usually high), get a reasonable random_page_cost
and seq_page_cost, to set your effective cache size appropriately, and
to set reasonable work_mem.

"Reasonable" is hard to work out for work_mem, because Pg's work_mem
limit is per-sort (etc) not per-query or per-backend. I understand that
making it per-query is way, way harder than it sounds at face value,
though, so we must make do.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

In response to

Re: How does PG know if data is in memory? at 2010-10-01 11:12:36 from Fabrício dos Anjos Silva

Responses

Re: How does PG know if data is in memory? at 2010-10-01 13:45:34 from Pierre C
Re: How does PG know if data is in memory? at 2010-10-01 13:46:07 from Kevin Grittner
Re: How does PG know if data is in memory? at 2010-10-04 03:29:03 from Greg Smith

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Pierre C	2010-10-01 13:45:34	Re: How does PG know if data is in memory?
Previous Message	Fabrício dos Anjos Silva	2010-10-01 11:12:36	Re: How does PG know if data is in memory?