From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: our buffer replacement strategy is kind of lame
Date: 2011-08-12 04:05:31
Message-ID: CA+Tgmob_8_wiUV9qkF3FgMf2h7fv93=0xap15B+2DeFK8r3kPg@mail.gmail.com
Lists: pgsql-hackers
While I was poking around at the index-only scans patch, I couldn't
help noticing that our buffer replacement algorithm leaves something
to be desired. Here's the query again:
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
Since sample_data is not indexed, the plan sequentially scans that table
and does an index-only scan of pgbench_accounts for each aid. When this is
done, here's what pg_buffercache has to say:
rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
usagecount | sum
------------+-------
1 | 136
2 | 12
3 | 72
4 | 7
5 | 13755
| 37218
(6 rows)
Only 96 of the 14,286 buffers in sample_data are in shared_buffers,
despite the fact that we have 37,218 *completely unused* buffers lying
around. That sucks, because it means that the sample query did a
whole lot of buffer eviction that was completely needless. The ring
buffer is there to prevent trashing the shared buffer arena, but here
it would have been fine to trash the arena: there wasn't anything
there we would have minded losing (or, indeed, anything at all). On
the other hand, the buffer manager has *no problem at all* trashing
the buffer arena if we're faulting in pages for an index scan rather
than a sequential scan. If you manage to get all of sample_data into
memory (by running many copies of the above query in parallel, you can
get each one to allocate its own ring buffer, and eventually pull in
all the pages), and then run some query that probes an index which is
too large to fit in shared_buffers, it cheerfully blows the whole
sample_data table out without a second thought. Had you sequentially
scanned a big table, of course, there would be some protection, but an
index scan can stomp all over everything with complete impunity.
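The ring-buffer behavior described above can be sketched roughly like this
(a toy model with invented names, not PostgreSQL's actual code, which lives
in src/backend/storage/buffer/freelist.c):

```python
# Illustrative sketch: a sequential scan allocates through a small private
# ring of buffers and then recycles them, so it never claims more than
# RING_SIZE slots from the shared arena -- even when the arena is full of
# completely unused buffers.

RING_SIZE = 32  # order of magnitude of PostgreSQL's bulk-read ring


class RingStrategy:
    def __init__(self, size=RING_SIZE):
        self.ring = [None] * size
        self.pos = 0
        self.shared_allocations = 0

    def get_buffer(self):
        """Return a buffer for the next page: allocate from the shared
        arena only on the first lap, then recycle our own slots."""
        if self.ring[self.pos] is None:
            self.ring[self.pos] = object()  # stand-in for a shared buffer
            self.shared_allocations += 1
        buf = self.ring[self.pos]
        self.pos = (self.pos + 1) % len(self.ring)
        return buf


# Scanning all 14,286 pages of sample_data through the ring touches the
# shared arena only RING_SIZE times; every later eviction happens inside
# the ring itself, which is why so little of the table survives the scan.
strategy = RingStrategy()
for page in range(14286):
    strategy.get_buffer()
print(strategy.shared_allocations)  # 32
```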
The general problem here is that we are not very smart about handling
workloads with weak locality - i.e. the working set is larger than
shared buffers. If the working set fits in shared_buffers, we will
keep it there, and it will be strongly protected against eviction.
But if the working set *doesn't* fit in shared buffers, then we fall
back on some not-very-smart heuristics to decide what to do: keep
pages involved in index scans, ditch those involved in sequential
scans.
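The heuristic in question can be caricatured in a few lines (a hedged
sketch; the real test lives in PostgreSQL's heap-scan startup code, and
the one-quarter threshold shown here is approximate):

```python
# Hedged sketch of the current heuristic: big sequential scans are routed
# through a small ring so they can't trash the arena, while index scans
# always allocate straight from shared buffers -- regardless of whether
# the arena actually holds anything worth protecting.

def choose_strategy(scan_type, relation_pages, shared_buffer_pages):
    if scan_type == "seq" and relation_pages > shared_buffer_pages // 4:
        return "ring"    # bulk-read ring: protects the arena
    return "shared"      # normal allocation: free to evict anything

print(choose_strategy("seq", 14286, 16384))    # ring
print(choose_strategy("index", 14286, 16384))  # shared
```

Note that nothing in the decision looks at what the arena currently
contains, which is exactly the complaint above.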
It seems to me that we need a smarter algorithm. NetBSD has adopted
Clock-Pro, and there are some discussions of it being used in Linux as
well, though I'm not sure whether that's actually (still?) the case.
The paper is here, though I find its description of the algorithm to be
about as clear as mud:
http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf
One of the important notions which many of the algorithms in the
literature seem to hit on in one way or another is that you mustn't go
and decide that all of your buffers - or nearly all your buffers - are
hot. That's really equivalent to saying that none of your buffers are
hot; and it's also pretty easy to see that such a scenario would be
disastrous for our current algorithm, which - if everything in
shared_buffers has usage-count 5 all the time - will have to clock
sweep a long way each time it needs to allocate a buffer. In fact,
the more of your buffers want to be hot (and thus hard to evict), the
fewer of them should actually be allowed to acquire that status.
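A toy model makes the cost concrete (invented code, not the real clock
sweep, but it mimics the saturating usage counts):

```python
# Toy model of the clock sweep under a fully "hot" pool: when every
# buffer sits at the maximum usage count, the hand must decrement its
# way through five complete revolutions of the pool before any buffer
# becomes evictable.

MAX_USAGE = 5  # PostgreSQL saturates usage counts at 5


def steps_until_first_victim(n_buffers, start_count=MAX_USAGE):
    counts = [start_count] * n_buffers
    hand = 0
    steps = 0
    while counts[hand] != 0:
        counts[hand] -= 1
        hand = (hand + 1) % n_buffers
        steps += 1
    return steps


# Everything at usage count 5: five full sweeps before the first victim.
print(steps_until_first_victim(1000))                 # 5000
# Plenty of cold buffers: the very first buffer examined is evictable.
print(steps_until_first_victim(1000, start_count=0))  # 0
```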
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company