Bug: Buffer cache is not scan resistant

From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>, "Doug Rady" <drady(at)greenplum(dot)com>, "Sherry Moore" <sherry(dot)moore(at)sun(dot)com>
Subject: Bug: Buffer cache is not scan resistant
Date: 2007-03-04 19:11:26
Message-ID: C21059DE.2862E%llonergan@greenplum.com
Lists: pgsql-hackers

I'm putting this out there before we publish a fix so that we can discuss
how best to fix it.

Doug and Sherry recently found the source of an important performance issue
with the Postgres shared buffer cache.

The issue is summarized like this: the buffer cache in PGSQL is not "scan
resistant" as advertised. A sequential scan of a table larger than cache
will pollute the buffer cache in almost all circumstances.

Here is performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500
(thumper-3) with 4 cores where "bigtable" is a table 2x the size of RAM and
"memtable" is a table that fits into I/O cache:

With our default setting of shared_buffers (16MB):

Operation          memtable     bigtable
-----------------------------------------
SELECT COUNT(*)    1221 MB/s     973 MB/s
VACUUM             1709 MB/s    1206 MB/s

We had previously observed that VACUUM performs better when run immediately
after a SELECT. In the example above, the faster post-SELECT rate from disk
was 1608 MB/s, compared to the normal rate of 1206 MB/s.

We verified this behavior on Postgres 8.2 as well. The buffer selection
algorithm is choosing buffer pages scattered throughout the buffer cache in
almost all circumstances.

Sherry traced the behavior to the processor repeatedly flushing the L2
cache. Doug found that we weren't using the Postgres buffer cache the way
we expected: we were loading the scanned data from disk into the cache even
though there was no possibility of reusing it. Besides pushing other,
possibly useful pages out of the cache, this also invalidates the L2 cache
for the remainder of the executor path that uses the data.

To prove that the buffer cache was the source of the problem, we dropped
shared_buffers to 680KB so that it fits within L2 cache (1MB per Opteron
core), and this is what we saw:

Operation          memtable     bigtable
-----------------------------------------
SELECT COUNT(*)    1320 MB/s    1059 MB/s
VACUUM             3033 MB/s    1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm? Ideally,
only one page (or a small number of pages) of the buffer cache would be
used when the table to be scanned is (significantly?) larger than the size
of the buffer cache.

- Luke
