From: Gerhard Wiesinger <lists(at)wiesinger(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
Date: 2009-10-03 07:40:59
Message-ID: alpine.LFD.2.00.0910030913240.16434@bbs.intern
Lists: pgsql-general
On Fri, 2 Oct 2009, Simon Riggs wrote:
>
> On Sun, 2009-09-27 at 18:05 +0200, Gerhard Wiesinger wrote:
>
>> So I saw, that even on sequential reads (and also on bitmap heap scan acces)
>> PostgreSQL uses only 8k blocks. I think that's a major I/O bottleneck.
>>
>> A commercial software database vendor solved the problem by reading multiple
>> continuous blocks by multiple 8k blocks up to a maximum threshold. Output per 5
>> seconds on an equivalent "sequence scan":
>
> Is systemtap counting actual I/Os or just requests to access 8192 blocks
> once in OS cache? Postgres doesn't read more than one block at a time
> into its buffer pool, so those numbers of requests look about right.
>
As far as I know these are VFS reads, so some reads might be served from
the cache. But since I ran

   echo 3 > /proc/sys/vm/drop_caches; service postgresql restart

before each benchmark, all requests should come from disk. I did the same
for all benchmarks because I don't want to measure the cache performance
of the OS or of the DB. Therefore all requests (except blocks read twice
or more, but that shouldn't happen here and would also be fine as a cache
hit) are served from disk and not from the cache.
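For completeness, a minimal wrapper for such a cold-cache run might look
like this (the benchmark query and database name are only placeholders;
note the `sync` before dropping caches so dirty pages are written back
first):

```shell
#!/bin/sh
# Cold-cache benchmark run: flush dirty pages, drop the OS page cache,
# and restart PostgreSQL so its shared buffers are empty as well.
sync
echo 3 > /proc/sys/vm/drop_caches
service postgresql restart

# Placeholder benchmark query - replace with the actual workload.
time psql -c "SELECT count(*) FROM some_large_table;" somedb
```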
> There is belief here that multi-block I/O was introduced prior to OS
> doing this as a standard mechanism. Linux expands its read ahead window
> in response to sequential scans and so this seems like something we
> don't want to do in the database.
I even experimented with large block-device readahead values on /dev/md*,
/dev/sd* and /dev/dm-*, as well as with stripe_cache_size on /dev/md*, but
without any performance improvement in the benchmark scenarios.
=> Readahead/caching doesn't seem to help, at least not in the BITMAP HEAP
SCAN scenarios on a nearly up-to-date Linux kernel.
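For reference, the kind of tuning I mean is the following (device names
and values are only examples):

```shell
# Query and set the block-device readahead (in 512-byte sectors).
blockdev --getra /dev/md0
blockdev --setra 16384 /dev/md0

# Increase the stripe cache of an md RAID5/6 array (in pages per device).
echo 8192 > /sys/block/md0/md/stripe_cache_size
```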
But I think such block-size issues (reading in blocks as large as
possible) have to be optimized at the application level (in our case the
DB level), because:
1.) We can't assume that the OS and the storage behave well in such
scenarios.
2.) We can't assume that the OS/storage is intelligent enough to notice
when two randomly requested blocks happen to be two sequential blocks on
disk, merge the requests, and thereby reduce the number of IOPS.
3.) I think such logic should be very easy to integrate, and it has even
been done already in some patches.
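The merging in point 2 is cheap to do at the DB level once the block
numbers are known (a bitmap heap scan already visits pages in ascending
order). A minimal sketch of the idea, not PostgreSQL's actual code; the
8k page size and the per-I/O cap are the only assumptions:

```python
BLOCK_SIZE = 8192  # PostgreSQL's default page size


def coalesce(block_numbers, max_blocks=16):
    """Merge block numbers into (start, count) runs so contiguous 8k
    pages can be fetched with one larger read, capped at max_blocks
    pages per I/O."""
    runs = []
    for b in sorted(block_numbers):
        if runs and b == runs[-1][0] + runs[-1][1] and runs[-1][1] < max_blocks:
            # Block extends the current run: grow it by one page.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # Gap (or cap reached): start a new run.
            runs.append((b, 1))
    return runs


# Six single 8k requests collapse into three larger I/Os:
print(coalesce([7, 1, 2, 3, 8, 10]))  # [(1, 3), (7, 2), (10, 1)]
```

With runs like these, one read() of count * 8192 bytes replaces count
separate 8k reads, which is exactly the IOPS reduction described above.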
>
> It's possible this is wrong. Is the table being scanned fairly sizable
> and was it allocated contiguously? i.e. was it a large table loaded via
> COPY?
>
> I also wonder if more L2 cache effects exist.
>
What do you mean by "table being scanned fairly sizable"? I don't get it.
The table was filled with a lot of inserts, but all at a single point in
time.
Ciao,
Gerhard