Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Mark Mielke" <mark(at)mark(dot)mielke(dot)cc>, "Carlo Stonebanks" <stonec(dot)register(at)sympatico(dot)ca>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1
Date: 2007-09-13 17:46:42
Message-ID: 877imua265.fsf@oxford.xeocode.com
Lists: pgsql-performance


"Gregory Stark" <stark(at)enterprisedb(dot)com> writes:

> "Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
>
>> Right now the pattern for index scan goes like this:
>>
>> - Find qualifying TID in index
>> - Seek to TID location in relfile
>> - Acquire tuple from relfile, return
>>...
>> If we implement AIO and allow for multiple pending I/Os used to prefetch
>> groups of qualifying tuples, basically a form of random readahead
>
> Ah, I see what you mean now. It makes a lot more sense if you think of it for
> bitmap index scans. So, for example, the bitmap index scan could stream tids
> to the executor and the executor would strip out the block numbers and pass
> them to the i/o layer saying "i need this block now but following that I'll
> need these blocks so get them moving now".

Wow. I've done some preliminary testing here on Linux using posix_fadvise, and
on Solaris using libaio, to prefetch blocks and then access them randomly, and
I think there's a lot of low-hanging fruit here.

The use case where this helps is indeed a raid array where you're not maxing
out the array's bandwidth and you care about transaction latency; perhaps a
narrow use case, but still quite common.

Since our random access is synchronous, we have to wait for one seek, process
that page, then wait for the next seek on another drive that was sitting idle
while we processed the first page. By prefetching the pages we'll need next, we
can get all the members of the array working for us simultaneously, even if
they're all doing seeks.
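
The prefetching read loop on the Linux side looks roughly like this (a
simplified sketch rather than the actual test program; the block list and the
prefetch depth are just parameters here):

#define _XOPEN_SOURCE 600       /* for posix_fadvise and pread */
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Read "blocks" in the given (random) order, keeping "depth" fadvise
 * hints outstanding so several drives can be seeking at once. */
static void read_with_prefetch(int fd, const off_t *blocks, int nblocks,
                               int depth)
{
    char    buf[BLCKSZ];
    int     i;

    /* prime the prefetch window */
    for (i = 0; i < depth && i < nblocks; i++)
        (void) posix_fadvise(fd, blocks[i] * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);

    for (i = 0; i < nblocks; i++)
    {
        /* keep the window "depth" blocks ahead of the synchronous reads */
        if (i + depth < nblocks)
            (void) posix_fadvise(fd, blocks[i + depth] * BLCKSZ, BLCKSZ,
                                 POSIX_FADV_WILLNEED);

        (void) pread(fd, buf, BLCKSZ, blocks[i] * BLCKSZ);
    }
}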

What I've done is write a test program which generates a 1G file, syncs it,
and drops the caches (the cache drop isn't working yet on Solaris, but that
doesn't seem to affect the results), then picks 4096 8k buffers and reads them
in random order. The machines it's running on have a small raid array with 4
drives.
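
The setup side is nothing fancy, roughly the following (again a sketch, not
the program itself; the drop_caches trick is the Linux-only part that doesn't
work on Solaris yet):

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILESIZE (1024L * 1024 * 1024)      /* 1G test file */
#define BLCKSZ   8192
#define NBUFS    4096                       /* buffers to read back */

/* Write a 1G file of zeroes, fsync it, and drop the Linux page cache
 * so the subsequent reads really hit the disks (needs root). */
static void setup_testfile(const char *path)
{
    char    block[BLCKSZ];
    int     fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    long    i;
    FILE   *dc;

    memset(block, 0, sizeof(block));
    for (i = 0; i < FILESIZE / BLCKSZ; i++)
        (void) write(fd, block, BLCKSZ);
    fsync(fd);
    close(fd);

    dc = fopen("/proc/sys/vm/drop_caches", "w");
    if (dc)
    {
        fputs("3\n", dc);
        fclose(dc);
    }
}

/* Pick NBUFS random block numbers within the file to read back. */
static void pick_blocks(off_t *blocks)
{
    long    nblocks = FILESIZE / BLCKSZ;
    int     i;

    for (i = 0; i < NBUFS; i++)
        blocks[i] = random() % nblocks;
}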

Just seeking without any prefetch takes about 12.5s on Linux and 13.5s on
Solaris. If I prefetch even a single buffer ahead using posix_fadvise or libaio
I see a noticeable improvement, over 25%. At 128 buffers of prefetch both
systems are down to about 2.5-2.7s. That's on the small raid array. On the
boot drive both systems show only a small benefit, and only at very large
prefetch set sizes, which I would chalk up to the kernel being able to re-order
the reads even if it can't overlap them.

I want to test how much of this effect evaporates when I compare it to a
bitmap-index-style scan, but that depends on a lot of factors, like the exact
pattern in which the database files were extended on disk. In any case bitmap
index scans get us the reordering effect, but not the overlapping i/o requests,
assuming the blocks are spread quite far apart in the data files.

> I think this seems pretty impractical for regular (non-bitmap) index probes
> though. You might be able to do it sometimes but not very effectively and you
> won't know when it would be useful.

How useful this is depends a lot on how invasively we let it infect things
like regular index scans. If we can prefetch right siblings and deeper index
pages as we descend an index tree, plus future heap pages, it could help a
lot, since those aren't sorted the way a bitmap index scan's are. But even if
we only prefetch the heap pages together before processing them, it could be a
big help.

Incidentally, we do need to try to make use of both interfaces: Solaris
doesn't have posix_fadvise as far as I can tell, and Linux's libaio doesn't
support non-O_DIRECT files.
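
For reference, the aio side of the test does roughly the following, written
here against the portable <aio.h> interface (a sketch; I'm assuming POSIX aio
as a stand-in for whatever aio library the platform provides, and the real
program limits the in-flight window rather than queueing everything at once):

#include <sys/types.h>
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>

#define BLCKSZ 8192

/* Queue aio reads for all the listed blocks, then collect them as they
 * complete.  Uses POSIX aio (<aio.h>); link with -lrt on Linux. */
static void aio_read_blocks(int fd, const off_t *blocks, int nblocks)
{
    struct aiocb *cbs = calloc(nblocks, sizeof(struct aiocb));
    char         *bufs = malloc((size_t) nblocks * BLCKSZ);
    int           i;

    for (i = 0; i < nblocks; i++)
    {
        cbs[i].aio_fildes = fd;
        cbs[i].aio_offset = blocks[i] * BLCKSZ;
        cbs[i].aio_buf = bufs + (size_t) i * BLCKSZ;
        cbs[i].aio_nbytes = BLCKSZ;
        cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;  /* poll, no signal */
        (void) aio_read(&cbs[i]);       /* returns immediately */
    }

    for (i = 0; i < nblocks; i++)
    {
        const struct aiocb *const list[1] = { &cbs[i] };

        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(list, 1, NULL); /* wait for this read to finish */
        (void) aio_return(&cbs[i]);
    }

    free(bufs);
    free(cbs);
}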

Raw data (elapsed time in seconds):

Blocks        Linux           Solaris
Prefetched    posix_fadvise   libaio
----------    -------------   -------
         1       12.473        13.597
         2        9.053         9.830
         4        6.787         7.594
         8        5.303         6.588
        16        4.209         5.120
        32        3.388         4.014
        64        2.869         3.216
       128        2.515         2.710
       256        2.312         2.327
       512        2.168         2.099
      1024        2.139         1.974
      2048        2.242         1.903
      4096        2.222         1.890

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
