Re: adding support for posix_fadvise()

From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: adding support for posix_fadvise()
Date: 2003-11-03 09:21:36
Message-ID: 1067851295.2580.12.camel@fuji.krosing.net
Lists: pgsql-hackers

Neil Conway wrote on Mon, 03.11.2003 at 08:07:
> A couple days ago, Manfred Spraul mentioned the posix_fadvise() API on
> -hackers:
>
> http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
>
> I'm working on making use of posix_fadvise() where appropriate. I can
> think of the following places where this would be useful:
>
> (1) As Manfred originally noted, when we advance to a new XLOG segment,
> we can use POSIX_FADV_DONTNEED to let the kernel know we won't be
> accessing the old WAL segment anymore. I've attached a quick kludge of a
> patch that implements this. I haven't done any benchmarking of it yet,
> though (comments or benchmark results are welcome).
>
> (2) ISTM that we can set POSIX_FADV_RANDOM for *all* indexes, since the
> vast majority of the accesses to them shouldn't be sequential. Are there
> any situations in which this assumption doesn't hold? (Perhaps B+-tree
> bulk loading, or CLUSTER?) Should this be done per-index-AM, or
> globally?

Perhaps we could do it just for the _leaf_ nodes; the root and intermediate
nodes are usually better kept in cache.
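
To make uses (1) and (2) concrete, here is a minimal sketch in C of the two
calls under discussion. The helper names are invented for illustration and
are not taken from Neil's patch. Passing offset 0 and length 0 asks for the
advice to cover the whole file.

```c
#define _XOPEN_SOURCE 600   /* request posix_fadvise() from <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from the patch): apply one advice value to the
 * whole file. offset = 0 with len = 0 means "from the start to end of file".
 * Returns 0 on success or an error number; posix_fadvise() reports errors
 * through its return value rather than through errno.
 */
int
advise_whole_file(int fd, int advice)
{
    return posix_fadvise(fd, (off_t) 0, (off_t) 0, advice);
}

/* Use (1): we are done with an old WAL segment, its cached pages may go. */
int
advise_wal_segment_done(int fd)
{
    return advise_whole_file(fd, POSIX_FADV_DONTNEED);
}

/* Use (2): index file, random access expected, readahead is of little use. */
int
advise_index_random(int fd)
{
    return advise_whole_file(fd, POSIX_FADV_RANDOM);
}
```

One caveat for the leaf-only idea: as far as I know, the Linux man page
documents the access-pattern hints (NORMAL, SEQUENTIAL, RANDOM) as affecting
the entire file rather than just the given range, so per-page advice of that
kind may not be expressible there.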

> (3) When doing VACUUM, ANALYZE, or large sequential scans (for some
> reasonable definition of "large"), we can use POSIX_FADV_SEQUENTIAL.

Perhaps just sequential scans, without the "large" qualifier?

> (4) Various other components, such as tuplestore, tuplesort, and any
> utility commands that need to scan through an entire user relation for
> some reason. Once we've got the APIs for doing this worked out, it
> should be relatively easy to add other uses of posix_fadvise().
>
> (5) I'm hesitant to make use of POSIX_FADV_DONTNEED in VACUUM, as has
> been suggested elsewhere. The problem is that it's all-or-nothing: if
> the VACUUM happens to look at hot pages, these will be flushed from the
> page cache, so the net result may be a loss.

True. POSIX_FADV_DONTNEED should only be used if the page was read in
by VACUUM itself.

> So what API is desirable for uses 2-4? I'm thinking of adding a new
> function to the smgr API, smgradvise(). Given a Relation and an advice,
> this would:
>
> (a) propagate the advice for this relation to all the open FDs for the
> relation
>
> (b) store the new advice somewhere so that new FDs for the relation can
> have this advice set for them: clients should just be able to call
> smgradvise() without needing to worry if someone else has already called
> smgropen() for the relation in the past. One problem is how to store
> this: I don't think it can be a field of RelationData, since that is
> transient. Any suggestions?

Also, you may want to restore the old FADV_* setting after you are done:
running just one seqscan should probably not leave the relation in
POSIX_FADV_SEQUENTIAL mode forever.
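
A rough sketch of what "restore after you are done" could look like, written
against a bare file descriptor rather than the smgr layer. The function name
is hypothetical. Since POSIX offers no call to read back the advice currently
in effect, the sketch simply resets to POSIX_FADV_NORMAL; restoring some
other previous value would require the caller to remember it.

```c
#define _XOPEN_SOURCE 600   /* request posix_fadvise() from <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch: hint sequential access for the duration of one scan,
 * then put the descriptor back to the default policy.
 * Reads from the current file position to end of file.
 * Returns the number of bytes read, or -1 on a read error.
 */
long
scan_with_sequential_hint(int fd)
{
    char    buf[8192];      /* 8 kB, PostgreSQL's default block size */
    long    total = 0;
    ssize_t n;

    /* The advice is only a hint, so a failure here is ignored. */
    (void) posix_fadvise(fd, (off_t) 0, (off_t) 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;

    /* Undo the hint so later users of this fd get default readahead. */
    (void) posix_fadvise(fd, (off_t) 0, (off_t) 0, POSIX_FADV_NORMAL);

    return (n < 0) ? -1 : total;
}
```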

> Note that I'm assuming that we don't need to set advice on sub-sections
> of a relation, although the posix_fadvise() API allows it -- does anyone
> think that would be useful?
>
> One potential issue is that when one process calls posix_fadvise() on a
> particular FD, I'd expect that other processes accessing the same file
> will be affected. For example, enabling FADV_SEQUENTIAL while we're
> vacuuming a relation will mean that another client doing a concurrent
> SELECT on the relation will see different readahead behavior. That
> doesn't seem like a major problem though.
>
> BTW, posix_fadvise() is currently only supported on Linux 2.6 w/ a
> recent version of glibc (BSD hackers, if you're listening,
> posix_fadvise() would be a very cool thing to have :P). So we'll need to
> do the appropriate configure magic to ensure we only use it where it's
> available. Thankfully, it is a POSIX standard, so I would expect that in
> the years to come it will be available on more platforms.
>
> Any comments would be welcome.
>
> -Neil
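
On the portability point, the "configure magic" could end up as a thin
wrapper along these lines. This is only a sketch: the wrapper name is made
up, and HAVE_POSIX_FADVISE is assumed to be defined by a configure-generated
header (it is the name autoconf's AC_CHECK_FUNCS would produce for this
function). It is deliberately not defined in the snippet itself.

```c
#define _XOPEN_SOURCE 600   /* request posix_fadvise() from <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical wrapper: forward to posix_fadvise() where configure found
 * it, and compile to a no-op everywhere else. Advice is optional, so
 * doing nothing is always a correct fallback.
 */
int
file_advise(int fd, off_t offset, off_t len, int advice)
{
#ifdef HAVE_POSIX_FADVISE
    return posix_fadvise(fd, offset, len, advice);
#else
    (void) fd;
    (void) offset;
    (void) len;
    (void) advice;
    return 0;
#endif
}
```

Callers would still need their own guard or substitute values for the
POSIX_FADV_* constants on platforms where <fcntl.h> does not define them.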
