Re: [HACKERS] TODO item

From: wieck(at)debis(dot)com (Jan Wieck)
To: PostgreSQL HACKERS <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: [HACKERS] TODO item
Date: 2000-02-08 12:01:29
Message-ID: m12I9Kj-0003kGC@orion.SAPserv.Hamburg.dsh.de
Lists: pgsql-hackers

> I see where you're going, and you could possibly make it work, but
> there are a bunch of problems. One objection is that kernel FDs
> are a very finite resource on a lot of platforms --- you don't really
> want to tie up one FD for every dirty buffer, and you *certainly*
> don't want to get into a situation where you can't release kernel
> FDs until end of xact. You might be able to get around that by
> associating the fsync-needed bit with VFDs instead of FDs.

This reminds me of the usefulness of some kind of tablespace
storage manager. It might not save us a single byte on
disk, and might even cost us some extra. But it would save
file descriptors.

And if this storage manager worked with some number of
preallocated blocks, it would be perfectly happy with an
fdatasync() instead of an fsync(). Some per-tablespace
configurable options like initial number of blocks, next
extent size and percentage increase would be fine.
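
To make those three options concrete, here is a minimal sketch in C
of how extent sizes could be derived from them. The struct and
function names (TablespaceOptions, extent_blocks, total_blocks) are
hypothetical and exist nowhere in the backend; the growth rule
(each extent is the previous one plus pct_increase percent) is my
assumption about what "percentage increase" would mean.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-tablespace options, named after the list above. */
typedef struct TablespaceOptions
{
    uint32_t initial_blocks;  /* blocks preallocated at creation time */
    uint32_t next_extent;     /* size of the first extension, in blocks */
    uint32_t pct_increase;    /* growth of each further extent, in percent */
} TablespaceOptions;

/*
 * Size in blocks of the n-th extension (n = 0 is the first one added
 * after the initial allocation). Assumed rule: every extent is the
 * previous one plus pct_increase percent, using integer arithmetic.
 */
static uint64_t
extent_blocks(const TablespaceOptions *opts, unsigned n)
{
    uint64_t size;
    unsigned i;

    assert(opts != NULL);
    size = opts->next_extent;
    for (i = 0; i < n; i++)
        size += size * opts->pct_increase / 100;
    return size;
}

/* Total number of blocks allocated after n_extents extensions. */
static uint64_t
total_blocks(const TablespaceOptions *opts, unsigned n_extents)
{
    uint64_t total;
    unsigned i;

    assert(opts != NULL);
    total = opts->initial_blocks;
    for (i = 0; i < n_extents; i++)
        total += extent_blocks(opts, i);
    return total;
}
```

With initial_blocks = 1000, next_extent = 100 and pct_increase = 50,
the first three extensions would be 100, 150 and 225 blocks.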

Before someone asks: the difference between fdatasync() and
fsync() is that the former only forces modified data blocks
to be flushed to disk. fsync() causes the inode to be
flushed too, because at the very least it has a new modtime.
In our case, where writes to files can cause block
allocations, it is a requirement to flush the inode on
modifications. But when dealing with a file whose blocks are
already allocated (no null faking, i.e. holes, and no writes
beyond EOF), it is not that important. The only difference
you might see after a crash is a slightly different last
modification time, and that really doesn't count.
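
A minimal sketch of that pattern in C, assuming a POSIX system that
provides pwrite() and fdatasync(). The helper names prealloc_blocks
and overwrite_block and the 8 kB BLOCK_SIZE are made up for
illustration. The file is filled with real zero blocks and
fsync()'ed once, so the inode and block map are on disk; after that,
overwrites inside the preallocated range only need fdatasync().

```c
/* Ask for POSIX.1-2008 declarations (pwrite, fdatasync) under strict modes. */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192         /* assumed block size for this sketch */

/*
 * Preallocate nblocks by writing real zero-filled blocks (no holes),
 * then fsync() once so the file size and block allocations are durable.
 * Returns 0 on success, -1 on error.
 */
static int
prealloc_blocks(int fd, unsigned nblocks)
{
    char     zeros[BLOCK_SIZE];
    unsigned i;

    memset(zeros, 0, sizeof zeros);
    for (i = 0; i < nblocks; i++)
    {
        if (pwrite(fd, zeros, BLOCK_SIZE, (off_t) i * BLOCK_SIZE) != BLOCK_SIZE)
            return -1;
    }
    return fsync(fd);
}

/*
 * Overwrite one block that is already allocated and flush the data only.
 * Writes at or beyond the preallocated range are refused, because they
 * would extend the file and make the inode flush matter again.
 * Returns 0 on success, -1 on error.
 */
static int
overwrite_block(int fd, unsigned nblocks, unsigned blockno, const char *page)
{
    assert(page != NULL);
    if (blockno >= nblocks)
        return -1;
    if (pwrite(fd, page, BLOCK_SIZE, (off_t) blockno * BLOCK_SIZE) != BLOCK_SIZE)
        return -1;
    return fdatasync(fd);
}
```

A caller would run prealloc_blocks() once when the file is created or
extended, and overwrite_block() for every dirty page written
afterwards.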

The result of that difference is that a write()+fsync()
nearly always causes head seeks on the disk (unless the
inode and the dirty blocks are on the same cylinder). In
contrast, a series of write()+fdatasync() calls for one and
the same file, with all blocks close together, wouldn't. And
isn't that what our backends usually do?

With immediate SCSI error reporting enabled on the disks,
such a burst of write()+fdatasync() calls wouldn't have such
a big performance impact any more. In that case, the
fdatasync() call returns as soon as the flushed blocks have
reached the on-disk cache, without waiting until they are
burned into the surface.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck(at)debis(dot)com (Jan Wieck) #
