Re: [GENERAL] medical image on postgreSQL?

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] medical image on postgreSQL?
Date: 2003-04-11 23:43:56
Message-ID: 20030411234356.GR79923@perrin.int.nxad.com
Lists: pgsql-general pgsql-hackers

> > Other zero copy socket operations are mmap() + write(), but last I
> > heard, that was a FreeBSD only thing... for now. man 2 sendfile
>
> There's a lot of resistance to optimizing mmap+write in the
> Linux camp. It isn't just a matter of time; the developers there
> actively think this is a bad idea. In fact the code has been written
> several times and is never accepted. They think developers should be
> encouraged to use sendfile and the common code path for write
> shouldn't be wasting cycles checking for special cases in the page
> table.

Well, I won't go into how well/poorly Linux's VM is written... that
said, I suppose I sympathize with the developers in the Linux camp
that want to avoid this issue... this isn't easy to do
right/elegantly and it took BSD quite a while to get right, iirc.

> Note that there are some protocol requirements for sendfile to be
> feasible. There must be zero alterations made to the data in
> flight. No escaping, decompression, etc. And there must be no
> cases where the program would want to stop transmitting partway
> through. I think you can send a portion of a file but you would have
> to know the size of the chunk up front and the best performance
> would be if the chunk is very large.

I can speak from personal experience under huge loads (60K+
connections to a single webserver) that for small files, it is
advantageous to use mmap() + write() instead of sendfile().
sendfile() has a pretty funky API that isn't the cleanest out there:
it requires a small state machine per file being sent and is more
complex for nonblocking IO. But it's still better. As for
performance, mmap() + write() is _faster_ than sendfile() for small
files that can be cached by the filesystem cache layer. What's odd,
however, is that I found it only marginally faster (1-3ms?), and I'm
not convinced that the speedup wasn't from sending data from the
local box (mmap()) instead of being pulled over NFS (sendfile()).
sendfile() is pretty slick and I'd recommend its use anywhere over
read() + write().
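
For illustration only, here is a minimal sketch of the two approaches
compared above, written in Python for brevity rather than C. The
function names send_with_mmap and send_with_sendfile are hypothetical,
the sockets are assumed to be blocking, and nothing here is the code
that was actually benchmarked. Note that the sendfile() variant has to
know its chunk size up front and has to track how much has gone out so
far, which is the small per-file state mentioned above.

```python
import mmap
import os
import socket


def send_with_mmap(sock, path):
    """mmap() + write(): map the file and hand the mapping to the socket."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:  # an empty file cannot be mapped
            return 0
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The mapping is sent directly; no read() into a user buffer.
            sock.sendall(mm)
        return size


def send_with_sendfile(sock, path, offset=0, count=None):
    """sendfile(): send `count` bytes of the file starting at `offset`."""
    with open(path, "rb") as f:
        if count is None:
            count = os.fstat(f.fileno()).st_size - offset
        sent_total = 0
        while sent_total < count:
            # sendfile() may send less than requested, so keep track of
            # progress and resume from where the last call stopped.
            sent = os.sendfile(sock.fileno(), f.fileno(),
                               offset + sent_total, count - sent_total)
            if sent == 0:  # reached end of file early
                break
            sent_total += sent
        return sent_total
```

With a nonblocking socket the sendfile() loop could not simply spin
like this; the (offset, remaining) pair would have to be saved per
connection and picked up again when the socket becomes writable.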

FWIW, cache coherency isn't an issue for well-written VMs though
(*rub*). The data can change under sendfile()'s feet and that's okay;
BSD handles this correctly (never mind that MVCC prevents this from
being a problem). Writing data to a page that's mmap()'ed is also
sync'ed, and cache coherency isn't an issue as long as the page is
shared and sync'ed with disk periodically.
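
A small sketch of the shared-mapping behaviour described above, again
in Python and again only illustrative. The helper name
write_via_mmap_then_read is hypothetical. It assumes a system with a
unified page cache (as on current BSDs and Linux), where a store
through a MAP_SHARED mapping is visible to an ordinary read of the
same file without an explicit msync(); that visibility is an
assumption about the platform, not something guaranteed everywhere.

```python
import mmap
import os


def write_via_mmap_then_read(path, offset, data):
    """Modify a file through a shared mapping, then read it back with pread()."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, 0, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as mm:
            # Store into the shared page; deliberately no flush()/msync().
            mm[offset:offset + len(data)] = data
            # Read the same bytes back through the normal file I/O path.
            return os.pread(fd, len(data), offset)
    finally:
        os.close(fd)
```

Whether and when those bytes reach the disk is a separate question;
the sketch only shows that both access paths see the same page.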

-sc

--
Sean Chittenden
