Re: Proposed LogWriter Scheme, WAS: Potential Large

From: Hannu Krosing <hannu(at)tm(dot)ee>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Curtis Faith <curtis(at)galtair(dot)com>, Pgsql-Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed LogWriter Scheme, WAS: Potential Large
Date: 2002-10-05 18:29:45
Message-ID: 1033842585.2681.35.camel@rh72.home.ee
Lists: pgsql-hackers

On Sat, 2002-10-05 at 20:32, Tom Lane wrote:
> Hannu Krosing <hannu(at)tm(dot)ee> writes:
> > The writer process should just issue a continuous stream of
> > aio_write()'s while there are any waiters and keep track which waiters
> > are safe to continue - thus no guessing of who's gonna commit.
>
> This recipe sounds like "eat I/O bandwidth whether we need it or not".
> It might be optimal in the case where activity is so heavy that we
> do actually need a WAL write on every disk revolution, but in any
> scenario where we're not maxing out the WAL disk's bandwidth, it will
> hurt performance. In particular, it would seriously degrade performance
> if the WAL file isn't on its own spindle but has to share bandwidth with
> data file access.
>
> What we really want, of course, is "write on every revolution where
> there's something worth writing" --- either we've filled a WAL block
> or there is a commit pending.

That's what I meant by "while there are any waiters".

> But that just gets us back into the
> same swamp of how-do-you-guess-whether-more-commits-will-arrive-soon.
> I don't see how an extra process makes that problem any easier.

I still think that we could get gang writes automatically, if we just
issue an aio_write() at the completion of each WAL page and keep track
of those that are written. We could also keep track of the write
position inside the WAL page for

1. the end of the last write() of each process

2. the WAL file's write position at each aio_write()

Then we can safely(?) assume that each backend only needs its own
write()s to be on disk before it can consider its transaction
committed. If the fsync()-like request comes in at a time when the
aio_write covering that process's last position has already completed,
we can let that process continue without even a context switch.
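
A rough sketch of the bookkeeping I have in mind (hypothetical names,
not actual PostgreSQL code; shared-memory access and locking are
omitted, and the WAL fd is assumed to be opened with O_DSYNC so that
aio completion means the data is on the platter):

    #include <aio.h>
    #include <string.h>

    #define MAX_BACKENDS 64

    typedef unsigned long XLogOff;  /* byte offset into the WAL stream */

    /* 1. end of the last write() of each process */
    static XLogOff backend_write_end[MAX_BACKENDS];

    /* 2. WAL write position covered by completed aio_write()s */
    static XLogOff wal_flushed_upto;

    /* Backend: remember how far our commit record extends. */
    void
    note_backend_write(int backend_id, XLogOff end_of_record)
    {
        backend_write_end[backend_id] = end_of_record;
    }

    /* Writer: queue an aio_write for a (possibly partial) WAL page. */
    int
    issue_page_write(int wal_fd, char *page, XLogOff page_start,
                     size_t page_len, struct aiocb *cb)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = wal_fd;
        cb->aio_buf = page;
        cb->aio_nbytes = page_len;
        cb->aio_offset = (off_t) page_start;
        return aio_write(cb);
    }

    /* Writer: called from the completion path (SIGEV notification or
     * polling aio_error()); advances the flushed-up-to pointer. */
    void
    page_write_completed(const struct aiocb *cb)
    {
        XLogOff end = (XLogOff) cb->aio_offset + cb->aio_nbytes;

        if (end > wal_flushed_upto)
            wal_flushed_upto = end;
    }

    /* Backend: the fsync()-like check -- if our last write is already
     * on disk we continue without a context switch. */
    int
    commit_is_durable(int backend_id)
    {
        return backend_write_end[backend_id] <= wal_flushed_upto;
    }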

In the above scenario I assume that the kernel can do the right thing
by satisfying multiple aio_write requests for the same page in one
sweep, rather than doing one physical write per aio_write.

> BTW, it would seem to me that aio_write() buys nothing over plain write()
> in terms of ability to gang writes. If we issue the write at time T
> and it completes at T+X, we really know nothing about exactly when in
> that interval the data was read out of our WAL buffers.

Yes, most likely. If we do several writes of the same pages, they will
hit the disk in the same physical write.

> We cannot
> assume that commit records that were stored into the WAL buffer during
> that interval got written to disk. The only safe assumption is that
> only records that were in the buffer at time T are down to disk; and
> that means that late arrivals lose.

I assume that if each commit record issues an aio_write, then all of
those whose data actually reached the disk will be notified.

IOW the first aio_write orders the write, but all the latecomers which
arrive before the actual write will also get written and notified.
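
Roughly, each committer would then wait on its own aiocb, something
like this (again a sketch with assumed names; link with -lrt on
Linux):

    #include <aio.h>
    #include <errno.h>

    /* Block until our own aio_write has completed, then check its
     * status; any other requests the kernel merged into the same
     * physical write complete (and are notified) at the same time. */
    int
    wait_for_my_commit(struct aiocb *my_cb)
    {
        const struct aiocb *list[1] = { my_cb };

        while (aio_error(my_cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);   /* sleep until it finishes */

        return (aio_return(my_cb) >= 0) ? 0 : -1;
    }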

> You can't issue aio_write
> immediately after the previous one completes and expect that this
> optimizes performance --- you have to delay it as long as you possibly
> can in hopes that more commit records arrive.

I guess we have quite different cases for different hardware
configurations. If we have a separate disk subsystem for WAL, we may
want to keep the log flowing to disk as fast as it is ready, including
writing the last, partial page as often as new writes land on it. As
we can't write more than ~250 times/sec (with 15K drives and no
battery-backed RAM), we will always have at least two context switches
between writes (for a 500Hz context-switch clock), and many more if
processes put themselves to sleep while waiting for small transactions
to commit.
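
(The arithmetic: a 15K drive makes 15000/60 = 250 revolutions per
second, and with no battery-backed cache it can complete at most one
WAL flush per revolution, hence the ~250 writes/sec ceiling; at 500
scheduler ticks per second that leaves at least 500/250 = 2 ticks
between consecutive writes.)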

> So it comes down to being the same problem.

Or its solution ;) since instead of predicting we just write all log
data that is ready to be written. If we postpone writing, there will
be hiccups when we suddenly discover that we need to write a whole lot
of pages (on fsync()) after having idled the disk for some period.

---------------
Hannu
