WAL and commit_delay

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: WAL and commit_delay
Date: 2001-02-17 18:05:53
Message-ID: 200102171805.NAA24180@candle.pha.pa.us
Lists: pgsql-hackers

I want to give some background on commit_delay, its initial purpose, and
possible options.

First, looking at the process that happens during a commit:

write() - copy WAL dirty page to kernel disk buffer
fsync() - force WAL kernel disk buffer to disk platter

fsync() takes much longer than write().
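
To see the gap on a given machine, here is a minimal sketch in Python that times one write() followed by one fsync() on a scratch file. The function name time_commit and the 8192-byte record are my own assumptions for illustration, not anything from the WAL code, and the numbers will vary widely with the disk and filesystem.

```python
import os
import tempfile
import time
from typing import Tuple


def time_commit(record: bytes) -> Tuple[float, float]:
    """Return (write_seconds, fsync_seconds) for one simulated commit.

    Hypothetical helper: it only mimics the write()/fsync() pair,
    it does not reproduce PostgreSQL's WAL format or code path.
    """
    fd, path = tempfile.mkstemp(prefix="wal_sketch_")
    try:
        t0 = time.perf_counter()
        os.write(fd, record)   # copy the record into the kernel buffer
        t1 = time.perf_counter()
        os.fsync(fd)           # force the kernel buffer out to the disk
        t2 = time.perf_counter()
    finally:
        os.close(fd)
        os.unlink(path)
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    w, f = time_commit(b"x" * 8192)
    print(f"write: {w * 1e6:.0f} us, fsync: {f * 1e6:.0f} us")
```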

What Vadim doesn't want is:

time  backend 1     backend 2
----  ---------     ---------
0     write()
1     fsync()       write()
2                   fsync()

This would be better as:

time  backend 1     backend 2
----  ---------     ---------
0     write()
1                   write()
2     fsync()       fsync()

This was the purpose of the commit_delay. Having two fsync()'s is not a
problem because only one will see there are dirty buffers. The other
will probably either return right away, or wait for the other's fsync()
to complete.

With the delay, it looks like:

time  backend 1     backend 2
----  ---------     ---------
0     write()
1     sleep()       write()
2     fsync()       sleep()
3                   fsync()

This shows the second fsync() doing nothing, which is good, because
there are no dirty buffers at that time. However, a very possible
circumstance is:

time  backend 1     backend 2     backend 3
----  ---------     ---------     ---------
0     write()
1     sleep()       write()
2     fsync()       sleep()       write()
3                   fsync()       sleep()
4                                 fsync()

In this case, the fsync() by backend 2 does indeed do some work because
it fsyncs backend 3's write(). Frankly, I don't see how the sleep does
much except delay things because it doesn't have any smarts about when
the delay is useful, and when it is useless. Without that feedback, I
recommend removing the entire setting. For single backends, the sleep
is clearly a loser.

Another situation it cannot deal with is:

time  backend 1     backend 2
----  ---------     ---------
0     write()
1     sleep()
2     fsync()       write()
3                   sleep()
4                   fsync()

My solution can't deal with this either.
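
The timelines above can be replayed with a small tick-based model. This is only a sketch under my own assumptions: time advances in whole ticks, a backend fsyncs one tick after its write() plus the sleep, and an fsync() at tick t flushes only writes issued before t. The function name working_fsyncs is hypothetical. It counts how many fsync() calls find dirty data, which is the number the delay is supposed to reduce.

```python
def working_fsyncs(write_ticks, delay):
    """Count fsync() calls that find dirty WAL data to flush.

    write_ticks: tick at which each backend issues its write().
    delay: ticks each backend sleeps between write() and fsync().
    Assumed model (not PostgreSQL code): an fsync() at tick t
    flushes every pending write issued strictly before t.
    """
    # Each backend fsyncs one tick after its write(), plus the sleep.
    fsync_ticks = sorted(t + 1 + delay for t in write_ticks)
    pending = sorted(write_ticks)  # writes not yet forced to disk
    working = 0
    for f in fsync_ticks:
        if any(w < f for w in pending):
            working += 1
            # Everything written before this fsync() is now clean.
            pending = [w for w in pending if w >= f]
    return working


if __name__ == "__main__":
    print(working_fsyncs([0, 1], delay=0))     # two backends, no sleep
    print(working_fsyncs([0, 1], delay=1))     # two backends, with sleep
    print(working_fsyncs([0, 1, 2], delay=1))  # three backends, with sleep
    print(working_fsyncs([0, 2], delay=1))     # writes two ticks apart
```

Under this model the sleep saves an fsync() only in the second case. With three staggered backends, or with writes two ticks apart, two fsync()'s still do real work and every commit is one tick later.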

---------------------------------------------------------------------------

The quick fix is to remove the commit_delay code. A more elaborate
performance boost would be to have each backend get feedback from
other backends, so they can block and wait for other about-to-fsync
backends before fsync(). This allows the write()'s to bunch up before
the fsync().

Here is the single backend case, which experiences no delays:

time  backend 1     backend 2
----  ---------     ---------
0     get_shlock()
1     write()
2     rel_shlock()
3     get_exlock()
4     rel_exlock()
5     fsync()

Here is the two-backend case, which shows both write()'s completing
before the fsync()'s:

time  backend 1     backend 2
----  ---------     ---------
0     get_shlock()
1     write()
2     rel_shlock()  get_shlock()
3     get_exlock()  write()
4                   rel_shlock()
5     rel_exlock()
6     fsync()       get_exlock()
7                   rel_exlock()
8                   fsync()
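
The lock sequence in these timelines could be sketched as follows, using Python threads to stand in for backends. Everything here is an assumption for illustration: the SharedExclusiveLock class is a hypothetical stand-in for the existing lock code, the method names are just the ones used in the timelines, and the "WAL" is a scratch file opened in append mode. The point is only that the brief exclusive lock acts as a barrier: it cannot be granted while any other backend still holds the shared lock around its write().

```python
import os
import tempfile
import threading


class SharedExclusiveLock:
    """Minimal shared/exclusive lock (hypothetical, not PostgreSQL's)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0
        self._exclusive = False

    def get_shlock(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._sharers += 1

    def rel_shlock(self):
        with self._cond:
            self._sharers -= 1
            self._cond.notify_all()

    def get_exlock(self):
        with self._cond:
            # Blocks while any backend is still inside its write().
            while self._exclusive or self._sharers:
                self._cond.wait()
            self._exclusive = True

    def rel_exlock(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()


def commit(fd, lock, record):
    """One backend's commit, in the order shown in the timelines."""
    lock.get_shlock()
    os.write(fd, record)
    lock.rel_shlock()
    lock.get_exlock()   # wait for in-flight write()'s to finish
    lock.rel_exlock()
    os.fsync(fd)        # now covers every write() that got in before


def run_backends(n):
    """Run n concurrent commits; return the records found in the file."""
    tmp_fd, path = tempfile.mkstemp(prefix="wal_sketch_")
    os.close(tmp_fd)
    fd = os.open(path, os.O_WRONLY | os.O_APPEND)
    lock = SharedExclusiveLock()
    try:
        threads = [
            threading.Thread(target=commit, args=(fd, lock, b"commit %d\n" % i))
            for i in range(n)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with open(path, "rb") as f:
            return sorted(f.read().splitlines())
    finally:
        os.close(fd)
        os.unlink(path)
```

A single backend never waits here, since nobody else holds the shared lock when it asks for the exclusive one. That matches the single-backend timeline above.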

Contrast that with the first two-backend case presented above:

time  backend 1     backend 2
----  ---------     ---------
0     write()
1     fsync()       write()
2                   fsync()

Now, it is my understanding that instead of just shared locking around
the write()'s, we could block the entire commit code, so the backend can
signal to other about-to-fsync backends to wait.

I believe our existing lock code can be used for the locking/unlocking.
We can just lock a random, unused table like pg_log or something.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
