Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Andres Freund <andres@2ndquadrant.com>
To: Heikki Linnakangas <hlinnakangas@vmware.com>
Cc: KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>, PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-06-16 20:48:08
Message-ID: 20130616204808.GC17598@awork2.anarazel.de

On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:
> Another thought is that rather than trying to compensate for that effect in
> the checkpoint scheduler, could we avoid the sudden rush of full-page images
> in the first place? The current rule for when to write a full page image is
> conservative: you don't actually need to write a full page image when you
> modify a buffer that's sitting in the buffer cache, if that buffer hasn't
> been flushed to disk by the checkpointer yet, because the checkpointer will
> write and fsync it later. I'm not sure how much it would smoothen WAL write
> I/O, but it would be interesting to try.

Hm. Could you elaborate on why that wouldn't open new hazards? I don't
see how that could be safe against crashes in some code paths. It seems
to me we could end up replaying records like heap_insert onto a page
while that page is still torn?

> A long time ago, Itagaki wrote a patch to sort the checkpoint writes: https://www.postgresql.org/message-id/flat/20070614153758.6A62.ITAGAKI.TAKAHIRO@oss.ntt.co.jp.
> He posted very promising performance numbers, but it was dropped because Tom
> couldn't reproduce the numbers, and because sorting requires allocating a
> large array, which has the risk of running out of memory, which would be bad
> when you're trying to checkpoint.

Hm. We could allocate the array once at startup, since the number of
buffers doesn't change. Sure, sizing it for every buffer is pessimistic,
but that seems fine.

Alternatively, I can well imagine that it would still be beneficial to
sort the dirty buffers in chunks, i.e. scan shared_buffers until we have
found 50k dirty pages, sort those, and only then write them out.
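Something like this rough sketch, where buffer_is_dirty(), buffer_tag(),
write_buffer() and tag_cmp() are made-up placeholders, not the actual
bufmgr API:

/*
 * Rough sketch only.  BufferTag is the real struct from
 * storage/buf_internals.h; the "placeholder" helpers are invented for
 * illustration.
 */
#include <stdbool.h>
#include <stdlib.h>
#include "storage/buf_internals.h"      /* BufferTag */

#define SORT_CHUNK 50000

extern int  NBuffers;                               /* real global */
extern bool buffer_is_dirty(int buf_id);            /* placeholder */
extern BufferTag buffer_tag(int buf_id);            /* placeholder */
extern void write_buffer(int buf_id);               /* placeholder */
extern int  tag_cmp(const void *a, const void *b);  /* placeholder */

typedef struct DirtyEntry
{
    int         buf_id;
    BufferTag   tag;        /* sort key: relation, fork, block number */
} DirtyEntry;

/* allocated once at startup with NBuffers entries, so no OOM risk here */
static DirtyEntry *entries;

static void
write_dirty_buffers_chunked(void)
{
    int         n = 0;

    for (int buf_id = 0; buf_id < NBuffers; buf_id++)
    {
        if (!buffer_is_dirty(buf_id))
            continue;

        entries[n].buf_id = buf_id;
        entries[n].tag = buffer_tag(buf_id);

        if (++n == SORT_CHUNK)
        {
            /* sort the chunk so writes hit each relation sequentially */
            qsort(entries, n, sizeof(DirtyEntry), tag_cmp);
            for (int i = 0; i < n; i++)
                write_buffer(entries[i].buf_id);
            n = 0;
        }
    }

    /* write out the final partial chunk */
    qsort(entries, n, sizeof(DirtyEntry), tag_cmp);
    for (int i = 0; i < n; i++)
        write_buffer(entries[i].buf_id);
}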

> Apart from the direct performance impact of that patch, sorting the writes
> would allow us to interleave the fsyncs with the writes. You would write out
> all buffers for relation A, then fsync it, then all buffers for relation B,
> then fsync it, and so forth. That would naturally spread out the
> fsyncs.

I personally think that optionally forcing the pages to be written out
earlier (say, with sync_file_range()) to make the eventual fsync()
cheaper is likely to be better overall.
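Roughly like this; sync_file_range() is a Linux-only syscall, and the
split into two helpers is just for illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to start writeback of the whole file without waiting
 * for completion (offset 0, nbytes 0 means "through end of file").
 */
static void
preflush(int fd)
{
    (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}

/*
 * Called some time later; if the earlier writeback has completed by
 * then, the fsync() has little left to do and returns quickly.
 */
static void
cheap_fsync(int fd)
{
    (void) fsync(fd);
}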

> If we don't mind scanning the buffer cache several times, we don't
> necessarily even need to sort the writes for that. Just scan the buffer
> cache for all buffers belonging to relation A, then fsync it. Then scan the
> buffer cache again, for all buffers belonging to relation B, then fsync
> that, and so forth.

That would end up requiring quite a lot of scans on a reasonably sized
machine, not to mention one with a million+ relations. That doesn't seem
like a good idea for bigger shared_buffers settings. Cf. the work we did
for 9.3 to make dropping a bunch of relations at once cheaper by
scanning shared_buffers only once instead of once per relation.
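The shape of that 9.3 approach, again with placeholder helpers (the real
code lives in src/backend/storage/buffer/bufmgr.c):

extern int NBuffers;                                /* real global */
extern RelFileNode buffer_relfilenode(int buf_id);  /* placeholder */
extern void invalidate_buffer(int buf_id);          /* placeholder */

static void
drop_rels_one_pass(RelFileNode *rnodes, int nrels)
{
    /* touch every buffer header exactly once, however large nrels is */
    for (int buf_id = 0; buf_id < NBuffers; buf_id++)
    {
        RelFileNode node = buffer_relfilenode(buf_id);

        for (int i = 0; i < nrels; i++)
        {
            if (RelFileNodeEquals(node, rnodes[i]))
            {
                invalidate_buffer(buf_id);
                break;
            }
        }
    }
}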

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
