Re: Partitioned checkpointing

From: Takashi Horikawa <t-horikawa(at)aj(dot)jp(dot)nec(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Partitioned checkpointing
Date: 2015-09-12 04:28:47
Message-ID: 73FA3881462C614096F815F75628AFCD03558B78@BPXM01GP.gisp.nec.co.jp
Lists: pgsql-hackers

Hello Andres,
Thank you for the discussion. I am glad to be able to discuss this here.

> Partitioned checkpoint have the significant disadvantage that it increases
> random write io by the number of passes. Which is a bad idea,
> *especially* on SSDs.
I’m curious about the conclusion that partitioned checkpointing
increases random write I/O by the number of passes.

The order of buffer syncs in the original checkpointing is as follows.
b[0], b[1], …, b[N-1]
where b[i] means buffer[i] and N is the number of buffers.
(To keep things simple, I ignore here the starting point of the buffer sync,
which is determined by the statement ‘buf_id = StrategySyncStart(NULL, NULL);’
in BufferSync(). If it matters for this discussion, please point that out.)

In partitioned checkpointing, the order is as follows.
1st round
b[0], b[p], b[2p], …, b[(n-1)p]
2nd round
b[1], b[p+1], b[2p+1], …, b[(n-1)p+1]
…
last round
b[p-1], b[p+(p-1)], b[2p+(p-1)], …, b[(n-1)p+(p-1)]
where p is the number of partitions and n = N / p.

I think the important point here is that partitioned checkpointing does not
change (increase) the total number of buffer writes.
I wonder why the sequence b[0], b[1], …, b[N-1] is less random than
b[0], b[p], b[2p], …, b[(n-1)p]. I think there is no relationship
between neighboring buffers such as b[0] and b[1]. Is this wrong?
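
As a minimal sketch of the two orders described above (Python is used only
for illustration; the function names original_order and partitioned_order are
hypothetical and are not part of the patch or of PostgreSQL):

```python
def original_order(num_buffers):
    """Visit buffers in plain index order: b[0], b[1], ..., b[N-1]."""
    return list(range(num_buffers))


def partitioned_order(num_buffers, num_partitions):
    """Return one list of buffer indexes per round.

    Round r (0-based) visits b[r], b[p + r], b[2p + r], ...
    where p is the number of partitions.
    """
    return [
        list(range(r, num_buffers, num_partitions))
        for r in range(num_partitions)
    ]


# Illustrative values only; N is assumed to be a multiple of p,
# matching n = N / p in the description.
N, p = 12, 3
rounds = partitioned_order(N, p)
for r, ids in enumerate(rounds):
    print(f"round {r + 1}: {ids}")

# Every buffer is visited exactly once in both schemes;
# only the visiting order differs.
flat = [i for ids in rounds for i in ids]
assert sorted(flat) == original_order(N)
```

With N = 12 and p = 3, the rounds are [0, 3, 6, 9], [1, 4, 7, 10], and
[2, 5, 8, 11], so the total number of buffer visits stays at 12.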

Also, I believe that random ‘PAGE’ writes are not harmful for SSDs.
(Buffer sync is carried out in units of 8 KB pages.) What is harmful for SSDs
is a partial write (a write smaller than the PAGE size), because it increases
the write amplification of the SSD and thus shortens its lifetime. On the
other hand, IIRC, random ‘PAGE’ writes do not increase write
amplification. The wear-leveling algorithm of the SSD should handle
random ‘PAGE’ writes effectively.
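
The partial-write argument can be illustrated with a deliberately simplified
model. It assumes that the SSD always programs whole flash pages and it
ignores garbage collection and every other controller effect, so it is not a
description of any real device. The function name and the page sizes are
hypothetical.

```python
def write_amplification(host_write_bytes, flash_page_bytes):
    """Bytes programmed to flash divided by bytes written by the host.

    Simplifying assumption: each write is rounded up to whole flash pages;
    garbage collection and wear leveling are not modeled.
    """
    pages = -(-host_write_bytes // flash_page_bytes)  # ceiling division
    return pages * flash_page_bytes / host_write_bytes


# A full 8 KB write onto an assumed 8 KB flash page: no amplification.
print(write_amplification(8192, 8192))
# A partial 4 KB write onto the same assumed page size: amplified.
print(write_amplification(4096, 8192))
```

Under these assumptions a page-sized write gives a factor of 1.0 wherever it
lands, while a half-page write gives 2.0.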

> I think it's likely that the patch will have only a very small effect if
> applied ontop of Fabien's patch (which'll require some massaging I'm
> sure).
It may or may not be true. Who knows?
I think only detailed experiments will tell.

Best regards.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Andres Freund
> Sent: Saturday, September 12, 2015 1:30 AM
> To: Tomas Vondra
> Cc: pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Partitioned checkpointing
>
> Hi,
>
> Partitioned checkpoint have the significant disadvantage that it increases
> random write io by the number of passes. Which is a bad idea,
> *especially* on SSDs.
>
> > >So we'd need logic like this
> > >1. Run through shared buffers and analyze the files contained in there
> > >2. Assign files to one of N batches so we can make N roughly equal
> > >   sized mini-checkpoints
> > >3. Make N passes through shared buffers, writing out files assigned
> > >   to each batch as we go
>
> That's essentially what Fabien's sorting patch does by sorting all writes.
>
> > What I think might work better is actually keeping the write/fsync
> > phases we have now, but instead of postponing the fsyncs until the
> > next checkpoint we might spread them after the writes. So with
> > target=0.5 we'd do the writes in the first half, then the fsyncs in
> > the other half. Of course, we should sort the data like you propose,
> > and issue the fsyncs in the same order (so that the OS has time to
> > write them to the devices).
>
> I think the approach in Fabien's patch of enforcing that there's not very
> much dirty data to flush by forcing early cache flushes is better. Having
> gigabytes worth of dirty data in the OS page cache can have massive
> negative impact completely independent of fsyncs.
>
> > I wonder how much the original paper (written in 1996) is effectively
> > obsoleted by spread checkpoints, but the benchmark results posted by
> > Horikawa-san suggest there's a possible gain. But perhaps partitioning
> > the checkpoints is not the best approach?
>
> I think it's likely that the patch will have only a very small effect if
> applied ontop of Fabien's patch (which'll require some massaging I'm
> sure).
>
> Greetings,
>
> Andres Freund
