From: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
---|---|
To: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> |
Cc: | PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: Fix checkpoint skip logic on idle systems by tracking LSN progress |
Date: | 2016-09-28 07:32:40 |
Message-ID: | CAB7nPqT2m1aq2QRODVzWeatYyESCV0pKuEpzD3QOOowc8gH+EA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Sep 27, 2016 at 7:16 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I apologize in advance that the comments in this message might
> one of the ideas discarded in the past thread.. I might not grasp
> the discussion completely X(
No problem.
> At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA(at)mail(dot)gmail(dot)com>
>
>> A couple of months back is has been reported to pgsql-bugs that WAL
>> segments were always switched with a low value of archive_timeout even
>> if a system is completely idle:
>> http://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
>> In short, a closer look at the problem has showed up that the logic in
>> charge of checking if a checkpoint should be skipped or not is
>> currently broken, because it completely ignores standby snapshots in
>> its calculation of the WAL activity. So a checkpoint always occurs
>> after checkpoint_timeout on an idle system since hot_standby has been
>> introduced as wal_level. This did not get better from 9.4, since
>> standby snapshots are logged every 15s by the background writer
>> process. In 9.6, since wal_level = 'archive' and 'hot_standby'
>> actually has the same meaning, the skip logic that worked with
>> wal_level = 'archive' does not do its job anymore.
>>
>> One solution that has been discussed is to track the progress of WAL
>> activity when doing record insertion by being able to mark some
>> records as not updating the progress of WAL. Standby snapshot records
>> enter in this category, making the checkpoint skip logic more robust.
>> Attached is a patch implementing a solution for it, by adding in
>
> If I understand the old thread correctly, this still doesn't
> solve the main issue, excessive checkpoint and segment
> switching. The reason is the interaction between XLOG_SWITCH and
> checkpoint as mentioned there. I think we may treat XLOG_SWITCH
> as NO_PROGRESS, since we can recover to the lastest state without
> it. If it is not wrong, the second patch attached (v12-2) inserts
> XLOG_SWITCH as NO_PROGRESS and skips segment switching when no
> progress took place for the round.
Possibly. That's a second problem I did not want to tackle now. I was
going to study that more precisely after this set of patches gets
done. There is already enough complication in them, and they solve a
large portion of the problem.
>> WALInsertLock a new field that gets updated for each record to track
>> the LSN progress. This allows to reliably skip the generation of
>> standby snapshots in the bgwriter or checkpoints on an idle system.
>
> WALInsertLock doesn't seem to me to be a good place for
> progressAt even considering the discussion about adding few
> instructions (containing a branch) in the
> hot-section. BackgroundWriterMain blocks all running
> XLogInsertRecord every 200 ms, not 15 or 30 seconds (only for
> replica, though). If this is correct, the Amit's suggestion below
> will have more significance, and I rather agree with it. XLogCtl
> is more suitable place for progressAt for the case.
Based on my past look at the problem and memories, having a variable
in WALInsertLock allows use to not have to touch the hottest spinlock
code path in WAL insertion and PG: ReserveXLogInsertLocation(). I'd
rather still avoid that.
>> > Also, I think it is worth to once take the performance data for
>> > write tests (using pgbench 30 minute run or some other way) with
>> > minimum checkpoint_timeout (i.e 30s) to see if the additional locking
>> > has any impact on performance. I think taking locks at intervals
>> > of 15s or 30s should not matter much, but it is better to be safe.
>>
>> I don't think performance will be impacted, but there are no reasons
>> to not do any measurements either. I'll try to get some graphs
>> tomorrow with runs on my laptop, mainly looking for any effects of
>> this patch on TPS when checkpoints show up.
>
> I don't think the impact is measurable on a laptop, where 2 to 4
> cores available at most.
Yeah, I couldn't either.. Still I would like to keep the hot spinlock
section as small as possible if we can.
>> Per discussion with Andres at PGcon, we decided that this is an
>> optimization, only for 9.7~ because this has been broken for a long
>> time. I have also changed XLogIncludeOrigin() to use a more generic
>> routine to set of status flags for a record being inserted:
>> XLogSetFlags(). This routine can use two flags:
>> - INCLUDE_ORIGIN to decide if the origin should be logged or not
>> - NO_PROGRESS to decide at insertion if a record should update the LSN
>> progress or not.
>> Andres mentioned me that we'd want to have something similar to
>> XLogIncludeOrigin, but while hacking I noticed that grouping both
>> things under the same umbrella made more sense.
>
> This looks reasonable.
Thanks. That would be fine as a first, independent patch, but that
would mean that XLogSetFlags has only only value, which is a bit
pointless so I grouped them. And this makes the exiting interface
cleaner as well for replication origins.
--
Michael
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2016-09-28 07:35:52 | Re: Fix checkpoint skip logic on idle systems by tracking LSN progress |
Previous Message | Michael Paquier | 2016-09-28 06:54:28 | Re: Tracking wait event for latches |