From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2016-01-07 15:05:32
Message-ID: alpine.DEB.2.10.1601071533020.5278@sto
Lists: pgsql-hackers
Hello Andres,
>> I thought of adding a pointer to the current flush structure at the vfd
>> level, so that on closing a file with a flush in progress the flush can be
>> done and the structure properly cleaned up, hence later the checkpointer
>> would see a clean thing and be able to skip it instead of generating flushes
>> on a closed file or on a different file...
>>
>> Maybe I'm missing something, but that is the plan I had in mind.
>
> That might work, although it'd not be pretty (not fatally so
> though).
Alas, any solution has to communicate somehow between the API levels, so 
it cannot be "pretty"; still, we should avoid the worst.
> But I'm inclined to go a different way: I think it's a mistake to do
> flushing based on a single file. It seems better to track a fixed number
> of outstanding 'block flushes', independent of the file. Whenever the
> number of outstanding blocks is exceeded, sort that list, and flush all
> outstanding flush requests after merging neighbouring flushes.
Hmmm. I'm not sure I understand your strategy.
I do not think that flushing without prior sorting would be effective: 
there is no reason why buffers written together would end up next to each 
other on disk, so we would get no sequential write benefit, just flushed 
random I/O. I tested that and it performed badly.
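
To illustrate the point, here is a minimal sketch of the kind of sort 
involved; BufWrite and its fields are invented names for illustration, 
not the patch's actual structures:

#include <stdlib.h>

/* Illustrative only: one record per buffer written by the checkpointer. */
typedef struct BufWrite
{
    int  file_id;   /* which segment file the buffer belongs to */
    long offset;    /* block offset within that file */
} BufWrite;

/* Sort by file first, then by offset within the file, so that writes
 * to the same file end up adjacent and in ascending order. */
static int
bufwrite_cmp(const void *a, const void *b)
{
    const BufWrite *x = a;
    const BufWrite *y = b;

    if (x->file_id != y->file_id)
        return (x->file_id < y->file_id) ? -1 : 1;
    if (x->offset != y->offset)
        return (x->offset < y->offset) ? -1 : 1;
    return 0;
}

/* usage: qsort(writes, nwrites, sizeof(BufWrite), bufwrite_cmp); */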
One point of aggregating flushes is that each range flush call has a 
significant cost, as shown by preliminary tests I posted earlier in the 
thread, so it makes sense to limit how many calls are made, hence the 
aggregation. It removed some performance regressions I was seeing in some 
cases.
Also, the granularity of the buffer flush call is a file plus an offset 
and a size, so it necessarily has to be organized this way (i.e. per file).
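
For instance, on Linux the range flush call is sync_file_range(), whose 
arguments are exactly a file descriptor, a byte offset and a length. A 
sketch of a per-file flush helper (flush_range is a made-up name) might be:

/* Linux-specific sketch: sync_file_range() is the asynchronous range
 * flush call; it takes a file descriptor, a byte offset and a length,
 * which is why flushing is naturally organized per file. */
#define _GNU_SOURCE
#include <fcntl.h>

static void
flush_range(int fd, long first_block, long nblocks, int block_size)
{
    (void) sync_file_range(fd,
                           (off64_t) first_block * block_size,
                           (off64_t) nblocks * block_size,
                           SYNC_FILE_RANGE_WRITE);
}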
Once buffers are sorted by file, and by offset within each file, 
consecutive writes are as close to one another as possible, so the merging 
is trivial to compute on the fly (there is no need to keep a list of 
buffers, for instance) and is optimally effective. Moreover, once the 
checkpointer moves past a file it never returns to it before the next 
checkpoint, so there is no reason not to flush right then.
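
As a rough sketch of that on-the-fly merging, assuming sorted input and 
reusing the hypothetical flush_range() helper from above (PendingFlush and 
note_written_block are again invented names):

/* Only the current pending range is kept, never a list of buffers,
 * so the merge is computed on the fly as sorted writes stream past. */
typedef struct PendingFlush
{
    int  fd;     /* file of the pending range, -1 when empty */
    long first;  /* first block of the pending range */
    long count;  /* number of contiguous blocks accumulated so far */
} PendingFlush;

static void
note_written_block(PendingFlush *p, int fd, long block, int block_size)
{
    /* Same file and exactly contiguous: just extend the pending range. */
    if (p->fd == fd && block == p->first + p->count)
    {
        p->count++;
        return;
    }

    /* Otherwise issue the pending flush and start a new range. */
    if (p->fd >= 0)
        flush_range(p->fd, p->first, p->count, block_size);

    p->fd = fd;
    p->first = block;
    p->count = 1;
}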
So basically I do not see a clear advantage to your suggestion, especially 
once the checkpointer's own scheduling is taken into account:
In effect the checkpointer already works in small bursts of activity 
between sleep phases, writing buffers a few at a time, so it may already 
behave more or less as you expect, though not for the same reason.
The closest strategy I experimented with, which is perhaps close to your 
suggestion, was to enforce a minimum number of buffers to write per wakeup 
and to vary the sleep delay in between, but I had no principled way to 
choose the values, and my experiments showed no significant performance 
impact from varying these parameters, so I left it out. If you find a 
magic number of buffers that gives consistently better performance, fine 
with me, but that is independent of whether aggregation happens before or 
after.
> Imo that means that we'd better track writes on a relfilenode + block
> number level.
I do not think that it is a better option. Moreover, the current approach 
has proven very effective over hundreds of runs, so redoing it differently 
for its own sake does not look like a good allocation of resources.
--
Fabien.