From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: checkpointer continuous flushing
Date: 2016-01-20 14:02:20
Message-ID: 20160120140220.iidxqnkx73k2ahd5@alap3.anarazel.de
Lists: pgsql-hackers
On 2016-01-20 11:13:26 +0100, Andres Freund wrote:
> On 2016-01-19 22:43:21 +0100, Andres Freund wrote:
> > On 2016-01-19 12:58:38 -0500, Robert Haas wrote:
> > I think the problem isn't really that it's flushing too much WAL in
> > total, it's that it's flushing WAL in too granular a fashion. I suspect
> > we want something where we attempt a minimum number of flushes per
> > second (presumably tied to wal_writer_delay) and, once exceeded, a
> > minimum number of pages per flush. I think we could even continue to
> > write() the data at the same rate as today; we would just need to reduce
> > the number of fdatasync()s we issue. And we could possibly make the
> > eventual fdatasync()s cheaper by hinting the kernel to write the data out
> > earlier.
> >
> > Now, the question of what minimum number of pages we want to flush
> > (setting wal_writer_delay-triggered flushes aside) isn't easy to answer. A
> > simple model would be to statically tie it to the size of wal_buffers;
> > say, don't flush unless at least 10% of XLogBuffers have been written
> > since the last flush. More complex approaches would be to measure the
> > continuous WAL writeout rate.
> >
> > By tying it to both a minimum rate under activity (ensuring things go to
> > disk fast) and a minimum number of pages to sync (ensuring a reasonable
> > number of cache flush operations), we should be able to mostly accommodate
> > the different types of workloads. I think.
>
> This unfortunately leaves out part of the reasoning for the above
> commit: We want WAL to be flushed fast, so we can immediately set hint
> bits.
>
> One, relatively extreme, approach would be to continue *writing* WAL in
> the background writer as today, but use rules like those suggested above
> to guide the actual flushing, additionally using operations like
> sync_file_range() (and equivalents on other OSs). Then, to address the
> regression of SetHintBits() having to bail out more often, actually
> trigger a WAL flush whenever WAL is already written, but not flushed. That
> has the potential to be bad in a number of other cases though :(
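To make that heuristic a bit more concrete, here's a rough sketch of the
flush decision I have in mind for the wal writer. This is a sketch only, not
actual code: the counters (pages_written_since_flush, ms_since_last_flush)
and the 10% threshold are made up; WalWriterDelay and XLOGbuffers are meant
as the variables behind wal_writer_delay and wal_buffers.

    /* sketch only: should the wal writer issue an fdatasync() this round? */
    static bool
    WalWriterShouldFlush(int pages_written_since_flush,
                         int ms_since_last_flush)
    {
        /* under activity, flush at least once per wal_writer_delay */
        if (ms_since_last_flush >= WalWriterDelay)
            return true;

        /*
         * Otherwise wait until enough WAL has accumulated, say 10% of
         * wal_buffers, so each cache flush covers a decent amount of data.
         */
        if (pages_written_since_flush >= XLOGbuffers / 10)
            return true;

        return false;
    }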
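The kernel hinting bit would, on Linux, look something like the following,
issued against the current WAL segment after write()ing but well before the
eventual fdatasync() (again a sketch; error handling and portability glue
elided, the helper name is made up):

    #define _GNU_SOURCE
    #include <fcntl.h>

    /*
     * Ask the kernel to start writeback of the given range of the WAL
     * segment without waiting for it; the later fdatasync() then has less
     * dirty data left to flush.
     */
    static void
    HintWALWriteback(int fd, off_t offset, off_t nbytes)
    {
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }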
Chatting on IM with Heikki, I noticed that we're pretty pessimistic in
SetHintBits(). Namely, we don't set the bit if XLogNeedsFlush(commitLSN),
because we can't easily set the LSN. But it's actually fairly common
that the page's LSN is already newer than the commitLSN - in which case
we, afaics, can just go ahead and set the hint bit, no?
So, instead of
    if (XLogNeedsFlush(commitLSN) && BufferIsPermanent(buffer))
        return;         /* not flushed yet, so don't set hint */
we do
    if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
        BufferGetLSNAtomic(buffer) < commitLSN)
        return;         /* not flushed yet, so don't set hint */
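For context, that check lives in SetHintBits() in tqual.c. Roughly, from
memory (details may differ), the surrounding function looks like this with
the change applied - the only difference is the additional
BufferGetLSNAtomic() comparison:

    static inline void
    SetHintBits(HeapTupleHeader tuple, Buffer buffer,
                uint16 infomask, TransactionId xid)
    {
        if (TransactionIdIsValid(xid))
        {
            /* NB: xid must be known committed here! */
            XLogRecPtr  commitLSN = TransactionIdGetCommitLSN(xid);

            if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
                BufferGetLSNAtomic(buffer) < commitLSN)
                return;     /* not flushed yet, so don't set hint */
        }

        tuple->t_infomask |= infomask;
        MarkBufferDirtyHint(buffer, true);
    }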
In my tests with pgbench -s 100 and 2GB of shared_buffers, that recovers
a large portion of the hint bit sets that we currently skip.
Right now, on my laptop, I get (-M prepared -c 32 -j 32):
current wal writer:                          12827 tps, 95% IO util, 93% CPU
no flushing in wal writer *:                 13185 tps, 46% IO util, 93% CPU
no flushing in wal writer & above change:    16366 tps, 41% IO util, 95% CPU
flushing in wal writer & above change:       14812 tps, 94% IO util, 95% CPU
* Sometimes the results were initially much lower, with lots of lock
contention; I can't figure out why that's only sometimes the case. In
those cases the results were more like 8967 tps.
These aren't meant as thorough benchmarks, just to provide some
orientation.
Now, that solution won't improve every situation - e.g. a workload that
inserts a lot of rows in one transaction, and only does inserts, probably
won't benefit all that much. But it still seems like a pretty good
mitigation strategy. I hope that with a smarter write strategy (getting
that 50% reduction in IO util) and the above change we'll be ok.
Andres