From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: AIO writes vs hint bits vs checksums
Date: 2024-11-01 18:10:54
Message-ID: xbxhku5bpfyzq4b3kng32dtwhjtq6e4cmfjxxiabo434p6wadi@to4dfk4i4mok
Lists: pgsql-hackers
Hi,
On 2024-10-30 09:58:30 -0400, Andres Freund wrote:
> On 2024-10-30 14:16:51 +0200, Heikki Linnakangas wrote:
> > Could we put the overhead on the FlushBuffer()
> > instead?
> >
> > How about something like this:
> >
> > To set hint bits:
> >
> > 1. Lock page in SHARED mode.
> > 2. Read BM_IO_IN_PROGRESS
> > 3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
> > 4. Unlock page
> >
> > To flush a buffer:
> >
> > 1. Lock page in SHARED mode
> > 2. Set BM_IO_IN_PROGRESS
> > 3. Read the lock count on the buffer lock, to see if we're the only locker.
> > 4. If anyone else is holding the lock, upgrade it to exclusive mode, and
> > immediately downgrade back to share mode.
> > 5. calculate CRC, flush the buffer
> > 6. Clear BM_IO_IN_PROGRESS and unlock page.
> >
> > This goes back to the idea of adding LWLock support for this, but the amount
> > of changes could be pretty small. The missing operation we don't have today
> > is reading the share-lock count on the lock in step 3. That seems simple to
> > add.
>
> I've played around with a bunch of ideas like this. There are two main
> reasons I didn't like them that much in the end:
>
> 1) The worst case latency impacts seemed to make them not that
> interesting. A buffer that is heavily contended with share locks might not
> get down to zero share lockers for quite a while. That's not a problem for
> individual FlushBuffer() calls, but it could very well add up to a decent
> sized delay for something like a checkpoint that has to flush a lot of
> buffers.
>
> Waiting for all pre-existing share lockers is easier said than done. We
> don't record the acquisition order anywhere and a share-lock release won't
> wake anybody if the lockcount doesn't reach zero. Changing that wouldn't
> exactly be free and the cost would be borne by all lwlock users.
>
> 2) They are based on holding an lwlock. But it's actually quite conceivable
> that we'd want to set something hint-bit-like without any lock, as we
> e.g. currently do for freespace/. That would work with something like the
> approach I chose here, but not if we rely on lwlocks.
>
> E.g. with a bit of work we could actually do sequential scans without
> blocking concurrent buffer modifications by carefully ordering when
> pd_lower is modified and making sure that tuple header fields are written
> in a sensible order.
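(For concreteness, the flush side of the quoted protocol might look roughly
like the sketch below. LWLockShareLockCount() stands in for the missing
share-lock-count primitive mentioned above, the BM_IO_IN_PROGRESS handling is
elided, and since lwlocks can't be upgraded in place, the upgrade/downgrade in
step 4 becomes a release/reacquire.)

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/lwlock.h"

static void
FlushBufferSketch(BufferDesc *buf)
{
	LWLock	   *content_lock = BufferDescriptorGetContentLock(buf);

	/* step 1: lock page in shared mode */
	LWLockAcquire(content_lock, LW_SHARED);

	/* step 2: set BM_IO_IN_PROGRESS under the buffer header lock (elided) */

	/*
	 * steps 3 & 4: if anybody else holds the content lock, they might be
	 * midway through setting hint bits - wait them out by cycling through
	 * exclusive mode; new lockers will see BM_IO_IN_PROGRESS and not set
	 * hints
	 */
	if (LWLockShareLockCount(content_lock) > 1) /* hypothetical primitive */
	{
		LWLockRelease(content_lock);
		LWLockAcquire(content_lock, LW_EXCLUSIVE);
		LWLockRelease(content_lock);
		LWLockAcquire(content_lock, LW_SHARED);
	}

	/* steps 5 & 6: checksum and write the page, clear BM_IO_IN_PROGRESS */

	LWLockRelease(content_lock);
}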
I still don't like this idea a whole lot - but perhaps we could reduce the
overhead of my proposal some, to get closer to yours. When setting hint bits
for many tuples on a page the overhead of my approach is negligible, but when
doing it for individual tuples it's a bit less negligible.
We can reduce the efficiency difference substantially by adding a bufmgr.c API
that sets hint bits on a page. That function can set the hint bit while holding the
buffer header lock, and therefore doesn't need to set BM_SETTING_HINTS and
thus also doesn't need to do resowner.c accounting.
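Roughly, such an API might look like the following sketch (the names are made
up, not from the actual patchset, and marking the buffer dirty is elided for
brevity):

#include "postgres.h"
#include "access/htup_details.h"
#include "storage/buf_internals.h"

void
BufferSetHintBits(Buffer buffer, HeapTupleHeader *tuples,
				  uint16 *infomasks, int ntuples)
{
	BufferDesc *buf = GetBufferDescriptor(buffer - 1);
	uint32		buf_state;

	buf_state = LockBufHdr(buf);

	/*
	 * IO can't start while the buffer header lock is held, so there's no
	 * need to set BM_SETTING_HINTS and thus no resowner.c accounting.  If
	 * a write is already in progress, just skip setting the hints.
	 */
	if (!(buf_state & BM_IO_IN_PROGRESS))
	{
		for (int i = 0; i < ntuples; i++)
			tuples[i]->t_infomask |= infomasks[i];
	}

	UnlockBufHdr(buf, buf_state);
}

With ntuples == 1 this covers the single-tuple case; passing many tuples of a
page at once corresponds to the "batch" optimization mentioned below.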
To see the worst case overhead, I
a) disabled the "batch" optimization
b) disabled checksums, as that would otherwise hide small efficiency
differences
c) used an unlogged table
and measured the performance difference for a previously-unhinted sequential
scan of a narrow table that immediately discards all tuples due to OFFSET -
afaict the worst case for the proposed new behaviour.
Previously this was 30.8% slower than master. Now it's only 1.9% slower.
With the batch optimization enabled, the patchset is 7.5% faster.
I also looked at the performance impact on scans that cannot use the batched
approach. The worst case I could think of was a large ordered indexscan of a
previously unhinted table.
For an index-only scan (IOS), the performance difference is a slowdown of 0.65%.
But the difference being so small is partially just due to IOS being
considerably slower than a plain index scan when all tuples need to be fetched
from the table (we need to address that...). Forcing a non-IOS index scan
using enable_indexonlyscan, I get a slowdown of 5.0%.
Given that this is an intentionally pathological workload - there's no IO /
the workload is CPU bound, yet the data is only ever accessed once, with a
query that doesn't ever actually look at the data - I'm quite happy with that.
As soon as I make it a query that actually uses the data, the difference
vanishes in the noise.
Greetings,
Andres Freund