Re: refactoring relation extension and BufferAlloc(), faster COPY

From: Muhammad Malik <muhammad(dot)malik1(at)hotmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: refactoring relation extension and BufferAlloc(), faster COPY
Date: 2023-05-03 18:25:51
Message-ID: SJ0PR17MB5841FC5FE92FBF6FA76EA07BA66C9@SJ0PR17MB5841.namprd17.prod.outlook.com
Lists: pgsql-hackers

Hi,

Could you please share repro steps for running these benchmarks? I am doing performance testing in this area and want to use the same benchmarks.

Thanks,
Muhammad
________________________________
From: Andres Freund <andres(at)anarazel(dot)de>
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hackers(at)postgresql(dot)org <pgsql-hackers(at)postgresql(dot)org>; Thomas Munro <thomas(dot)munro(at)gmail(dot)com>; Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>; Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer; if that
buffer is dirty, we potentially flush the WAL and then write it out. Then we
zero out the buffer contents and call smgrextend().

Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
   smgrnblocks()
3) mark buffer[s] as IO_IN_PROGRESS

1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.
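
To make that scope concrete, a minimal sketch of the critical section
(illustrative only; InsertIntoBufferMapping() and MarkIoInProgress() are
stand-ins for the bufmgr.c internals, not functions from the patch):

LockRelationForExtension(rel, ExclusiveLock);

/* 1) find out where the relation currently ends */
first_block = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);

for (int i = 0; i < extend_by; i++)
{
    /* 2) make the new block(s) visible in the buffer mapping table */
    InsertIntoBufferMapping(rel, MAIN_FORKNUM, first_block + i, victim_buf[i]);
    /* 3) prevent other backends from using them until we're done */
    MarkIoInProgress(victim_buf[i]);
}

UnlockRelationForExtension(rel, ExclusiveLock);

/*
 * Everything expensive - finding and cleaning victim buffers, zeroing
 * pages, flushing WAL for dirty victims - can happen outside the lock.
 * (The patch as posted does still call smgrextend() before releasing the
 * lock; see the description of the bulk extension flow further down.)
 */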

The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation extension lock, and
that relation extension happens entirely within ReadBuffer* - there's no way
to use a narrower scope for the lock.

My fix for that is to add a dedicated function for extending relations that
can acquire the extension lock if necessary (callers can tell it to skip
that, e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.

To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffers are pinned, cleaned, removed from the buffer mapping
table, and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().

This is similar to Yura's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty of gains, with no issues around the number of buffer mapping entries
that can exist, etc.

Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgrwrite() that
BufferAlloc() does nearly doubles the amount of writes: the kernel first ends
up writing out all the zeroed-out buffers after a while, and then again when
we write out the actual buffer contents.

The best fix for that seems to be to optionally use posix_fallocate() to
reserve space, without dirtying pages in the kernel page cache. However, it
looks like that's only beneficial when extending by multiple pages at once,
because it ends up causing one filesystem-journal entry for each extension
on at least some filesystems.

I added 'smgrzeroextend()' that can extend by multiple blocks, without the
caller providing a buffer to write out. When extending by 8 or more blocks,
posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
used to extend the file.
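
To illustrate the underlying technique, a self-contained sketch at the plain
file-descriptor level (the patch's smgrzeroextend() lives in the smgr/md
layer and handles retries via pg_pwritev_with_retry(); the details here,
including the configure-style HAVE_POSIX_FALLOCATE guard and the helper name,
are assumptions made for the sketch):

#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Add nblocks zeroed blocks to fd starting at 'offset'.  Sketch only:
 * short writes and EINTR retries are not handled here.
 */
static int
zero_extend(int fd, off_t offset, int nblocks)
{
#ifdef HAVE_POSIX_FALLOCATE
    /* larger extensions: reserve space without dirtying the page cache */
    if (nblocks >= 8)
        return posix_fallocate(fd, offset, (off_t) nblocks * BLCKSZ) == 0 ? 0 : -1;
#endif

    /* smaller extensions: write real zeros, several blocks per syscall */
    {
        static const char zerobuf[BLCKSZ];  /* static => all zeroes */
        int         done = 0;

        while (done < nblocks)
        {
            struct iovec iov[8];
            int         cnt = (nblocks - done) > 8 ? 8 : (nblocks - done);

            for (int i = 0; i < cnt; i++)
            {
                iov[i].iov_base = (void *) zerobuf;
                iov[i].iov_len = BLCKSZ;
            }

            if (pwritev(fd, iov, cnt, offset + (off_t) done * BLCKSZ) < 0)
                return -1;
            done += cnt;
        }
    }
    return 0;
}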

b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is primarily the
aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while was knowing how much to extend the
relation by. RelationGetBufferForTuple() bases the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.

After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().

For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.
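
Roughly, that computation amounts to something like the following (a
simplified, standalone sketch with made-up names and constants; the real code
works from the actual tuple lengths and page-layout macros):

#include <stddef.h>

#define BLCKSZ                  8192
#define SizeOfPageHeaderData    24      /* simplified constant for the sketch */
#define ItemIdSize              4       /* sizeof(ItemIdData) */
#define MAXALIGN(len)           (((len) + 7) & ~((size_t) 7))

/*
 * Estimate how many new pages a batch of tuples will need, assuming all
 * pages start out empty.
 */
static int
pages_needed(const size_t *tuple_lens, int ntuples)
{
    size_t      space_on_page = BLCKSZ - SizeOfPageHeaderData;
    size_t      used = 0;
    int         npages = 1;

    for (int i = 0; i < ntuples; i++)
    {
        size_t      need = MAXALIGN(tuple_lens[i]) + ItemIdSize;

        if (used + need > space_on_page)
        {
            npages++;
            used = 0;
        }
        used += need;
    }
    return npages;
}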

c) Contention on the FSM and the pages returned by it is a serious bottleneck
after a) and b).

The biggest issue is that the current bulk insertion logic in hio.c enters
all but one of the new pages into the freespacemap. That will immediately
cause all the other backends to contend on the first few pages returned by
the FSM, and cause contention on the FSM pages themselves.

I've partially addressed that by using the information about the required
number of pages from b). Whether we bulk insert or not, the pages we know
we're going to need for one heap_multi_insert() don't need to be added to the
FSM - we're going to use them anyway.

I've stashed the number of free blocks in the BulkInsertState for now, but
I'm not convinced that that's the right place.
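
In data-structure terms, that amounts to something like the following (the
first two fields are the existing ones from src/include/access/hio.h; the
added fields and their names are a hypothetical sketch, not necessarily what
the patch does):

typedef struct BulkInsertStateData
{
    BufferAccessStrategy strategy;      /* existing: ring-buffer strategy */
    Buffer          current_buf;        /* existing: current insertion target */

    /*
     * Hypothetical sketch of the stashed state: a run of freshly extended
     * blocks that this backend owns outright and therefore never
     * advertises in the FSM.
     */
    BlockNumber     first_free_block;   /* first not-yet-used new block */
    int             num_free_blocks;    /* how many such blocks remain */
} BulkInsertStateData;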

If I revert just this part, the "concurrent COPY into unlogged table"
benchmark goes from ~240 tps to ~190 tps.

Even after that change the FSM is a major bottleneck. Below I included
benchmarks showing this by just removing the use of the FSM, but I haven't
done anything further about it. The contention seems to come both from
updating the FSM and from thundering-herd-like symptoms when accessing the
FSM.

The update part could likely be addressed to some degree with a batch
update operation updating the state for multiple pages.

The access part could perhaps be addressed by adding an operation that gets
a page and immediately marks it as fully used, so other backends won't also
try to access it.
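
For the access part, such an operation could be a thin wrapper over the
existing freespace.c API, along these lines (a speculative sketch, not
something this patchset implements; a real version would presumably combine
the lookup and the update into a single FSM traversal):

#include "postgres.h"

#include "storage/freespace.h"

/*
 * Speculative sketch: find a page with at least 'spaceNeeded' bytes free
 * and immediately advertise it as full, so that concurrent backends are
 * not all funneled onto the same page.
 */
static BlockNumber
GetAndClaimPageWithFreeSpace(Relation rel, Size spaceNeeded)
{
    BlockNumber blkno = GetPageWithFreeSpace(rel, spaceNeeded);

    if (blkno != InvalidBlockNumber)
        RecordPageWithFreeSpace(rel, blkno, 0);

    return blkno;
}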

d) Doing

    /* new buffers are zero-filled */
    MemSet((char *) bufBlock, 0, BLCKSZ);

under the extension lock is surprisingly expensive on my two-socket
workstation (but much less noticeable on my laptop).

If I move the MemSet back under the extension lock, the "concurrent COPY
into unlogged table" benchmark goes from ~240 tps to ~200 tps.

e) When running a few benchmarks for this email, I noticed that there was a
sharp performance dropoff for the patched code for a pgbench -S -s100 on a
database with 1GB s_b, starting between 512 and 1024 clients. This started
with the patch only acquiring one buffer partition lock at a time. Lots of
debugging ensued, resulting in [3].

The problem isn't actually related to the change, the change just makes it
more visible, because the "lock chains" between two partitions reduce the
average length of the wait queues substantially, by distributing them across
more partitions. [3] has a reproducer that's entirely independent of this
patchset.

Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend(), and releases the extension lock. After
that the buffer[s] are locked (depending on the mode and an argument
indicating the number of blocks to be locked), and TerminateBufferIO() is
called.

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows
each backend to pin a proportional share of buffers (although always allowing
one, as we do today).
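
The rule itself is simple enough to sketch (illustrative only, not the
patch's exact code):

#include <stdint.h>

/*
 * Clamp the number of additional buffers a backend may pin at once to its
 * proportional share of shared buffers, but always allow at least one.
 */
static uint32_t
limit_additional_pins(uint32_t requested, uint32_t nbuffers, uint32_t max_backends)
{
    uint32_t    fair_share = nbuffers / max_backends;

    if (fair_share < 1)
        fair_share = 1;

    return requested > fair_share ? fair_share : requested;
}

With the configuration used for the benchmarks below (shared_buffers=24GB,
i.e. roughly three million buffers, and max_connections=5000) that works out
to roughly 600 additional pins per backend.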

Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (now released
inside ReleaseAuxProcessResources()).

In very extreme workloads (a single backend doing a pgbench -S -s 100 against
an s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible. I'm not sure whether that's worth worrying about - we
do such allocations for the much more common pinning of buffers as well.

The new [Bulk]ExtendSharedRelationBuffered() functions currently have both a
Relation and an SMgrRelation argument, requiring at least one of them to be
set. The reason for that is, on the one hand, that LockRelationForExtension()
requires a Relation and, on the other hand, that redo routines typically
don't have a Relation around (recovery doesn't require an extension lock).
That's not pretty, but seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.
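
The rough shape of that API is the following (the exact parameter list here
is a guess for illustration, not copied from the patch):

/*
 * At least one of 'rel' and 'smgr' must be provided; redo routines
 * typically pass only the SMgrRelation, since recovery has no Relation
 * and doesn't need the extension lock.
 */
extern Buffer ExtendSharedRelationBuffered(Relation rel,
                                           SMgrRelation smgr,
                                           bool skip_extension_lock,
                                           char relpersistence,
                                           ForkNumber forkNum,
                                           ReadBufferMode mode,
                                           BufferAccessStrategy strategy);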

I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads; more on that below.

server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1]; ring buffer not effective, no dirty
data to write out within the benchmark window) and somewhat larger files
([2]; lots of data to write out due to the ring buffer).

To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.

s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 58.63 207 50.22 242 54.35 224
2 32.67 372 25.82 472 27.30 446
4 22.53 540 13.30 916 14.33 851
8 15.14 804 7.43 1640 7.48 1632
16 14.69 829 4.79 2544 4.50 2718
32 15.28 797 4.41 2763 3.32 3710
64 15.34 794 5.22 2334 3.06 4061
128 15.49 786 4.97 2452 3.13 3926
256 15.85 768 5.02 2427 3.26 3769
512 16.02 760 5.29 2303 3.54 3471

test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 68.18 178 59.41 205 63.43 192
2 39.71 306 33.10 368 34.99 348
4 27.26 446 19.75 617 20.09 607
8 18.84 646 12.86 947 12.68 962
16 15.96 763 9.62 1266 8.51 1436
32 15.43 789 8.20 1486 7.77 1579
64 16.11 756 8.91 1367 8.90 1383
128 16.41 742 10.00 1218 9.74 1269
256 17.33 702 11.91 1023 10.89 1136
512 18.46 659 14.07 866 11.82 1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 63.27s 192 56.14 217 59.25 205
2 40.17s 303 29.88 407 31.50 386
4 27.57s 442 16.16 754 17.18 709
8 21.26s 573 11.89 1025 11.09 1099
16 21.25s 573 10.68 1141 10.22 1192
32 21.00s 580 10.72 1136 10.35 1178
64 20.64s 590 10.15 1200 9.76 1249
128 skipped
256 skipped
512 skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 71.89s 169 65.57 217 69.09 69.09
2 47.36s 257 36.22 407 38.71 38.71
4 33.10s 368 21.76 754 22.78 22.78
8 26.62s 457 15.89 1025 15.30 15.30
16 24.89s 489 17.08 1141 15.20 15.20
32 25.15s 484 17.41 1136 16.14 16.14
64 26.11s 466 17.89 1200 16.76 16.76
128 skipped
256 skipped
512 skipped

Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.

test: unlogged_small_files, format: binary, files: 1024, 9508MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 34.14 357 28.04 434 29.46 413
2 22.67 537 14.42 845 14.75 826
4 16.63 732 7.62 1599 7.69 1587
8 13.48 904 4.36 2795 4.13 2959
16 14.37 848 3.78 3224 2.74 4493
32 14.79 823 4.20 2902 2.07 5974
64 14.76 825 5.03 2423 2.21 5561
128 14.95 815 4.36 2796 2.30 5343
256 15.18 802 4.31 2828 2.49 4935
512 15.41 790 4.59 2656 2.84 4327

s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.55 194 54.22 224
2 37.11 328 28.94 421
4 25.97 469 16.42 742
8 20.01 609 11.92 1022
16 19.55 623 11.05 1102
32 19.34 630 11.27 1081
64 19.07 639 12.04 1012
128 19.22 634 12.27 993
256 19.34 630 12.28 992
512 19.60 621 11.74 1038

test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.71 169 63.63 191
2 46.93 259 36.31 335
4 30.37 401 22.41 543
8 22.86 533 16.90 721
16 20.18 604 14.07 866
32 19.64 620 13.06 933
64 19.71 618 15.08 808
128 19.95 610 15.47 787
256 20.48 595 16.53 737
512 21.56 565 16.86 722

test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.65 194 55.74 218
2 40.25 302 29.45 413
4 27.37 445 16.26 749
8 22.07 552 11.75 1037
16 21.29 572 10.64 1145
32 20.98 580 10.70 1139
64 20.65 590 10.21 1193
128 skipped
256 skipped
512 skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.72 169 65.12 187
2 46.46 262 35.74 341
4 32.61 373 21.60 564
8 26.69 456 16.30 747
16 25.31 481 17.00 716
32 24.96 488 17.47 697
64 26.05 467 17.90 680
128 skipped
256 skipped
512 skipped

test: unlogged_small_files, format: binary, files: 1024, 9505MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 37.62 323 32.77 371
2 28.35 429 18.89 645
4 20.87 583 12.18 1000
8 19.37 629 10.38 1173
16 19.41 627 10.36 1176
32 18.62 654 11.04 1103
64 18.33 664 11.89 1024
128 18.41 661 11.91 1023
256 18.52 658 12.10 1007
512 18.78 648 11.49 1060

benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.

The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect
vanish.

My not-great solution for this was to run with -T21 -P5 and use the best 5s
as the tps.

s_b=1GB
tps tps
clients master patched
1 49541 48805
2 85342 90010
4 167340 168918
8 308194 303222
16 524294 523678
32 649516 649100
64 932547 937702
128 908249 906281
256 856496 903979
512 764254 934702
1024 653886 925113
2048 569695 917262
4096 526782 903258

s_b=128MB:
tps tps
clients master patched
1 40407 39854
2 73180 72252
4 143334 140860
8 240982 245331
16 429265 420810
32 544593 540127
64 706408 726678
128 713142 718087
256 611030 695582
512 552751 686290
1024 508248 666370
2048 474108 656735
4096 448582 633040

As there might be a small regression at the smallest end, I ran a more
extreme version of the above: a pipelined pgbench -S, with a single client,
for longer, with s_b=8MB.

To further reduce noise I pinned the server to one cpu, the client to another
and disabled turbo mode on the CPU.

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.

My guess, however, is that it's the resource owner tracking for in-progress
IO that I added - it adds an additional allocation inside the resowner
machinery. I commented that out (which is obviously incorrect!) just to see
whether it changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So it does indeed look like it's the resowner. I am a bit surprised, because
we already use that mechanism for pins, which are obviously more frequent.

I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:

- We could preallocate an initial element inside the ResourceArray struct, so
  that a newly created resowner won't need to allocate immediately (a sketch
  of that idea follows below).
- We could only use resowners if there's more than one IO in progress at the
  same time - but I don't like that idea much.
- We could try to store the "in-progress"-ness of a buffer inside the
  'bufferpin' resowner entry - on 64-bit systems there's plenty of space for
  that. But on 32-bit systems...
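
To illustrate the first idea, a generic sketch of the "inline first element"
pattern (resowner.c's ResourceArray is more involved than this; the names
here are made up):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Small-array optimization: the first element is stored inline in the
 * struct, so remembering a single resource never allocates.
 */
typedef struct ResourceArraySketch
{
    uint64_t   *items;          /* points at 'initial' or a malloc'd array */
    uint32_t    nitems;
    uint32_t    capacity;
    uint64_t    initial[1];     /* inline storage for the common case */
} ResourceArraySketch;

static void
resarray_init(ResourceArraySketch *ra)
{
    ra->items = ra->initial;
    ra->nitems = 0;
    ra->capacity = 1;
}

static int
resarray_add(ResourceArraySketch *ra, uint64_t value)
{
    if (ra->nitems == ra->capacity)
    {
        uint32_t    newcap = ra->capacity * 2;
        uint64_t   *newitems = malloc(newcap * sizeof(uint64_t));

        if (newitems == NULL)
            return -1;
        memcpy(newitems, ra->items, ra->nitems * sizeof(uint64_t));
        if (ra->items != ra->initial)
            free(ra->items);
        ra->items = newitems;
        ra->capacity = newcap;
    }
    ra->items[ra->nitems++] = value;
    return 0;
}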

The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
