From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Use generation memory context for tuplestore.c
Date: 2024-05-03 13:55:22
Message-ID: CAApHDvp5Py9g4Rjq7_inL3-MCK1Co2CRt_YWFwTU2zfQix0p4A@mail.gmail.com
Lists: pgsql-hackers
(40af10b57 did this for tuplesort.c; this patch does the same, but for tuplestore.c.)
I was looking at the tuplestore.c code a few days ago and noticed that
it allocates tuples in the memory context that tuplestore_begin_heap()
is called in, which, for nodeMaterial.c, is ExecutorState.
I didn't think this was great because:
1. Allocating many chunks in ExecutorState can bloat the context with
many blocks worth of free'd chunks, stored on freelists that might
never be reused for anything.
2. To clean up the memory, pfree must be meticulously called on each
allocated tuple.
3. ExecutorState is an aset.c context which isn't the most efficient
allocator for this purpose.
I've attached 2 patches:
0001: Adds memory tracking to Materialize nodes, which looks like:
-> Materialize (actual time=0.033..9.157 rows=10000 loops=2)
Storage: Memory Maximum Storage: 10441kB
0002: Creates a Generation MemoryContext for storing tuples in tuplestore.
Using generation has the following advantages:
1. It does not round allocations up to the next power of 2. Using
generation will save an average of 25% memory for tuplestores or allow
an average of 25% more tuples before going to disk.
2. Allocation patterns in tuplestore.c are FIFO, which is exactly what
generation was designed to handle best.
3. Generation is faster to palloc/pfree than aset. (See [1]. Compare
the 4-bit times between aset_palloc_pfree.png and
generation_palloc_pfree.png)
4. tuplestore_clear() and tuplestore_end() can reset or delete the
tuple context instead of pfreeing every tuple one by one.
5. Higher likelihood of neighbouring tuples being stored consecutively
in memory, resulting in better CPU memory prefetching.
6. Generation has a page-level freelist, so is able to reuse pages
instead of freeing and mallocing another if tuplestore_trim() is used
to continually remove no longer needed tuples. aset.c can only
efficiently do this if the tuples are all in the same size class.
The attached bench.sh.txt tests the performance of this change, and
result_chart.png shows the results I got when running on an AMD 3990x,
comparing master @ 8f0a97dff against patched.
The script runs benchmarks for various tuple counts stored in the
tuplestore -- 1 to 8192, in powers of 2.
The script also outputs the memory consumed by the tuplestore for each
query. Here are the results for the 8192-tuple test:
master @ 8f0a97dff
Storage: Memory Maximum Storage: 16577kB
patched:
Storage: Memory Maximum Storage: 8577kB
That's roughly half. However, I did pad the tuples to just over 1024
bytes, so the alloc set allocations would have been rounded up to 2048
bytes.
Some things I've *not* done:
1. Gone over other executor nodes which use tuplestore to add the same
additional EXPLAIN output. CTE Scan, Recursive Union, Window Agg
could get similar treatment.
2. Given much consideration for the block sizes to use for
GenerationContextCreate(). (Maybe using ALLOCSET_SMALL_INITSIZE for
the start size is a good idea.)
3. A great deal of testing.
I'll park this here until we branch for v18.
David
Attachments:
bench.sh.txt (text/plain, 848 bytes)
result_chart.png (image/png, 78.5 KB)
v1-0001-Add-memory-disk-usage-for-Material-in-EXPLAIN-ANA.patch (application/octet-stream, 13.6 KB)
v1-0002-Don-t-use-ExecutorState-memory-context-for-tuples.patch (application/octet-stream, 4.0 KB)