From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Subject: Bufferless buffered files
Date: 2023-04-23 01:09:14
Message-ID: CA+hUKGKCnU9NjFfzO219V-YeyWr8mZe4JRrf=x_uv6qsePBcOw@mail.gmail.com
Lists: pgsql-hackers
"$SUBJECT" may sound like a Zen kōan, but buffile.c does a few useful
things other than buffering (I/O time tracking, inter-process file
exchange through file sets, segmentation, etc.). A couple of places
such as logtape.c, sharedtuplestore.c and gistbuildbuffers.c do
block-oriented I/O in temporary files, but unintentionally also pay
the price of buffile.c's buffering scheme for no good reason, namely:
* extra memory for the buffer
* extra memcpy() to/from that buffer
* I/O chopped up into BLCKSZ-sized system calls
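
To make that last point concrete, here is a standalone sketch (plain POSIX C, not buffile.c code, with an assumed 8KB block size): pushing a 32KB chunk through a block-sized staging buffer costs four memcpy()s and four system calls, while writing it out directly costs one pwrite() and no intermediate copy.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192
    #define CHUNK_SIZE 32768

    int
    main(void)
    {
        int     fd = open("demo.tmp", O_CREAT | O_RDWR | O_TRUNC, 0600);
        char   *chunk = calloc(1, CHUNK_SIZE);
        char    staging[BLOCK_SIZE];

        /* Buffered style: the 32KB chunk is chopped into BLOCK_SIZE pieces. */
        for (int i = 0; i < CHUNK_SIZE; i += BLOCK_SIZE)
        {
            memcpy(staging, chunk + i, BLOCK_SIZE);     /* the extra copy */
            pwrite(fd, staging, BLOCK_SIZE, i);         /* one of four syscalls */
        }

        /* Direct style: one system call, no intermediate copy. */
        pwrite(fd, chunk, CHUNK_SIZE, CHUNK_SIZE);

        free(chunk);
        close(fd);
        return 0;
    }
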
Melanie and I were bouncing some ideas around about the hash join
batch memory usage problem, and I thought it might be good to write
down these thoughts about lower level buffer management stuff in a new
thread. They don't address the hard algorithmic issues in that topic
(namely: at some point, doubling the number of partition files will
take more buffer space than can be saved by halving the size of the
hash table, parallel or not), but we probably should remove a couple
of the constants that make the parallel case worse. As a side benefit
it would be nice to also fix some obvious silly efficiency problems
for all users of BufFiles.
First, here is a toy patch to try to skip the buffering layer in
BufFile when possible, without changing any interfaces. Other ways
would of course be possible, including a new separate module with a
different name. The approach taken here is to defer allocation of
BufFile::buffer, and use a fast path that just passes I/O straight
through to the OS without touching the sides, until the first
non-block-aligned I/O is seen, at which point we switch to the less
efficient mode. Don't take this code too seriously, and I'm not sure
what to think about short reads/writes yet; it's just a cheap demo to
accompany this problem statement.
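
For anyone who doesn't want to open the patch, the shape of the idea is roughly as follows. This is a loose sketch in plain C rather than the patch itself, and the "blockfile" type and its fields are invented for illustration: the block buffer is allocated lazily, aligned block-sized requests go straight to the OS, and the first request that doesn't line up flips the file into the ordinary buffered mode.

    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    typedef struct blockfile
    {
        int     fd;
        off_t   offset;         /* current logical position in the file */
        char   *buffer;         /* allocated lazily, only if ever needed */
    } blockfile;

    static ssize_t
    blockfile_write(blockfile *bf, const void *data, size_t size)
    {
        ssize_t     written;

        if (bf->buffer == NULL &&
            bf->offset % BLOCK_SIZE == 0 &&
            size % BLOCK_SIZE == 0)
        {
            /* Fast path: block-aligned request goes straight to the OS. */
            written = pwrite(bf->fd, data, size, bf->offset);
        }
        else
        {
            /*
             * Slow path: the first misaligned request allocates the block
             * buffer, and from here on everything would be staged through
             * it, as buffile.c does today (that logic is elided in this
             * sketch).
             */
            if (bf->buffer == NULL)
                bf->buffer = malloc(BLOCK_SIZE);
            written = pwrite(bf->fd, data, size, bf->offset);
        }

        if (written > 0)
            bf->offset += written;
        return written;
    }
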
We could go further (and this one is probably more important).
SharedTuplestore works in 32KB chunks, which are the "parallel grain"
for distributing data to workers that read the data back later
(usually the workers try to read from different files to avoid
stepping on each other's toes, but when there aren't enough files left
they gang up and read from one file in parallel, and we need some way
to scatter the data to different workers; these chunks are analogous
to the heap table blocks handed out by Parallel Seq Scan). With the
bufferless-BufFile optimisation, data is written out directly from
sharedtuplestore.c's buffer to the OS in a 32KB pwrite(), which is a
nice improvement, but that's 32KB of buffer space that non-parallel HJ
doesn't have. We could fix that by making the chunk size dynamic, using some heuristic to turn it down as low as 8KB when the number of partitions shoots up to silly numbers. (I guess the buffer size and the
chunk size don't have to be strictly the same, but it'd probably make
the code simpler...) With those changes, PHJ would use strictly no more buffer memory than HJ, except when npartitions is fairly low and it should be OK to use a bit more (and we could even consider making the chunks larger sometimes for bigger I/Os).
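
One possible shape for that heuristic, with invented names and thresholds rather than anything from the patch: keep 32KB chunks while the total buffer space implied by the partition count fits some budget, and halve the chunk size down to a floor of 8KB as the partition count climbs.

    #include <stddef.h>

    #define MIN_CHUNK_SIZE  (8 * 1024)
    #define MAX_CHUNK_SIZE  (32 * 1024)

    static size_t
    choose_chunk_size(int npartitions, size_t buffer_budget)
    {
        size_t      chunk_size = MAX_CHUNK_SIZE;

        /* Halve the chunk size until the per-partition buffers fit the budget. */
        while (chunk_size > MIN_CHUNK_SIZE &&
               (size_t) npartitions * chunk_size > buffer_budget)
            chunk_size /= 2;

        return chunk_size;
    }

With a 4MB budget, for example, 64 partitions would keep 32KB chunks while 1024 partitions would hit the 8KB floor.
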
(Obviously it's just terrible in every way when we have very high
numbers of partitions, and this just mitigates one symptom. For
example, once we get to our max fds limit, we're re-opening and
re-closing files at high speed, which is bonkers, and the kernel has
an identical buffer space management problem on its side of the wall.
In the past we've talked about something a bit more logtape.c-like
(multiple streams in one file with our own extent management) which
might help a bit. But... it feels like what we really need is a
higher level algorithmic solution that lets us cap the number of
partitions, as chronicled elsewhere.)
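
For what it's worth, that logtape.c-like idea boils down to a per-stream extent map over one shared file, so that many logical streams don't each need their own file descriptor. A bare-bones sketch, with all names, sizes and the fixed map capacity invented here:

    #include <stdint.h>

    #define EXTENT_BLOCKS   16      /* blocks handed out per extent */
    #define MAX_EXTENTS     1024    /* fixed map capacity, for the sketch only */

    typedef struct stream_map
    {
        uint32_t    extents[MAX_EXTENTS];   /* physical extent for each logical extent */
        int         nextents;
        uint32_t   *next_free_extent;       /* shared cursor: next unused extent in the file */
    } stream_map;

    /* Translate a stream-local block number into a block number in the shared file. */
    static int64_t
    stream_physical_block(stream_map *sm, int64_t logical_block)
    {
        int64_t     extent = logical_block / EXTENT_BLOCKS;
        int64_t     offset = logical_block % EXTENT_BLOCKS;

        if (extent >= MAX_EXTENTS)
            return -1;              /* out of room in this toy map */

        /* Claim the file's next free extents on demand as the stream grows. */
        while (extent >= sm->nextents)
            sm->extents[sm->nextents++] = (*sm->next_free_extent)++;

        return (int64_t) sm->extents[extent] * EXTENT_BLOCKS + offset;
    }
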
We could go further again. The above concerns the phase where we're
writing data out to many partitions, which requires us to have
in-memory state for all of them. But during the reading back phase,
we read from just a single partition at a time, so we don't have the
memory explosion problem. For dumb implementation reasons
(essentially because it was styled on the older pre-parallel HJ code),
SharedTuplestore reads one tuple at a time from buffile.c with
BufFileRead(), so while reading it *needs* BufFile's internal
buffering, which means 8KB pread() calls, no optimisation for you. We
could change it to read in whole 32KB chunks at a time (or dynamic
size as mentioned already), via the above faster bufferless path. The
next benefit would be that, with whole chunks in *its* buffer, it
could emit MinimalTuples using a pointer directly into that buffer,
skipping another memcpy() (unless the tuple is too big and involves
overflow chunks, a rare case). (Perhaps there could even be some
clever way to read whole chunks directly into the hash table memory
chunks at once, instead of copying tuples in one by one, but I haven't examined that.)
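
To sketch what chunk-at-a-time reading with pointer hand-out could look like (this is not sharedtuplestore.c's real interface or on-disk format; the uint32 length prefix per tuple and the zero-length padding marker are invented for the example, and tuples that overflow a chunk aren't handled):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK_SIZE 32768

    typedef struct chunk_reader
    {
        int         fd;
        off_t       next_offset;        /* file offset of the next unread chunk */
        char        chunk[CHUNK_SIZE];  /* a whole chunk, filled by one pread() */
        size_t      nbytes;             /* valid bytes in chunk */
        size_t      pos;                /* read position within chunk */
    } chunk_reader;

    /* Return a pointer into the chunk buffer for the next tuple, or NULL at EOF. */
    static const char *
    chunk_reader_next(chunk_reader *cr, uint32_t *len)
    {
        for (;;)
        {
            const char *tuple;

            /* Load the next whole chunk once the current one is used up. */
            if (cr->pos + sizeof(uint32_t) > cr->nbytes)
            {
                ssize_t     nread = pread(cr->fd, cr->chunk, CHUNK_SIZE,
                                          cr->next_offset);

                if (nread <= 0)
                    return NULL;        /* end of file (or error) */
                cr->nbytes = (size_t) nread;
                cr->next_offset += nread;
                cr->pos = 0;
            }

            memcpy(len, cr->chunk + cr->pos, sizeof(uint32_t));
            if (*len == 0)
            {
                cr->pos = cr->nbytes;   /* zero length marks chunk padding; move on */
                continue;
            }

            tuple = cr->chunk + cr->pos + sizeof(uint32_t);
            cr->pos += sizeof(uint32_t) + *len;
            return tuple;               /* points into cr->chunk; the tuple isn't copied */
        }
    }
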
Another closely related topic that came up in other recent I/O work:
How can we arrange for all of these buffers to be allocated more
efficiently? In fact we know how many partitions there are at a
higher level, and we'd ideally make one
palloc_aligned(PG_IO_ALIGN_SIZE) for all of them, instead of many
individual smaller pallocs. Aside from wasted CPU cycles, individual
pallocs have invisible space overheads at both libc and palloc level,
and non-I/O-aligned buffers have invisible costs at I/O system call
level (the kernel must hold/pin/something an extra neighbouring VM
page while copying in/out of user space, if not nicely aligned).
However, palloc_aligned()'s strategy is to over-allocate by 4KB and
then round the pointer up, so we'd ideally like to do that just once
for a whole array, not once for each BufFile (you can see this problem
in the attached patch, which does that for individual BufFile
objects, but it'd make the problem Jehan-Guillaume wrote about
slightly worse, which is why I didn't do that in commit faeedbcef
"Introduce PG_IO_ALIGN_SIZE and align all I/O buffers."). This seems
to suggest a different design where the client code can optionally
provide its own buffer space, perhaps something as simple as
BufFileSetBuffer(), and we'd need to make one big allocation and pass
in pointers to its elements for the raw batch files (for HJ), and come
up with something similar for SharedTuplestore's internal chunk buffer
(for PHJ).
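
As a rough sketch of that caller-provides-the-buffer shape, using plain posix_memalign() and an invented bf_set_buffer() as a stand-in for the hypothetical BufFileSetBuffer(): one aligned allocation (and therefore one alignment round-up) covers all of the batch files, instead of one over-allocated palloc_aligned() per file. Since the slice size is a multiple of the alignment, every slice stays I/O-aligned.

    #include <stdlib.h>

    #define IO_ALIGN_SIZE   4096    /* stand-in for PG_IO_ALIGN_SIZE */
    #define BUFFER_SIZE     8192    /* one block-sized buffer per file */

    typedef struct batch_file batch_file;   /* stand-in for BufFile */

    /* Invented: hand a caller-supplied buffer to a file, as BufFileSetBuffer() might. */
    extern void bf_set_buffer(batch_file *file, char *buffer, size_t size);

    static char *
    assign_batch_buffers(batch_file **files, int nfiles)
    {
        char       *pool;

        /* One aligned allocation for every batch file's buffer. */
        if (posix_memalign((void **) &pool, IO_ALIGN_SIZE,
                           (size_t) nfiles * BUFFER_SIZE) != 0)
            return NULL;

        /* Hand each file its slice; each slice starts on an aligned boundary. */
        for (int i = 0; i < nfiles; i++)
            bf_set_buffer(files[i], pool + (size_t) i * BUFFER_SIZE, BUFFER_SIZE);

        return pool;            /* freed by the caller when the batches are done */
    }
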
Attachment: 0001-Skip-useless-buffering-in-BufFile.patch (text/x-patch, 8.7 KB)