From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jack Christensen <jack(at)jncsoftware(dot)com>, David Fetter <david(at)fetter(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why is pq_begintypsend so slow?
Date: 2020-06-03 18:10:50
Message-ID: 20200603181050.siprrxypltms5zbp@alap3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2020-06-03 11:30:42 -0400, Robert Haas wrote:
> I too have seen recent benchmarking data where this was a big problem.
> Basically, you need a workload where the server doesn't have much or
> any actual query processing to do, but is just returning a lot of
> stuff to a really fast client - e.g. a locally connected client.
> That's not necessarily the most common case but, if you have it, all
> this extra copying is really pretty expensive.
Even when the query actually is doing something, it's still quite
possible for the memcpies to be measurable (say > 10% of
cycles). Obviously not in a huge aggregating query. But even in something
like pgbench -M prepared -S, which is obviously spending most of its
cycles elsewhere, the patches upthread improve throughput by ~1.5% (and
that's without yet eliding several unnecessary copies).
> My first thought was to wonder about changing all of our send/output
> functions to write into a buffer passed as an argument rather than
> returning something which we then have to copy into a different
> buffer, but that would be a somewhat painful change, so it is probably
> better to first pursue the idea of getting rid of some of the other
> copies that happen in more centralized places (e.g. printtup).
For those I think allocator overhead is a bigger issue than the
memcpy itself. I wonder how much of that we could transparently hide in
pq_begintypsend()/pq_endtypsend().
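To make that concrete, here's a minimal sketch of one way the hiding
could look; the _cached names and the one-slot cache are invented for
illustration, not proposed code:

static StringInfoData cached_typsend_buf;
static bool cached_typsend_buf_valid = false;

void
pq_begintypsend_cached(StringInfo buf)
{
    if (cached_typsend_buf_valid)
    {
        /* reuse the retained allocation instead of palloc'ing anew */
        *buf = cached_typsend_buf;
        resetStringInfo(buf);
        cached_typsend_buf_valid = false;
    }
    else
        initStringInfo(buf);

    /* reserve four bytes for the bytea length word, as today */
    appendStringInfoSpaces(buf, 4);
}

bytea *
pq_endtypsend_cached(StringInfo buf)
{
    /* copy out a right-sized result, retain the big buffer for reuse */
    bytea      *result = (bytea *) palloc(buf->len);

    memcpy(result, buf->data, buf->len);
    SET_VARSIZE(result, buf->len);

    /* NB: a real version must keep this in a long-lived memory context */
    cached_typsend_buf = *buf;
    cached_typsend_buf_valid = true;
    return result;
}

That trades one right-sized memcpy for the repeated 1kB palloc per
datum, which is the right trade if the allocator overhead dominates.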
> I
> wonder if we could replace the whole
> pq_beginmessage...()/pq_send....()/pq_endmessage...() system with
> something a bit better-designed. For instance, suppose we get rid of
> the idea that the caller supplies the buffer, and we move the
> responsibility for error recovery into the pqcomm layer. So you do
> something like:
>
> my_message = xyz_beginmessage('D');
> xyz_sendint32(my_message, 42);
> xyz_endmessage(my_message);
>
> Maybe what happens here under the hood is we keep a pool of free
> message buffers sitting around, and you just grab one and put your
> data into it.
Why do we need multiple buffers? ISTM we don't want to just send
messages at endmsg() time, because that implies unnecessary syscall
overhead. Nor do we want to incur the overhead of copying from the
message buffer to the network buffer.
To me that seems to imply that the best approach would be to make
PqSendBuffer something stringbuffer-like, and have pq_beginmessage()
record the starting position of the current message somewhere
(->cursor?). When an error is thrown, we reset the position to where
the in-progress message would have begun.
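I.e. something roughly like this (a sketch only; the _direct names and
globals are invented, and PqSendBuffer is assumed to be
StringInfo-shaped):

static StringInfoData PqSendBuffer;     /* assumed StringInfo-shaped */
static int  PqCurrentMsgStart = -1;     /* -1: no message in progress */

void
pq_beginmessage_direct(char msgtype)
{
    PqCurrentMsgStart = PqSendBuffer.len;
    appendStringInfoChar(&PqSendBuffer, msgtype);
    /* reserve room for the length word; patched by endmessage */
    appendStringInfoSpaces(&PqSendBuffer, 4);
}

void
pq_endmessage_direct(void)
{
    uint32      n32;

    /* the length word counts itself, but not the message-type byte */
    n32 = pg_hton32((uint32) (PqSendBuffer.len - PqCurrentMsgStart - 1));
    memcpy(PqSendBuffer.data + PqCurrentMsgStart + 1, &n32, sizeof(n32));
    PqCurrentMsgStart = -1;
}

void
pq_abortmessage_direct(void)
{
    /* error recovery: truncate the half-built message, keep earlier ones */
    if (PqCurrentMsgStart >= 0)
    {
        PqSendBuffer.len = PqCurrentMsgStart;
        PqCurrentMsgStart = -1;
    }
}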
I've previously outlined a slightly more complicated scheme, where we
have "proxy" stringinfos that point into another stringinfo instead of
owning their own buffer, and that know how to resize the "outer" buffer
when needed. That'd have some advantages, but I'm not sure it's really
needed.
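I.e. roughly this shape (hypothetical type; the proxy stores offsets
rather than pointers, so the outer buffer is free to move as it grows):

typedef struct ProxyStringInfo
{
    StringInfo  outer;          /* buffer that actually owns the bytes */
    int         start;          /* offset of this proxy's data in outer */
} ProxyStringInfo;

static void
proxy_append(ProxyStringInfo *p, const char *data, int datalen)
{
    /* growing the proxy just grows (and possibly moves) the outer buffer */
    appendBinaryStringInfo(p->outer, data, datalen);
}

static int
proxy_len(const ProxyStringInfo *p)
{
    return p->outer->len - p->start;
}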
There are some disadvantages to what I describe above, in particular
when dealing with send() sending only part of our network buffer: we
couldn't cheaply reuse the already-sent memory in that case.
I've wondered / suggested before that we should have StringInfos not
insist on having one contiguous buffer (which obviously implies needing
to copy contents when growing). Instead they'd keep a list of buffers
containing chunks of the data, and never copy contents around while the
string is being built. We'd only allocate a buffer big enough for all
the data when the caller actually wants the resulting data as one
string (rather than using an API that can iterate over chunks).
For the network buffer case that'd allow us to reuse the earlier
buffers even in the "partial send" case. And more generally it'd allow
us to be less wasteful with buffer sizes, and perhaps even to have a
small "inline" buffer inside StringInfoData, avoiding memory
allocations entirely in the common case of only a small amount of data
being used. And I think the overhead of appending data to such a
stringinfo should be negligible, because it'd just require the exact
same checks we already have to do for enlargeStringInfo().
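To illustrate, a sketch of that layout (all names invented):

#define CHUNKED_INLINE_SIZE 128

typedef struct BufferChunk
{
    struct BufferChunk *next;
    char       *data;
    int         len;
    int         capacity;
} BufferChunk;

typedef struct ChunkedStringInfo
{
    BufferChunk *head;          /* first chunk, the inline one */
    BufferChunk *tail;          /* chunk currently appended to */
    int         total_len;
    BufferChunk inline_chunk;   /* avoids any palloc for short strings */
    char        inline_data[CHUNKED_INLINE_SIZE];
} ChunkedStringInfo;

static void
chunked_init(ChunkedStringInfo *cs)
{
    cs->inline_chunk.next = NULL;
    cs->inline_chunk.data = cs->inline_data;
    cs->inline_chunk.len = 0;
    cs->inline_chunk.capacity = CHUNKED_INLINE_SIZE;
    cs->head = cs->tail = &cs->inline_chunk;
    cs->total_len = 0;
}

static void
chunked_append(ChunkedStringInfo *cs, const char *data, int len)
{
    BufferChunk *c = cs->tail;

    /* the only per-append check: same cost as enlargeStringInfo()'s */
    if (c->len + len > c->capacity)
    {
        /* instead of repalloc+memcpy, start a new chunk */
        int         newcap = Max(len, c->capacity * 2);
        BufferChunk *n = palloc(sizeof(BufferChunk));

        n->data = palloc(newcap);
        n->next = NULL;
        n->len = 0;
        n->capacity = newcap;
        c->next = n;
        cs->tail = c = n;
    }
    memcpy(c->data + c->len, data, len);
    c->len += len;
    cs->total_len += len;
}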
> (4) seems unavoidable AFAIK.
Not entirely. Linux can do zero-copy sends, but it requires somewhat
complicated black magic rituals, including more complex buffer
management in the application, because the memory containing the
to-be-sent data cannot be reused until the kernel notifies us that it's
done with the buffer.
See https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html
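For concreteness, the basic shape of the ritual from that document
(error handling trimmed; a real implementation would reap completions
asynchronously rather than spin):

#include <errno.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

#ifndef SO_ZEROCOPY             /* older headers may lack these */
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static int
send_zerocopy(int sock, const char *buf, size_t len)
{
    int         one = 1;
    struct msghdr msg = {0};
    char        control[128];
    struct cmsghdr *cm;
    struct sock_extended_err *serr;

    if (setsockopt(sock, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;

    /* the kernel may pin these pages; 'buf' must stay untouched for now */
    if (send(sock, buf, len, MSG_ZEROCOPY) < 0)
        return -1;

    /* completion arrives on the error queue, identifying finished
     * sends by sequence number; spinning here is only for brevity */
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    while (recvmsg(sock, &msg, MSG_ERRQUEUE) < 0)
    {
        if (errno != EAGAIN)
            return -1;
    }

    cm = CMSG_FIRSTHDR(&msg);
    serr = (struct sock_extended_err *) CMSG_DATA(cm);
    /* serr->ee_info .. serr->ee_data is the completed send() range */
    return (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) ? 0 : -1;
}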
That might be something worth pursuing in the future (since it, I think,
basically avoids spending any cpu cycles on copying data around in the
happy path, relying on DMA instead), but I think for now there's much
bigger fish to fry.
I am hoping that somebody will write a nicer abstraction for zero-copy
sends using io_uring, avoiding the need for a separate completion queue
by simply only signalling completion for the sendmsg operation once the
buffer isn't needed anymore. There's no completion notification at all
for normal sendmsg() calls, so it makes sense that a separate mechanism
had to be invented before something like io_uring existed.
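Purely to illustrate the hoped-for shape, using liburing primitives
that do exist plus an imagined deferred-completion send; nothing like
this exists today:

#include <liburing.h>

static void
queue_zerocopy_send(struct io_uring *ring, int sock,
                    const void *buf, size_t len, void *owner)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    /* imagine a zero-copy variant whose CQE is only posted once the
     * kernel no longer needs 'buf' */
    io_uring_prep_send(sqe, sock, buf, len, 0);
    io_uring_sqe_set_data(sqe, owner);  /* remember whose buffer this is */
    io_uring_submit(ring);      /* real code would batch submissions */
}

static void
reap_completions(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_peek_cqe(ring, &cqe) == 0)
    {
        /* under the imagined semantics, the buffer tagged via
         * user_data may now be reused or freed */
        void       *owner = io_uring_cqe_get_data(cqe);

        (void) owner;           /* return buffer to pool here */
        io_uring_cqe_seen(ring, cqe);
    }
}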
Greetings,
Andres Freund