COPY, bytea streaming and memory footprint

From: Jerome Wagner <jerome(dot)wagner(at)laposte(dot)net>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: COPY, bytea streaming and memory footprint
Date: 2020-06-10 12:52:02
Message-ID: CA+=V_fNkQyZFcUo2XEwFz95=9eaOUViyNcW0JbgiTdy8RsqcFA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

I am trying to understand/optimize how a COPY operation behaves when
transfering a bytea from the database to a client.

For simplicity, I'll consider that I have only one bytea _image_ in the
_images_ table.

Starting with
COPY (SELECT image FROM images) TO STDOUT BINARY

I understand that :
- the server will create a linear buffer on the server side holding the
whole image and then start sending it over the network in one big copyData
message chunked in 64KB network chunks
- the client can manage to extract this copyData payload by re-assembling
those chunks in memory or by streaming the relevant data parts of the
chunks elsewhere.

so the problem I see in a streaming situation is that the server actually
needs to buffer the whole image in memory.

Now the image is already compressed so if I
ALTER TABLE images ALTER image SET STORAGE EXTERNAL
I can use the fact that substring on non compressed toasted values will
fetch only the needed parts and do

COPY (
SELECT (
SELECT substring(image from n for 65536) from images)
FROM generate_series(1, (select length(image) from images), 65536) n
) TO STDOUT BINARY

As I understand it, this would be less memory intensive on the server side
if the server starts sending rows before all rows of the subselect are
built because it would only need to prepare a sequence of 65536 bytes long
buffers for the rows it would decide to have in memory.

but is there a way to know if such a COPY/SELECT statement will indeed
start sending rows before they are all prepared on the server ? Does it
depend on the request and is there a difference if I add an order by on
the select versus the natural order of the table ?
How many rows will be needed in memory before the sending begins ?

I hope my explanation was clear. I am looking for help in better
understanding how the server decides to stream the COPY data out of the
server vs the internal retrieval of the COPY'd subselect.

Thank you
Jérôme

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2020-06-10 12:58:34 Recording test runtimes with the buildfarm
Previous Message David Rowley 2020-06-10 12:33:45 Re: Parallel Seq Scan vs kernel read ahead