Re: [External] : Re: BUG #17005: Enhancement request: Improve walsender throughput by aggregating multiple messages in one send

From: Andres Freund <andres(at)anarazel(dot)de>
To: Rony Kurniawan <rony(dot)kurniawan(at)oracle(dot)com>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [External] : Re: BUG #17005: Enhancement request: Improve walsender throughput by aggregating multiple messages in one send
Date: 2021-05-17 18:54:39
Message-ID: 20210517185439.6s4s5xz572mufs2e@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

On 2021-05-17 11:19:31 -0700, Rony Kurniawan wrote:
> The networks that I tested were gigabits and docker (local). With
> TCP_NODELAY enabled, the only time small sends would be aggregated is by
> auto corking in tcp/ip when there is network congestion. But as you can see
> from the tcpdump output the messages are in individual packet therefore
> there is no aggregation and no network congestion.

I don't understand why "individual packets" implies that there can be
no network congestion. Or are you just saying that you didn't observe
any in the specific period traced?

I just verified this with iperf - I see large packets with
iperf -l 500 --nodelay -c $other_host
but not
iperf -b 10M -l 500 --nodelay -c $other_host

I had to remember how to disable TCP segmentation offloading to see
proper packet sizes in the first case; without that, there were a lot of
65226-byte packets...

> There is network overhead in both sender and receiver like tcp/ip header,
> number of skb, ethernet tx/rx descriptors, and interrupts.

Right.

> Also syscall overhead in pg_recvlogical where for one insert in the
> example requires 3 recv() calls to read BEGIN, INSERT, COMMIT messages
> instead of one recv() to read all three messages when Nagle's is
> enabled. This syscall overhead is the same in transaction case with
> multiple changes where each change is one recv().

I think the obvious and unproblematic improvement is to only send data
to the socket if WalSndWriteData's last_write parameter is set, or if
there's a certain amount of data in the socket. That'll only get rid of
some of the overhead, since we'd still send things like transactions
separately.
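
A minimal sketch of that first idea, to make the flush rule concrete. It
is not walsender code: AggBuffer, agg_write(), agg_flush() and both size
constants are hypothetical names and values, and the counters stand in
for the send() call a real sender would make.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names and sizes throughout; none of this is PostgreSQL code. */
#define AGG_BUF_SIZE        8192
#define AGG_FLUSH_THRESHOLD 4096    /* assumed "certain amount of data" */

typedef struct AggBuffer
{
    char    data[AGG_BUF_SIZE];
    size_t  used;           /* bytes buffered but not yet flushed */
    int     flush_count;    /* stands in for the number of send() calls */
    size_t  bytes_flushed;  /* total bytes handed to the "socket" */
} AggBuffer;

/* Hand everything buffered so far to the socket in one go. */
static void
agg_flush(AggBuffer *buf)
{
    assert(buf != NULL);
    if (buf->used == 0)
        return;
    /* A real sender would call send() on buf->data here. */
    buf->bytes_flushed += buf->used;
    buf->flush_count++;
    buf->used = 0;
}

/* Buffer one message; flush only on last_write or past the threshold. */
static void
agg_write(AggBuffer *buf, const char *msg, size_t len, bool last_write)
{
    assert(buf != NULL && msg != NULL);

    /* Make room first if the message does not fit behind buffered data. */
    if (len > AGG_BUF_SIZE - buf->used)
        agg_flush(buf);

    if (len > AGG_BUF_SIZE)
    {
        /* Oversized message: pass it through as its own send. */
        buf->bytes_flushed += len;
        buf->flush_count++;
    }
    else
    {
        memcpy(buf->data + buf->used, msg, len);
        buf->used += len;
    }

    if (last_write || buf->used >= AGG_FLUSH_THRESHOLD)
        agg_flush(buf);
}
```

With this rule, writing BEGIN and INSERT with last_write = false and then
COMMIT with last_write = true results in a single flush, which is the
case from the example upthread. Separate transactions still end up in
separate flushes.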

Another improvement might be that WalSndWriteData() possibly shouldn't
block even if pq_is_send_pending() and the pending amount isn't huge,
iff !last_write. That way we'd end up doing syscalls sending more data
at once.
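
Roughly, the decision could look like the sketch below. The function
name and the 64kB limit are assumptions made up for illustration;
pending_bytes stands for whatever amount pq_is_send_pending() would be
reporting on, which today is only a yes/no answer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed cutoff for "the pending amount isn't huge"; not a real setting. */
#define PENDING_LIMIT ((size_t) 64 * 1024)

/*
 * Hypothetical helper: should the writer wait for the socket to drain
 * before returning?
 */
static bool
should_block_for_flush(size_t pending_bytes, bool last_write)
{
    if (pending_bytes == 0)
        return false;           /* nothing pending, nothing to wait for */
    if (last_write)
        return true;            /* end of a batch: wait, as is done now */
    /* Mid-batch: keep accumulating unless a lot is already queued. */
    return pending_bytes >= PENDING_LIMIT;
}
```

The point of returning early mid-batch is that the next write lands
behind the still-pending data, so the eventual syscall carries more
bytes.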

Greetings,

Andres Freund
