Re: [External] : Re: BUG #17005: Enhancement request: Improve walsender throughput by aggregating multiple messages in one send

From: Rony Kurniawan <rony(dot)kurniawan(at)oracle(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [External] : Re: BUG #17005: Enhancement request: Improve walsender throughput by aggregating multiple messages in one send
Date: 2021-05-17 22:45:41
Message-ID: 5329e8fe-7c90-0a69-af97-0a4928a70b29@oracle.com
Lists: pgsql-bugs


On 5/17/2021 11:54 AM, Andres Freund wrote:
> Hi,
>
> On 2021-05-17 11:19:31 -0700, Rony Kurniawan wrote:
>> The networks that I tested were gigabits and docker (local). With
>> TCP_NODELAY enabled, the only time small sends would be aggregated is by
>> auto corking in tcp/ip when there is network congestion. But as you can see
>> from the tcpdump output the messages are in individual packet therefore
>> there is no aggregation and no network congestion.
> I don't understand why "individual packages" implies that there can be
> no network congestion? Or are you just saying that in the specific
> period traced you didn't observe that?

Since PostgreSQL enables TCP_NODELAY (Nagle's algorithm is off), it is
up to the kernel to aggregate those sends. With auto corking, that
happens only when the NIC still has outstanding packets in its tx queue,
either because of network congestion or because the NIC cannot keep up
with the number of send() calls made by the application.
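
For reference, TCP_NODELAY is a per-socket option: a nonzero value turns
Nagle's algorithm off, so small writes are handed to the network
immediately unless the kernel aggregates them on its own. A minimal
sketch in Python (not PostgreSQL code; the helper names open_tcp_socket
and nagle_enabled are made up for illustration):

```python
import socket


def open_tcp_socket(nodelay: bool) -> socket.socket:
    """Create an unconnected TCP socket with TCP_NODELAY set as requested."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if nodelay else 0)
    return sock


def nagle_enabled(sock: socket.socket) -> bool:
    """Nagle's algorithm is active exactly when TCP_NODELAY reads back as 0."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0
```

With nodelay=True, each small send() is eligible to leave as its own
segment, which is the situation described above.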

On gigabit Ethernet, the amount of data produced by the logical
replication server is not enough to trigger auto corking or any other
aggregation, hence one packet per message. Aggregation can still
happen occasionally, though.
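
To put a rough number on the cost of one packet per message, the sketch
below counts bytes on the wire for three messages sent separately versus
in a single packet. The payload sizes are invented for illustration, not
measured. The header sizes are the minimum Ethernet II, IPv4 and TCP
headers, so TCP options such as timestamps would add more. It also
assumes the combined payload fits in a single segment.

```python
ETH_HEADER = 14   # Ethernet II header, without preamble or FCS
IPV4_HEADER = 20  # minimum IPv4 header, no options
TCP_HEADER = 20   # minimum TCP header, no options

HEADERS_PER_PACKET = ETH_HEADER + IPV4_HEADER + TCP_HEADER


def wire_bytes(payload_sizes, aggregated):
    """Total bytes on the wire: payload plus one set of headers per packet."""
    packets = 1 if aggregated else len(payload_sizes)
    return sum(payload_sizes) + packets * HEADERS_PER_PACKET


# Hypothetical payload sizes for BEGIN, INSERT and COMMIT messages.
sizes = [30, 120, 30]
separate = wire_bytes(sizes, aggregated=False)
combined = wire_bytes(sizes, aggregated=True)
saved = separate - combined  # two sets of headers
```

The byte count is only part of the picture: each extra packet also costs
an skb, tx/rx descriptors and possibly an interrupt.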

In my bigger test case using pgbench to insert 20 records/transaction
for 1 minute, I see some bigger packets but they are mostly 629 bytes.

> I just verified this with iperf - I see large packets with
> iperf -l 500 --nodelay -c $other_host
> but not
> iperf -b 10M -l 500 --nodelay -c $other_host
>
> I had to remember how to disable tcp segmentation offloading to see
> proper package sizes in the first case, without there were a lot of
> 65226 byte sized packets in the first case...
>
>> There is network overhead in both sender and receiver like tcp/ip header,
>> number of skb, ethernet tx/rx descriptors, and interrupts.
> Right.
>
>
>> Also syscall overhead in pg_recvlogical where for one insert in the
>> example requires 3 recv() calls to read BEGIN, INSERT, COMMIT messages
>> instead of one recv() to read all three messages when Nagle's is
>> enabled. This syscall overhead is the same in transaction case with
>> multiple changes where each change is one recv().
> I think the obvious and unproblematic improvement is to only send data
> to the socket if WalSndWriteData's last_write parameter is set, or if
> there's a certain amount of data in the socket. That'll only get rid of
> some of the overhead, since we'd still send things like transactions
> separately.
>
> Another improvement might be that WalSndWriteData() possibly shouldn't
> block even if pq_is_send_pending() and the pending amount isn't huge,
> iff !last_write. That way we'd end up doing syscalls sending more data
> at once.
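
The first suggestion quoted above can be sketched outside of PostgreSQL
as a small buffering writer: data is appended to a buffer and handed to
the socket only when last_write is set or the buffer passes a threshold.
This is an illustration in Python, not walsender code. BatchingWriter,
flush_threshold and the send callback are hypothetical names, and the
choice of when to pass last_write=True is left to the caller.

```python
class BatchingWriter:
    """Buffer writes; call send() only on last_write or when the buffer is large."""

    def __init__(self, send, flush_threshold=8192):
        self._send = send                  # callable taking one bytes object
        self._threshold = flush_threshold  # assumed value, not from PostgreSQL
        self._buf = bytearray()

    def write(self, data: bytes, last_write: bool) -> None:
        self._buf += data
        if last_write or len(self._buf) >= self._threshold:
            self.flush()

    def flush(self) -> None:
        if self._buf:
            self._send(bytes(self._buf))
            self._buf.clear()


# Usage: three messages reach the "socket" as a single send.
sent = []
writer = BatchingWriter(sent.append, flush_threshold=1024)
writer.write(b"BEGIN", last_write=False)
writer.write(b"INSERT", last_write=False)
writer.write(b"COMMIT", last_write=True)
```

Here sent ends up holding one combined buffer, so a receiver could read
all three messages with a single recv() instead of three.
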

Thank you for looking into this,

Rony
