Re: [External] : Re: BUG #17005: Enhancement request: Improve walsender throughput by aggregating multiple messages in one send

From: Rony Kurniawan <rony(dot)kurniawan(at)oracle(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [External] : Re: BUG #17005: Enhancement request: Improve walsender throughput by aggregating multiple messages in one send
Date: 2021-05-17 18:19:31
Message-ID: 3f60a4c4-71df-bce0-d743-ec06ff1b08fe@oracle.com
Lists: pgsql-bugs

Hi,

On 5/17/2021 9:27 AM, Andres Freund wrote:
> Hi,
>
> On 2021-05-17 16:21:36 +0300, Yura Sokolov wrote:
>> Andres Freund wrote 2021-05-17 01:44:
>>> What kind of network is this? I would have expected that if the network
>>> can't keep up the small sends would end up getting aggregated into
>>> larger packets anyway? Are you hitting a PPS limit due to the small
>>> packages, but not yet the throughput limit?
>> I believe the reason is more in sys-call and kernel cpu time overhead than
>> in network throughput. Especially in this "after meltdown+spectre" time.
> Well, enabling Nagle's wouldn't change anything if the issue is just
> syscall and not network overhead. Yet the ask was to enable Nagle's...
>
> Greetings,
>
> Andres Freund

The networks that I tested were a gigabit network and Docker (local).
With TCP_NODELAY enabled, the only time small sends are aggregated is
by auto corking in TCP/IP when there is network congestion. But as you
can see from the tcpdump output, each message is in its own packet, so
there is no aggregation and no network congestion.

There is network overhead on both the sender and the receiver: TCP/IP
headers, the number of skbs, Ethernet tx/rx descriptors, and interrupts.
There is also syscall overhead in pg_recvlogical: one insert in the
example requires 3 recv() calls to read the BEGIN, INSERT, and COMMIT
messages, instead of one recv() that reads all three messages when
Nagle's is enabled. The syscall overhead is the same for a transaction
with multiple changes, where each change takes one recv().
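
To make the receive-side point concrete, here is a minimal sketch of how a reader could pull several messages out of a single buffer. It is not the PostgreSQL wire format and not what pg_recvlogical does: the 4-byte length prefix and the names frame() and split_frames() are hypothetical, chosen only to show that one recv() worth of data can carry BEGIN, INSERT, and COMMIT together.

```python
import struct
from typing import List


def frame(payload: bytes) -> bytes:
    """Prefix a payload with a 4-byte big-endian length (hypothetical framing)."""
    return struct.pack("!I", len(payload)) + payload


def split_frames(buffer: bytes) -> List[bytes]:
    """Return every complete frame found in one received buffer."""
    messages = []
    offset = 0
    while offset + 4 <= len(buffer):
        (length,) = struct.unpack_from("!I", buffer, offset)
        end = offset + 4 + length
        if end > len(buffer):
            break  # partial frame: a real reader would wait for more bytes
        messages.append(buffer[offset + 4:end])
        offset = end
    return messages


# The three messages of one transaction delivered as a single buffer,
# as one recv() could return them if the sender had coalesced them.
coalesced = frame(b"BEGIN") + frame(b"INSERT") + frame(b"COMMIT")
```

With coalesced data the reader pays for one read and then loops in user space; with one message per packet it pays for one read per message.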

I agree that in some cases low latency in replication is required, but
there are also cases where high throughput is desired, especially when
the standby server has fallen behind due to an outage and latency
doesn't matter.

I experimented by simply disabling TCP_NODELAY in
walsender.c:StartLogicalReplication() and the throughput went up by 60%.
This is just a proof of concept that some kind of message aggregation
would result in higher throughput.
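
For readers unfamiliar with the option, the sketch below shows what toggling TCP_NODELAY looks like on an ordinary TCP socket. It is not the change made in walsender.c, which is C and is not shown in this message; the helper names set_nagle() and nagle_enabled() are hypothetical. Only the standard setsockopt/getsockopt calls are used.

```python
import socket


def set_nagle(sock: socket.socket, enabled: bool) -> None:
    """Turn Nagle's algorithm on or off for a TCP socket.

    TCP_NODELAY=1 disables Nagle's algorithm (small sends go out at once);
    TCP_NODELAY=0 leaves it enabled (small sends may be coalesced).
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0 if enabled else 1)


def nagle_enabled(sock: socket.socket) -> bool:
    """Report whether Nagle's algorithm is currently enabled on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0
```

Calling set_nagle(sock, True) corresponds to the "TCP_NODELAY disabled" case in the experiment above, and set_nagle(sock, False) to the low-latency default described earlier in the thread.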

Thank you,

Rony
