Re: Flushing large data immediately in pqcomm

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Melih Mutlu <m(dot)melihmutlu(at)gmail(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Flushing large data immediately in pqcomm
Date: 2024-02-01 03:24:42
Message-ID: 20240201032442.v3vd52kzu3hynamf@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2024-01-31 14:57:35 -0500, Robert Haas wrote:
> > You're right and I'm open to doing more legwork. I'd also appreciate any
> > suggestion about how to test this properly and/or useful scenarios to
> > test. That would be really helpful.
>
> I think experimenting to see whether the long-short-long-short
> behavior that Heikki postulated emerges in practice would be a really
> good start.
>
> Another experiment that I think would be interesting is: suppose you
> create a patch that sends EVERY message without buffering and compare
> that to master. My naive expectation would be that this will lose if
> you pump short messages through that connection and win if you pump
> long messages through that connection. Is that true? If yes, at what
> point do we break even on performance? Does it depend on whether the
> connection is local or over a network? Does it depend on whether it's
> with or without SSL? Does it depend on Linux vs. Windows vs.
> whateverBSD? What happens if you twiddle the 8kB buffer size up or,
> say, down to just below the Ethernet frame size?

I feel like you're setting too high a bar for something that can be a pretty
clear improvement on its own, without a downside. The current behaviour is
pretty absurd; doing all this research across all platforms isn't going to
disprove that, and it's a lot of work. ISTM we can analyze this easily enough
without taking concrete hardware into account.

One thing that I haven't seen mentioned here that's relevant around using
small buffers: Postgres uses TCP_NODELAY, and has to do so. That means doing
tiny sends can hurt substantially.

> I think that what we really want to understand here is under what
> circumstances the extra layer of buffering is a win vs. being a loss.

It's quite easy to see that doing no buffering isn't viable - we end up with
tiny tiny TCP packets, one for each send(). And then there's the syscall
overhead.

Here's a quickly thrown together benchmark using netperf. First with -D, which
instructs it to use TCP_NODELAY, as we do.

10gbit network, remote host:

$ (fields="request_size,throughput"; echo "$fields";for i in $(seq 0 16); do s=$((2**$i));netperf -P0 -t TCP_STREAM -l1 -H alap5-10gbe -- -r $s,$s -D 1 -o "$fields";done)|column -t -s,

request_size  throughput (10^6 bits/s)
1             22.73
2             45.77
4             108.64
8             225.78
16            560.32
32            1035.61
64            2177.91
128           3604.71
256           5878.93
512           9334.70
1024          9031.13
2048          9405.35
4096          9334.60
8192          9275.33
16384         9406.29
32768         9385.52
65536         9399.40

localhost:
request_size  throughput (10^6 bits/s)
1             2.76
2             5.10
4             9.89
8             20.51
16            43.42
32            87.13
64            173.72
128           343.70
256           647.89
512           1328.79
1024          2550.14
2048          4998.06
4096          9482.06
8192          17130.76
16384         29048.02
32768         42106.33
65536         48579.95

I'm slightly baffled by the poor performance of localhost with tiny packet
sizes. Ah, I see - it's TCP_NODELAY. Without that:

localhost:
request_size  throughput (10^6 bits/s)
1             32.02
2             60.58
4             114.32
8             262.71
16            558.42
32            1053.66
64            2099.39
128           3815.60
256           6566.19
512           11751.79
1024          18976.11
2048          27222.99
4096          33838.07
8192          38219.60
16384         39146.37
32768         44784.98
65536         44214.70

NODELAY triggers many more context switches, because data is immediately
available to the receiving side. Whereas on a real network the interrupts get
coalesced.

I think that's pretty clear evidence that we need buffering. But I think we
can probably be smarter than we are right now, and than what's been proposed
in the patch. Because of TCP_NODELAY we shouldn't send a tiny buffer on its
own; it may trigger sending a small TCP packet, which is quite inefficient.

While not perfect - e.g. because networks might use jumbo packets / large MTUs,
and we don't know how many bytes are outstanding locally - I think a decent
heuristic could be to always try to send at least one packet's worth of data
at once (something like ~1400 bytes), even if that requires copying some of
the input data. That buffer might not go out as a single packet on its own,
but it should make it reasonably unlikely to end up with tiny tiny packets.

Greetings,

Andres Freund
