Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: lars(at)greiz-reinsdorf(dot)de
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18210: libpq: PQputCopyData sometimes fails in non-blocking mode over GSSAPI encrypted connection
Date: 2023-11-22 19:19:06
Message-ID: 2199742.1700680746@sss.pgh.pa.us
Lists: pgsql-bugs

I wrote:
> PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
>> The error "GSSAPI caller failed to retransmit all data needing to be
>> retried" is raised here:
>> https://github.com/postgres/postgres/blob/eeb0ebad79d9350305d9e111fbac76e20fa4b2fe/src/interfaces/libpq/fe-secure-gssapi.c#L110
>> It happens only in non-blocking mode over GSSAPI encrypted connections. It
>> isn't reliable and depends on the network timing. When sending a 7MB file in
>> alternating pieces of 27KB and 180 Byte per PQputCopyData() there is a 50%
>> chance to get the failure over the local network. It doesn't happen if TLS
>> is used instead.

> A repro script would be really really helpful here.

After consuming more caffeine, I was able to repro it with the
not-intended-for-commit hack in 0001 attached. (On my machine,
the test just hangs up upon failing, because the hacked-up
logic in copy.c doesn't cope very well with the failure. I don't
think that is copy.c's fault though; it's just an incomplete hack.)

I concur with the conclusion that it's really pqPutMsgEnd's fault.
By deciding not to send the last partial block that's in the
outBuffer, it runs the risk of not presenting some data that it did
present the last time, and that can trigger the "failed to retransmit
all data" error. This happens because GSS's gss_MaxPktSize is a bit
less than 16K (it's 16320 on my machine). So if we initially present
24K of data (3 blocks), and pg_GSS_write successfully encrypts and
sends one packet of data, then it will encrypt all the rest. But if
its second pqsecure_raw_write call fails with EINTR, it will return
with bytes_sent = 16320 (and PqGSSSendConsumed = 8256), causing us to
reduce the outBuffer contents to 8256 bytes plus whatever partial
block we didn't try to send. If we don't fill outBuffer to at least
16K before trying again, we'll try to send just 8192 bytes, which is
fewer than the 8256 that pg_GSS_write has already consumed, and
kaboom. (This is why the alternating-long-and-short-lines business
is important.)
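
The arithmetic in the paragraph above can be checked with a small toy model. This is only a sketch and is not libpq code: the names WriteResult, after_one_packet, whole_blocks and retry_is_acceptable are hypothetical, the 8192-byte block rounding is inferred from the "partial block" wording in the message, and the 180-byte leftover is borrowed from the reporter's description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of what the caller learns from one write attempt. */
typedef struct
{
    size_t bytes_sent;      /* byte count reported back to the caller */
    size_t still_consumed;  /* input encrypted but not yet reported as sent */
} WriteResult;

/*
 * Models the scenario in the message: the first packet (max_pkt input
 * bytes) goes out, the rest of the input is encrypted, and then the
 * raw write of that second packet fails.
 */
WriteResult
after_one_packet(size_t presented, size_t max_pkt)
{
    WriteResult r;

    assert(presented >= max_pkt);
    r.bytes_sent = max_pkt;
    r.still_consumed = presented - max_pkt;
    return r;
}

/* Assumed caller behavior: only whole blocks of the buffer are sent. */
size_t
whole_blocks(size_t out_count, size_t block)
{
    assert(block > 0);
    return out_count - (out_count % block);
}

/*
 * The retry is only acceptable if the caller presents at least as much
 * data as was already consumed. Returns 1 if so, 0 otherwise.
 */
int
retry_is_acceptable(size_t presented_now, size_t still_consumed)
{
    return presented_now >= still_consumed;
}
```

With the numbers from the message, after_one_packet(24576, 16320) gives bytes_sent = 16320 and still_consumed = 8256. If the buffer then holds those 8256 bytes plus a 180-byte leftover, whole_blocks(8436, 8192) is 8192, and retry_is_acceptable(8192, 8256) is 0, which corresponds to the "failed to retransmit all data" case. Once the buffer reaches 16K or more, the rounded-down amount is at least 16384 and the check passes.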

The quick hack in 0002 attached fixes it, but I can't say that
I like this solution: it's propagating a bit of ugliness that
ought to be localized in pg_GSS_write out to callers.

I wonder if we should drop the idea of returning a positive bytecount
after a partial write, and just return the pqsecure_raw_write result,
and not reset PqGSSSendConsumed until we write everything presented.
In edge cases maybe that would result in some buffer bloat, but it
doesn't seem worse than what happens when the very first
pqsecure_raw_write returns EINTR.

In any case, the backend needs a look to see whether it requires a
similar fix. We don't do nonblock mode there, but I don't think
that means we can never get EINTR.

regards, tom lane

Attachment                              Content-Type  Size
0001-hacky-test-case.patch              text/x-diff   2.5 KB
0002-fix-gssapi-chunking-failure.patch  text/x-diff   963 bytes
