Re: can we mark upper/lower/textlike functions leakproof?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: can we mark upper/lower/textlike functions leakproof?
Date: 2024-07-31 17:39:45
Message-ID: CA+Tgmob78LN+Tkgms=dQyBsLyGWBuehv7_kEvBfjwmj6fDsr7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 31, 2024 at 9:14 AM Joe Conway <mail(at)joeconway(dot)com> wrote:
> In my opinion, for this use case and others, it should be possible to
> redact the values substituted into log messages based on some criteria.
> One of those criteria could be "I am in a leakproof call right now". In
> fact in a similar fashion, an extension ought to be able to mutate the
> log message based on the entire string, e.g. when "ALTER
> ROLE...PASSWORD..." is spotted I would like to be able to redact
> everything after "PASSWORD".

This might be helpful, and unfortunately I'm all too familiar with the
ALTER ROLE...PASSWORD example, but I think it's not really related to
the question of whether we can mark upper() and lower() leakproof.

If there are some inputs that cause upper() and lower() to fail and
others that do not, the functions aren't leakproof, because an
attacker can extract information about values that they can't see by
feeding those values into these functions and seeing whether they get
a failure or not. It doesn't matter what error message is produced;
the fact that the function throws an error of any kind for some input
values but enough for others is enough to make it unsafe, and it seems
to me that we've repeatedly found that there's often a way to turn
even what seems like a small leak into a very large one, so we need to
be quite careful here.

Annoyingly, we have a WHOLE BUNCH of different code paths that need to
be assessed individually here. I think we can ignore the fact that
upper() can throw errors when it's unclear which collation to use; I
think that's a property of the query string rather than the input
value. It's a bit less clear whether we can ignore out of memory
conditions, but my judgement is that a possible OOM from a small
allocation is not generally going to be useful as an attack vector. If
a long string produces an error and a short one doesn't, that might
qualify as a real leak. And other errors that depend on the input
value are also leaks. So let's go through the code paths:

- When lc_ctype_is_c(), we call asc_toupper(), which calls
pg_ascii_toupper(), which just replaces a-z with A-Z. No errors.

- When mylocale->provider == COLLPROVIDER_ICU, we call icu_to_uchar()
and icu_convert_case(). icu_to_uchar() calls uchar_length(), which has
this:

if (U_FAILURE(status) && status != U_BUFFER_OVERFLOW_ERROR)
ereport(ERROR,
(errmsg("%s failed: %s",
"ucnv_toUChars", u_errorName(status))));

I don't know what errors can be reported here, or if this ever fails
in practice, so I can't prove it never fails consistently for some
particular input string. uchar_convert() and icu_convert_case() have
similar error reporting.

- When mylocale->provider == COLLPROVIDER_BUILTIN, we call
unicode_strupper() which calls unicode_to_utf8() which clearly throws
no errors. Likewise unicode_utf8len() does not error. I don't see how
we can get an error out of this path.

- In cases not covered by the above, we take different paths depending
on whether a multi-byte encoding is in use. In single-byte encodings,
we rely on either pg_toupper() or toupper_l(). The latter is an
OS-provided function so can't ereport(), and the former calls either
does the work itself or calls toupper() which again is an OS-provided
function and can't report().

- Finally, when mylocale == NULL or the provider is COLLPROVIDER_LIBC
and the encoding is multi-byte, we use char2wchar(), then towupper_l()
or towupper(), then wchar2char(). The whole thing can fall apart if
the string is too long, which might be enough to declare this
leakproof but it depends on whether the error guarded by /* Overflow
paranoia */ is really just paranoia or whether it's actually
reachable. Otherwise, towupper*() won't ereport because it's not part
of PG, so we need to assess char2wchar() and wchar2char(). Here I note
that char2wchar() claims that it does an ereport() if the input is
invalidly encoded. This is kind of surprising when you think about it,
because our usual policy is not to allow invalidly encoded data into
the database in the first place, but whoever wrote this thought it
would be possible to hit this case "if LC_CTYPE locale is different
from the database encoding." But it seems that the logic here dates to
commit 2ab0796d7a3a7116a79b65531fd33f1548514b52 back in 2011, so it
seems at least possible to me that we've tightened things up enough
since then that you can't actually hit this any more. But then again,
maybe not.

So in summary, I think upper() is ... pretty close to leakproof. But
if ICU sometimes fails on certain strings, then it isn't. And if the
multi-byte libc path can be made to fail reliably either with really
long strings or with certain choices of the LC_CTYPE locale, then it
isn't.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Christophe Pettus 2024-07-31 17:55:33 Re: CI, macports, darwin version problems
Previous Message Andrew Dunstan 2024-07-31 16:58:37 Re: Recent 027_streaming_regress.pl hangs