Re: [PoC] Federated Authn/z with OAUTHBEARER

From: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
To: Daniel Gustafsson <daniel(at)yesql(dot)se>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PoC] Federated Authn/z with OAUTHBEARER
Date: 2024-09-16 19:13:28
Message-ID: CAOYmi+mhKa96y7pbpKEOQJpPB=ekgjJY=OrRtDwiOBxrnkrQBg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 11, 2024 at 3:54 PM Jacob Champion
<jacob(dot)champion(at)enterprisedb(dot)com> wrote:
> Yeah, and I still owe you all an updated roadmap.

Okay, here goes. New reviewers: start here!

== What is This? ==

OAuth 2.0 is a way for a trusted third party (a "provider") to tell a
server whether a client on the other end of the line is allowed to do
something. This patchset adds OAuth support to libpq with libcurl,
provides a server-side API so that extension modules can add support
for specific OAuth providers, and extends our SASL support to carry
the OAuth access tokens over the OAUTHBEARER mechanism.

Most OAuth clients use a web browser to perform the third-party
handshake. (These are your "Okta logins", "sign in with XXX", etc.)
But there are plenty of people who use psql without a local browser,
and invoking a browser safely across all supported platforms is
surprisingly fraught. So this patchset implements something
called device authorization, where the client will display a link and
a code, and then you can log in on whatever device is convenient for
you. Once you've told your provider that you trust libpq to connect to
Postgres on your behalf, it'll give libpq an access token, and libpq
will forward that on to the server.
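
Concretely, this is the OAuth device authorization grant (RFC 8628). The
shape of the exchange looks roughly like this; the endpoints and values are
illustrative, not what libpq literally sends:

```
POST /device_authorization            (libpq -> provider)
  -> { "device_code": "...",
       "user_code": "ABCD-EFGH",
       "verification_uri": "https://example.com/device",
       "interval": 5 }

(user visits the URI on any convenient device and enters the code)

POST /token                           (libpq polls every `interval` seconds)
  grant_type=urn:ietf:params:oauth:grant-type:device_code
  -> { "access_token": "...", "token_type": "Bearer", ... }
```

Once the poll succeeds, the access token is what gets carried to the server
over OAUTHBEARER.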

== How This Fits, or: The Sales Pitch ==

The most popular third-party auth methods we have today are probably
the Kerberos family (AD/GSS/SSPI) and LDAP. If you're not already in
an MS ecosystem, it's unlikely that you're using the former. And users
of the latter are, in my experience, more-or-less resigned to its use,
in spite of LDAP's architectural security problems and the fact that
you have to run weird synchronization scripts to tell Postgres what
certain users are allowed to do.

OAuth provides a decently mature and widely-deployed third option. You
don't have to be running the infrastructure yourself, as long as you
have a provider you trust. If you are running your own infrastructure
(or if your provider is configurable), the tokens being passed around
can carry org-specific user privileges, so that Postgres can figure
out who's allowed to do what without out-of-band synchronization
scripts. And those access tokens are a straight upgrade over
passwords: even if they're somehow stolen, they are time-limited, they
are optionally revocable, and they can be scoped to specific actions.

== Extension Points ==

This patchset provides several points of customization:

Server-side validation is farmed out entirely to an extension, which
we do not provide. (Each OAuth provider is free to come up with its
own proprietary method of verifying its access tokens, and so far the
big players have absolutely not standardized.) Depending on the
provider, the extension may need to contact an external server to see
what the token has been authorized to do, or it may be able to do that
offline using signing keys and an agreed-upon token format.

The client driver using libpq may replace the device authorization
prompt (which by default is done on standard error), for example to
move it into an existing GUI, display a scannable QR code instead of a
link, and so on.

The driver may also replace the entire OAuth flow. For example, a
client that already interacts with browsers may be able to use one of
the more standard web-based methods to get an access token. And
clients attached to a service rather than an end user could use a more
straightforward server-to-server flow, with pre-established
credentials.

== Architecture ==

The client needs to speak HTTP, which is implemented entirely with
libcurl. Originally, I used another OAuth library for rapid
prototyping, but the quality just wasn't there, so I ported the
implementation to use libcurl directly. An internal abstraction layer
remains in the libpq code, so if a better client library comes along,
switching to it
shouldn't be too painful.

The client-side hooks all go through a single extension point, so that
we don't continually add entry points in the API for each new piece of
authentication data that a driver may be able to provide. If we wanted
to, we could potentially move the existing SSL passphrase hook into
that, or even handle password retries within libpq itself, but I don't
see any burning reason to do that now.

I wanted to make sure that OAuth could be dropped into existing
deployments without driver changes. (Drivers will probably *want* to
look at the extension hooks for better UX, but they shouldn't
necessarily *have* to.) That has driven several parts of the design.

Drivers using the async APIs should continue to work without blocking,
even during the long HTTP handshakes. So the new client code is
structured as a typical event-driven state machine (similar to
PQconnectPoll). The protocol machine hands off control to the OAuth
machine during authentication, without really needing to know how it
works, because the OAuth machine replaces the PQsocket with a
general-purpose multiplexer that handles all of the HTTP sockets and
events. Once that's completed, the OAuth machine hands control right
back and we return to the Postgres protocol on the wire.

This decision led to a major compromise: Windows client support is
nonexistent. Multiplexer handles exist in Windows (for example with
WSAEventSelect, IIUC), but last I checked they were completely
incompatible with Winsock select(), which means existing async-aware
drivers would fail. We could compromise by providing synchronous-only
support, or by cobbling together a socketpair plus thread pool (or
IOCP?), or simply by saying that existing Windows clients need a new
API other than PQsocket() to be able to work properly. None of those
approaches have been attempted yet, though.

== Areas of Concern ==

Here are the iffy things that a committer is signing up for:

The client implementation is roughly 3k lines, requiring domain
knowledge of Curl, HTTP, JSON, and OAuth, the specifications of which
are spread across several separate standards bodies. (And some big
providers ignore those anyway.)

The OAUTHBEARER mechanism is extensible, but not in the same way as
HTTP. So sometimes, it looks like people design new OAuth features
that rely heavily on HTTP and forget to "port" them over to SASL. That
may be a point of future frustration.

C is not really anyone's preferred language for implementing an
extensible authn/z protocol running on top of HTTP, and constant
vigilance is going to be required to maintain safety. What's more, we
don't really "trust" the endpoints we're talking to in the same way
that we normally trust our servers. It's a fairly hostile environment
for maintainers.

Along the same lines, our JSON implementation assumes some level of
trust in the JSON data -- which is true for the backend, and can be
assumed for a DBA running our utilities, but is absolutely not the
case for a libpq client downloading data from Some Server on the
Internet. I've been working to fuzz the implementation and there are a
few known problems registered in the CF already.

Curl is not a lightweight dependency by any means. Typically, libcurl
is configured with a wide variety of nice options, a tiny subset of
which we're actually going to use, but all that code (and its
transitive dependencies!) is going to arrive in our process anyway.
That might not be a lot of fun if you're not using OAuth.

It's possible that the application embedding libpq is also a direct
client of libcurl. We need to make sure we're not stomping on their
toes at any point.

== TODOs/Known Issues ==

The client does not deal with verification failure well at the moment;
it just keeps retrying with a new OAuth handshake.

Some people are not going to be okay with just contacting any web
server that Postgres tells them to. There's a more paranoid mode
sketched out that lets the connection string specify the trusted
issuer, but it's not complete.

The new code still needs to play well with orthogonal connection
options, like connect_timeout and require_auth.

The server does not deal well with multi-issuer setups yet. And you
only get one oauth_validator_library...

Harden, harden, harden. There are still a handful of inline TODOs
around double-checking certain pieces of the response before
continuing with the handshake. Servers should not be able to run our
recursive descent parser out of stack. And my JSON code is using
assertions too liberally, which will turn bugs into DoS vectors. I've
been working to fit a fuzzer into more and more places, and I'm hoping
to eventually drive it directly from the socket.

Documentation still needs to be filled in. (Thanks Daniel for your work here!)

== Future Features ==

There is no support for token caching (refresh or otherwise). Each new
connection needs a new approval, and the only way to change that for
v1 is to replace the entire flow. I think that's eventually going to
annoy someone. The question is, where do you persist it? Does that
need to be another extensibility point?

We already have pretty good support for client certificates, and it'd
be great if we could bind our tokens to those. That way, even if you
somehow steal the tokens, you can't do anything with them without the
private key! But the state of proof-of-possession in OAuth is an
absolute mess, involving at least three competing standards (Token
Binding, mTLS, DPoP). I don't know what's going to win.

--

Hope this helps! Next I'll be working to fold the patches together, as
discussed upthread.

Thanks,
--Jacob
