Proposal to add a new URL data type.

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Proposal to add a new URL data type.
Date: 2024-12-05 14:01:12
Message-ID: bc059d70-48d2-4afa-87f8-bdf7622736fe@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi everyone,

My name is Alexander Borisov, I want to propose/discuss
adding a new URL type as an extension in PostgreSQL contrib.
I think everyone here knows/seen what URL/URI and their basics.
If someone is interested in the basics, you can read the original
RFC 1738 [1].

For what, who?

Postgres users often store URLs in the database. As an example, they
provide links to their pages on the web, analyze users posts and get
links for further storage and analysis. Naturally, there is a need to
compare, sort, build indexes, get statistics on individual parts of
the URL: hosts, protocols, path and so on.

Adding a new URL type will standardize the work with URLs in Postgres
and provide simple tools for manipulating the new type.

URL in the world

URL is one of the basic concepts of the web, it makes it possible to
refer to certain resources on the web (and not only). At the moment
there are two main directions in specifications:

RFC: RFC 3986 [2] by IETF.
WHATWG: WHATWG URL [3], WHATWG community by Apple, Google, Mozilla,
Microsoft. From the developers of modern browsers.

It's worth noting that the WHATWG URL is a living standard (like HTML).
It changes, improves, fixes bugs, but it doesn't break backwards
compatibility.

What is the main difference between WHATWG and RFC 3986?

There is a difference, and it's not that significant, unless host
parsing. To begin with, let's define that a full URL has the format:

Format: [scheme][divider][user:pass][host][port][path][query][fragment]
Example: https://root:qwerty(at)example(dot)com:8080/path/to?abc=xyz#fragment

Now we can look at SOME of the differences in parsing:

[user:password]
Source: https://root:qwerty(at)1234@example.com/
RFC 3986: user: root, password: qwerty.
Parses until it encounters the first '@' character.
WHATWG: user: root, password: qwerty(at)1234(dot)
Parses until it encounters the last '@' character.

[host]
Source: https://exаmple.com/ (а — U+0430)
RFC 3986: https://exаmple.com/.
Considers that the host is already Punycode encoded.
WHATWG: https://xn--exmple-4nf.com/.
Uses Unicode TR46 (IDNA, NFC normalization, convert non-ASCII
characters into Punycode, validation and so on).

[path]
Source: https://example.com/a/./b/../c
RFC 3986: https://example.com/a/./b/../c.
Cannot contain spaces (U+0020). U+005C Reverse solidus (\) is not
treated as U+002F Forward slash (/).
WHATWG: https://example.com/a/c.
Whitespace is encoded in %20 (URL-encode). Reverse solidus (\) is
treated as Forward slash (/).

We can see that there are differences, and the most significant ones
are related to host parsing and path normalization.

Let's take a look at what specifications popular programming languages
and applications rely on:

Node.js: WHATWG URL (standart module, node:url).
Before that, RFC 3986 was used, now it's a legacy API.
Python 3: RFC 3986 (standart library, urllib).
JAVA: RFC 3986 (standart class, java.net.URL).
PHP: In a future version 8.5, want to implement Uri\WhatWgUri,
Uri\Rfc3986Uri.
C++: WHATWG URL. ADA URL Parser is a popular one.
Rust: WHATWG URL (standart crate, url).
WebKit, Blink, Gecko: WHATWG URL.
curl: RFC 3986.

I could argue that modern and popular programming languages choose the
WHATWG specification. PHP thought to implement two approaches at once.
Browser engines understandably use the WHATWG specification, but they
write the specification themselves.

Proposal

I propose to add a new data type for PostgreSQL as an extension, in
contrib. Name the new type URL and use the WHATWG URL specification to
implement the new type. The choice of URL parsing specification is
justified by the following factors:
1. Live specification, adopts to modern realities.
2. The developers of modern browsers are behind the specification.
3. Increasing popularity.

Implementation

Attached to this email is a patch that contains:
1. URL type implementation with functions to get and modify individual
parts of the URL.
2. Code to implement URL parsing according to WHATWG specification.
Don't be alarmed, it has a different style than Postgres.

This is TEST THE WATER, for the opportunity to visualize, try, discuss.

I should note that the code of URL parsing implementation is taken from
my project (written by me, open source), so it differs in style from
the one adopted in Postgres.
Initially the code has its own implementation of Unicode TR46, as well
as its own encode/decode implementation for encodings (40 encodings).
But in order not to scare the audience with the size of the patch
I decided to use ICU in the current version.

You can read more about the implemented functions in README and url.c.

Final implementation

I see the final implementation in one of the options:

Option one
Rewrite URL parsing for Postgres style and functions. Write an
implementation of Unicode TR46 (so as not to use ICU IDNA).
This is not a small task, but not rocket science.

Option two
Rewrite URL parsing for Postgres style and functions.
Use ICU Unicode TR46.

Option three
Use the code as it is, use ICU Unicode TR46.
The simplest task.

From myself

I am willing to take care of the implementation of the new data type
and its further support. If it's of interest to the community.

[1] https://datatracker.ietf.org/doc/html/rfc1738
[2] https://datatracker.ietf.org/doc/html/rfc3986
[3] https://url.spec.whatwg.org/

--
Alexander Borisov

Attachment Content-Type Size
0001-Add-url-data-type-to-contrib.patch text/plain 379.4 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2024-12-05 14:07:12 Re: generic plans and "initial" pruning
Previous Message Tomas Vondra 2024-12-05 13:53:09 Re: generic plans and "initial" pruning