Re: Proposal to add a new URL data type.

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal to add a new URL data type.
Date: 2024-12-06 12:59:48
Message-ID: 79547330-ab46-4886-9fdf-cf7fa526c876@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

05.12.2024 17:59, Peter Eisentraut пишет:
> On 05.12.24 15:01, Alexander Borisov wrote:
>> Postgres users often store URLs in the database.  As an example, they
>> provide links to their pages on the web, analyze users posts and get
>> links for further storage and analysis.  Naturally, there is a need to
>> compare, sort, build indexes, get statistics on individual parts of
>> the URL: hosts, protocols, path and so on.
>>
>> Adding a new URL type will standardize the work with URLs in Postgres
>> and provide simple tools for manipulating the new type.
>
> Have you seen this: https://github.com/petere/pguri ?
>
> The difference there is that it uses an external library for parsing the
> URLs, which makes the code much smaller.  Overall, the functionality
> looks pretty similar to yours at first glance.
>

Hi Peter,

I looked at your implementation of the URI extension. You are using
a third party library uriparser for URI/URL parsing. The uriparser
library is based on the RFC 3986 specification, which I cite in
comparison to WHATWG in my email. The name of the functions to get the
individual parts of the URL will naturally be the same, that's how
URLs/URIs work. But you have to look at the point and perspective here.

As I've written before, there is a difference between parsing URLs
according to the RFC 3986 specification and WHATWG URLs. This is
especially true for host. Here are a couple more examples.

The specifications describe character encoding differently, this
applies to userinfo (username/password), path, query, fragment.
RFC 3986 standard encodes characters in the same way everywhere.
The characters to encode are: < > “ ` \r \n \t { } | \ ^ '.
The WHATWG URL standard takes a more selective and subtle approach.
For example, path may contain the character “|”, but userinfo does
not (encoded).

The WHATWG specification also requires that tabs and newlines be removed
from URLs before parsing. The WHATWG character encoding gradation can
be found in the specification [1].

In addition to functions to retrieve individual parts, the WHATWG URL
standard describes an API for changing them: scheme, userinfo, host,
hostname, port, path, query, fragment. There is not just one value is
replaced by another, there is a certain logic, which is not always
obvious. For example, try to replace scheme, let's take the URL for
example: https://example.com/.
This URL contains special scheme (any others are not special), there
are only six of them: ftp, file, http, https, ws, wss. And it is
impossible to replace scepial scheme with a non-special one. More
precisely, the URL will be returned with special scheme, i.e. without
changes. This is how you can check it with the patch I have given:

Example: select url_scheme_set('https://example.com/'::url, 'wss');
Result: wss://example.com/

Example: select url_scheme_set('https://example.com/'::url, 'myown');
Result: https://example.com/

Example: select url_scheme_set('best://example.com/'::url, 'myown');
Result: myown://example.com/

Example: select url_scheme_set('best://example.com/'::url, 'https');
Result: best://example.com/

In addition, WHATWG validates URLs during parsing and reports
non-critical errors [2]. If such errors occur, parsing continues.
However, in my implementation I don't output these errors, I just
haven't figured out how to do it correctly in SQL (as NOTICE?).

Without going further into the differences in specifications I could
say simply - RFC 3986 is obsolete, for example, node.js has labeled
the API with it as Legacy (they use WHATWG).

If we abstract from specifications and consider our approaches in
implementation. You parse the URL every time for any request
(even to retrieve fragments). In my implementation I proceed from
the fact that requests to read URLs will significantly exceed their
changes. Parsing is done once on input, the result is saved in
a special format and later the necessary parts of the URL
(or the whole URL) are retrieved. Also please note that there are
no dependencies on third-party libraries (ICU dependencies can also
be fought off).

All currently available functions and examples can be seen in
the README file, in the patch.

[1] https://url.spec.whatwg.org/#c0-control-percent-encode-set
[2] https://url.spec.whatwg.org/#writing

--
Alexander Borisov

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-12-06 13:46:01 Re: why there is not VACUUM FULL CONCURRENTLY?
Previous Message Ashutosh Bapat 2024-12-06 12:49:36 Re: logical replication: patch to ensure timely cleanup of aborted transactions in ReorderBuffer