Re: [PATCH] Add CANONICAL option to xmlserialize

From: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, vignesh C <vignesh21(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Vik Fearing <vik(at)postgresfriends(dot)org>
Subject: Re: [PATCH] Add CANONICAL option to xmlserialize
Date: 2024-08-26 11:28:22
Message-ID: 8e904363-586c-4de9-9763-f9d8362f8306@uni-muenster.de
Lists: pgsql-hackers

On 26.08.24 12:30, Pavel Stehule wrote:
> I think so there should be specified the target of CANONICAL - it is a
> partial replacement of NO INDENT or it produces format  just for
> comparing? The CANONICAL format is not probably extra standardized,
> because libxml2 removes indenting, but examples in
> https://www.w3.org/TR/xml-c14n11/ doesn't do it. So this format makes
> sense just for local operations.
My idea with CANONICAL was not to replace NO INDENT. The intent was to
format xml strings in a standardized way, so that they can be compared:
for instance, by removing comments, sorting attributes, converting CDATA
sections, converting empty elements to start-end tag pairs, removing
whitespace between elements, etc.

The W3C recommendation for Canonical XML[1] states the following
regarding the removal of whitespace between elements:

* Whitespace outside of the document element and within start and end
tags is normalized
* All whitespace in character content is retained (excluding characters
removed during line feed normalization)
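
To make these two rules concrete, here is a minimal sketch using
canonicalize() from Python's standard library. Note the assumptions: it
implements C14N 2.0, not Canonical XML 1.1, and it is neither libxml2
nor the code in this patch, so it only illustrates the rules and says
nothing about what XMLSERIALIZE ... CANONICAL would return. The sample
documents are made up for the illustration.

```python
from xml.etree.ElementTree import canonicalize

# Same document, once with sloppy spacing and indentation, once compact.
indented = '<doc   a="1"  >\n  <e  />\n</doc>'
compact = '<doc a="1"><e/></doc>'

c_indented = canonicalize(indented)
c_compact = canonicalize(compact)

# Whitespace inside the start tags is normalized, but the line breaks and
# indentation between elements are character content and are retained.
print(repr(c_indented))  # '<doc a="1">\n  <e></e>\n</doc>'
print(repr(c_compact))   # '<doc a="1"><e></e></doc>'
print(c_indented == c_compact)  # False

# strip_text=True is an option of this Python function (not part of the
# 1.1 rules quoted above) that drops the whitespace-only text.
stripped = canonicalize(indented, strip_text=True)
print(stripped == c_compact)  # True
```

In this sketch, the two inputs only compare equal once whitespace-only
text is stripped explicitly.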

>
> I like this functionality, and it is great so the functionality from
> libxml2 can be used, but I think, so the fact that there are  four not
> compatible implementations of xmlserialize is messy. Can be nice, if 
> we find some intersection between SQL/XML, Oracle  instead of new
> proprietary syntax. 
>
> In Oracle syntax the CANONICAL is +/- NO INDENT SHOW DEFAULT ?

No.
XMLSERIALIZE ... NO INDENT is supposed, as the name suggests, to
serialize an xml string without indenting it. One could argue that "not
indenting" also means removing existing indentation, but I couldn't find
anything concrete about this in the SQL/XML spec. If that is indeed the
case, we should correct XMLSERIALIZE ... NO INDENT, but that is unrelated
to this patch.

CANONICAL serializes a physical representation of an xml document. In a
nutshell, XMLSERIALIZE ... CANONICAL sort of "rewrites" the xml string
with the following rules (list from the W3C recommendation):

* The document is encoded in UTF-8
* Line breaks normalized to #xA on input, before parsing
* Attribute values are normalized, as if by a validating processor
* Character and parsed entity references are replaced
* CDATA sections are replaced with their character content
* The XML declaration and document type declaration are removed
* Empty elements are converted to start-end tag pairs
* Whitespace outside of the document element and within start and end
tags is normalized
* All whitespace in character content is retained (excluding characters
removed during line feed normalization)
* Attribute value delimiters are set to quotation marks (double quotes)
* Special characters in attribute values and character content are
replaced by character references
* Superfluous namespace declarations are removed from each element
* Default attributes are added to each element
* Fixup of xml:base attributes [C14N-Issues] is performed
* Lexicographic order is imposed on the namespace declarations and
attributes of each element
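
A few of the rules above can be seen in a small sketch with
canonicalize() from Python's standard library. It implements C14N 2.0
rather than Canonical XML 1.1 and is not the libxml2 code path used by
the patch, so treat it as an illustration only; the two sample
documents are invented.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations of the same content. variant_a has an XML
# declaration, unsorted single-quoted attributes, a comment, an empty
# element tag and a CDATA section.
variant_a = (
    '<?xml version="1.0"?>'
    "<doc b='2' a='1'><!-- note --><e/><![CDATA[x < y]]></doc>"
)
variant_b = '<doc a="1" b="2"><e></e>x &lt; y</doc>'

canon_a = canonicalize(variant_a)
canon_b = canonicalize(variant_b)

print(canon_a)             # <doc a="1" b="2"><e></e>x &lt; y</doc>
print(canon_a == canon_b)  # True

# Comments are dropped by default; with_comments=True keeps them.
canon_with_comments = canonicalize(variant_a, with_comments=True)
print(canon_with_comments)
```

Once both strings are canonicalized, a plain string comparison is
enough to tell that they carry the same content.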

btw: Oracle's SIZE =, HIDE DEFAULTS, and SHOW DEFAULTS are not part of
the SQL/XML standard either :)

> My objection against CANONICAL so SQL/XML and Oracle allows to
> parametrize XMLSERIALIZE more precious and before implementing new
> feature, we should to clean table and say, what we want to have in
> XMLSERIALIZE.
>
> An alternative of enhancing of XMLSERIALIZE I can imagine just
> function "to_canonical(xml, without_comments bool default false)". In
> this case we don't need to solve relations against SQL/XML or Oracle.

Creating a separate serialization function would IMHO be far less
elegant than parametrizing XMLSERIALIZE, but it is something I could
live with if we decide to go down this path.

Thanks!

--
Jim

[1] https://www.w3.org/TR/xml-c14n11/
