Re: [PATCH] Add CANONICAL option to xmlserialize

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, vignesh C <vignesh21(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Vik Fearing <vik(at)postgresfriends(dot)org>
Subject: Re: [PATCH] Add CANONICAL option to xmlserialize
Date: 2024-08-26 12:15:56
Message-ID: CAFj8pRAjy1ahmKNUQYYegj2_gsfEhy75hEN04-WurxanXec85g@mail.gmail.com
Lists: pgsql-hackers

On Mon, 26 Aug 2024 at 13:28, Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
wrote:

>
>
> On 26.08.24 12:30, Pavel Stehule wrote:
> > I think the purpose of CANONICAL should be specified: is it a
> > partial replacement for NO INDENT, or does it produce a format just
> > for comparison? The CANONICAL format is probably not strictly
> > standardized, because libxml2 removes indentation, but the examples in
> > https://www.w3.org/TR/xml-c14n11/ don't do that. So this format makes
> > sense just for local operations.
> My idea with CANONICAL was not to replace NO INDENT. The intent was to
> format xml strings in a standardized way, so that they can be compared.
> For instance, removing comments, sorting attributes, converting CDATA
> strings, converting empty elements to start-end tag pairs, removing
> whitespace between elements, etc ...
>
> The W3C recommendation for Canonical XML[1] dictates the following
> regarding the removal of whitespace between elements:
>
> * Whitespace outside of the document element and within start and end
> tags is normalized
> * All whitespace in character content is retained (excluding characters
> removed during line feed normalization)
>
> >
> > I like this functionality, and it is great that the functionality
> > from libxml2 can be used, but I think the fact that there are four
> > incompatible implementations of xmlserialize is messy. It would be
> > nice if we found some intersection between SQL/XML and Oracle instead
> > of a new proprietary syntax.
> >
> > In Oracle syntax, is CANONICAL +/- NO INDENT SHOW DEFAULT ?
>
> No.
> XMLSERIALIZE ... NO INDENT is supposed, as the name suggests, to
> serialize an xml string without indenting it. One could argue that not
> indenting can be translated as removing indentation, but I couldn't find
> anything concrete about this in the SQL/XML spec. If it's indeed the
> case, we should correct XMLSERIALIZE .. NO INDENT, but it is unrelated
> to this patch.
>
> CANONICAL serializes a physical representation of an xml document. In a
> nutshell, XMLSERIALIZE ... CANONICAL sort of "rewrites" the xml string
> with the following rules (list from the W3C recommendation):
>
> * The document is encoded in UTF-8
> * Line breaks normalized to #xA on input, before parsing
> * Attribute values are normalized, as if by a validating processor
> * Character and parsed entity references are replaced
> * CDATA sections are replaced with their character content
> * The XML declaration and document type declaration are removed
> * Empty elements are converted to start-end tag pairs
> * Whitespace outside of the document element and within start and end
> tags is normalized
> * All whitespace in character content is retained (excluding characters
> removed during line feed normalization)
> * Attribute value delimiters are set to quotation marks (double quotes)
> * Special characters in attribute values and character content are
> replaced by character references
> * Superfluous namespace declarations are removed from each element
> * Default attributes are added to each element
> * Fixup of xml:base attributes [C14N-Issues] is performed
> * Lexicographic order is imposed on the namespace declarations and
> attributes of each element
>
> btw: Oracle's SIZE =, HIDE DEFAULTS, and SHOW DEFAULTS are not part of
> the SQL/XML standard either :)
>

I know. It looks like this function is not well designed in general.
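
As a rough illustration of the rules quoted above (not code from the patch),
here is a small sketch using Python's standard library. It assumes
xml.etree.ElementTree.canonicalize(), which implements C14N 2.0 rather than
the C14N 1.1 recommendation cited in this thread, so details may differ from
what libxml2 or the proposed XMLSERIALIZE ... CANONICAL would produce. The
sample documents are made up.

```python
# Sketch only: Python standard library, not PostgreSQL or libxml2.
# canonicalize() implements C14N 2.0, which differs in details from C14N 1.1.
from xml.etree.ElementTree import canonicalize

# Two physically different spellings of the same (made-up) document:
# XML declaration, unsorted attributes, single quotes, a comment,
# an empty-element tag and a CDATA section versus a plain form.
doc_a = """<?xml version="1.0"?>
<doc b="2" a='1'><!-- a comment --><empty/><![CDATA[x<y]]></doc>"""
doc_b = '<doc a="1" b="2"><empty></empty>x&lt;y</doc>'

canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

# Both inputs reduce to the same string, so they can be compared directly.
print(canon_a)             # <doc a="1" b="2"><empty></empty>x&lt;y</doc>
print(canon_a == canon_b)  # True

# Whitespace inside the document element is character content and is kept.
print(canonicalize("<a> <b/> </a>"))  # <a> <b></b> </a>
```

The last line shows the point about whitespace: this implementation keeps the
blanks between elements inside the document element, as the W3C rules
describe.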

>
> > My objection to CANONICAL is that SQL/XML and Oracle allow
> > XMLSERIALIZE to be parametrized more precisely, and before
> > implementing a new feature we should clear the table and say what we
> > want to have in XMLSERIALIZE.
> >
> > As an alternative to enhancing XMLSERIALIZE, I can imagine just a
> > function "to_canonical(xml, without_comments bool default false)". In
> > this case we don't need to resolve the relationship to SQL/XML or
> > Oracle.
>
> Creating a separate serialization function would be IMHO way less
> elegant than parametrizing XMLSERIALIZE, but it would be something I
> could live with in case we decide to go down this path.
>

I am not strongly against enhancing XMLSERIALIZE, but it would be nice to see
some wider concept first. Currently the state looks just random, and I have
not seen any serious discussion about the implementation of SQL/XML. We don't
necessarily need to be compatible with Oracle, but it can help if we have
functionality that can be used for conversions.

> Thanks!
>
> --
> Jim
>
> 1 - https://www.w3.org/TR/xml-c14n11/
>
>
