Re: [PATCH] Add CANONICAL option to xmlserialize

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
Cc: Chapman Flack <chap(at)anastigmatix(dot)net>, vignesh C <vignesh21(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Add CANONICAL option to xmlserialize
Date: 2024-08-26 10:30:28
Message-ID: CAFj8pRDN7yEPhKrr85ZWP_udF20R7qs0nvwsZhzQGpVp9fPRUg@mail.gmail.com
Lists: pgsql-hackers

On Mon, 26 Aug 2024 at 11:32, Jim Jones <jim(dot)jones(at)uni-muenster(dot)de>
wrote:

> Hi Pavel
>
> On 25.08.24 20:57, Pavel Stehule wrote:
> >
> > There is unwanted white space in the patch
> >
> > -<-><--><-->xmlFreeDoc(doc);
> > +<->else if (format == XMLSERIALIZE_CANONICAL || format ==
> > XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
> > + <>{
> > +<-><-->xmlChar *xmlbuf = NULL;
> > +<-><-->int nbytes;
> > +<-><-->int
> >
> I missed that one. Just removed it, thanks!
> > 1. The XML is always serialized to a UTF-8 string, but when the target
> > type is varchar or text, it should always be encoded in the database
> > encoding. It is not possible to hold a UTF-8 string in a varchar in a
> > latin2 database.
> I'm calling xml_parse using GetDatabaseEncoding(), so I thought I would
> be on the safe side
>
> if (format == XMLSERIALIZE_CANONICAL ||
>     format == XMLSERIALIZE_CANONICAL_WITH_NO_COMMENTS)
>     doc = xml_parse(data, XMLOPTION_DOCUMENT, false,
>                     GetDatabaseEncoding(), NULL, NULL, NULL);
> ... or you mean something else?
>

Maybe I was confused by the initial message.

>
> > 2. The proposed feature can add to the confusion around the
> > implementation of NO INDENT. I am not an expert in this area, so I
> > checked other databases. DB2 does not have anything similar. But
> > Oracle's "NO INDENT" clause is very similar to the proposed "CANONICAL".
> > Unfortunately, NO INDENT behaves differently there: Oracle's really
> > removes formatting, while Postgres does nothing.
>
> Coincidentally, the [NO] INDENT support for xmlserialize is an old patch
> of mine.
> It basically "does nothing" and prints the xml as is, e.g.
>
> SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
> a="8"><![CDATA[0&1]]></val></bar></foo>' AS text INDENT);
>                 xmlserialize
> --------------------------------------------
>  <foo>                                     +
>    <bar>                                   +
>      <val z="1" a="8"><![CDATA[0&1]]></val>+
>    </bar>                                  +
>  </foo>                                    +
>
> (1 row)
>
> SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
> a="8"><![CDATA[0&1]]></val></bar></foo>' AS text NO INDENT);
> xmlserialize
> --------------------------------------------------------------
> <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
> (1 row)
>
> SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
> a="8"><![CDATA[0&1]]></val></bar></foo>' AS text);
> xmlserialize
> --------------------------------------------------------------
> <foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>
> (1 row)
>
> ... while CANONICAL converts the XML to its canonical form [1,2], e.g.
> sorting attributes and replacing CDATA sections with their content:
>
> SELECT xmlserialize(DOCUMENT '<foo><bar><val z="1"
> a="8"><![CDATA[0&1]]></val></bar></foo>' AS text CANONICAL);
> xmlserialize
> ------------------------------------------------------
> <foo><bar><val a="8" z="1">0&amp;1</val></bar></foo>
> (1 row)
>
> xmlserialize CANONICAL does not exist in any other database and it's not
> part of the SQL/XML standard.
>
> Regarding the different behaviour of NO INDENT in Oracle and PostgreSQL:
> it is not entirely clear to me if SQL/XML states that NO INDENT must
> remove the indentation from xml strings.
> It says:
>
> "INDENT — the choice of whether to “pretty-print” the serialized XML by
> means of indentation, either
> True or False.
> ....
> i) If <XML serialize indent> is specified and does not contain NO, then
> let IND be True.
> ii) Otherwise, let IND be False."
>
> When I wrote the patch I assumed it meant to leave the xml as is .. but
> I might be wrong.
> Perhaps it would be best if we open a new thread for this topic.
>

I think the purpose of CANONICAL should be specified: is it a partial
replacement for NO INDENT, or does it produce a format meant only for
comparison? The CANONICAL format is probably not strictly standardized,
because libxml2 removes indentation, but the examples in
https://www.w3.org/TR/xml-c14n11/ don't. So this format makes sense only
for local operations.
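
To illustrate the "format just for comparing" reading, here is a minimal
sketch in Python. It uses the standard library's
xml.etree.ElementTree.canonicalize(), which implements C14N 2.0, as a
stand-in; it is not the libxml2 code path used by the patch, and its output
may differ from the patch in details. The variable names are made up for the
illustration. It shows two documents that differ only in attribute order and
CDATA usage becoming identical after canonicalization, and that this
particular implementation keeps indentation unless strip_text is passed.

```python
from xml.etree.ElementTree import canonicalize

# Same content, different spelling: attribute order and CDATA vs. escaped text.
a = '<foo><bar><val z="1" a="8"><![CDATA[0&1]]></val></bar></foo>'
b = '<foo><bar><val a="8" z="1">0&amp;1</val></bar></foo>'

# After canonicalization the two serializations can be compared as strings.
same_after_c14n = canonicalize(a) == canonicalize(b)

# Whitespace between elements is part of the document and is kept by default.
indented = '<foo>\n  <bar/>\n</foo>'
kept = canonicalize(indented)

# strip_text is an option of this Python function, not of the proposed patch.
stripped = canonicalize(indented, strip_text=True)

print(same_after_c14n)
print(repr(kept))
print(repr(stripped))
```

Whether CANONICAL in xmlserialize should behave like the default or like
the stripped variant is exactly the question that needs an answer.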

I like this functionality, and it is great that the functionality from
libxml2 can be used, but I think it is messy that there are four
incompatible implementations of xmlserialize. It would be nice if we could
find some intersection between SQL/XML and Oracle instead of introducing a
new proprietary syntax.

In Oracle syntax, is CANONICAL more or less NO INDENT SHOW DEFAULTS?

My objection to CANONICAL is that SQL/XML and Oracle allow XMLSERIALIZE to
be parameterized more precisely, and before implementing a new feature we
should clear the table and say what we want to have in XMLSERIALIZE.

As an alternative to enhancing XMLSERIALIZE, I can imagine just a function
"to_canonical(xml, without_comments bool default false)". In that case we
don't need to resolve how it relates to SQL/XML or Oracle.
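
A rough sketch of how such a function could behave, written in Python only
for illustration. The name to_canonical and the without_comments parameter
come from the proposal above; the body is an assumption that borrows
xml.etree.ElementTree.canonicalize() (C14N 2.0) from the Python standard
library instead of libxml2, so it does not claim to be what a PostgreSQL
implementation would look like.

```python
from xml.etree.ElementTree import canonicalize


def to_canonical(xml: str, without_comments: bool = False) -> str:
    """Return the canonical form of an XML document given as a string.

    Hypothetical helper mirroring the proposed SQL signature; comments are
    kept unless without_comments is true.
    """
    # canonicalize() drops comments by default, so the flag is inverted here.
    return canonicalize(xml, with_comments=not without_comments)


doc = '<foo><!-- note --><val z="1" a="8"><![CDATA[0&1]]></val></foo>'
with_comments = to_canonical(doc)
no_comments = to_canonical(doc, without_comments=True)

print(with_comments)
print(no_comments)
```

A plain function like this keeps the comment handling in an ordinary
argument and leaves the XMLSERIALIZE grammar untouched.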

> Thank you for reviewing this patch. Much appreciated!
>
> Best,
>
> --
> Jim
>
> 1 - https://www.w3.org/TR/xml-c14n11/
> 2 - https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-c14n.html
>
>
