From: | Jim Jones <jim(dot)jones(at)uni-muenster(dot)de> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | [PoC] Add CANONICAL option to xmlserialize |
Date: | 2023-02-27 13:16:30 |
Message-ID: | 67fa8560-8d61-5d06-8178-fc9c7684db90@uni-muenster.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
In order to compare pairs of XML documents for equivalence it is
necessary to convert them first to their canonical form, as described at
W3C Canonical XML 1.1.[1] This spec basically defines a standard
physical representation of xml documents that have more then one
possible representation, so that it is possible to compare them, e.g.
forcing UTF-8 encoding, entity reference replacement, attributes
normalization, etc.
Although it is not part of the XML/SQL standard, it would be nice to
have the option CANONICAL in xmlserialize. Additionally, we could also
add the attribute WITH [NO] COMMENTS to keep or remove xml comments from
the documents.
Something like this:
WITH t(col) AS (
VALUES
('<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE doc SYSTEM "doc.dtd" [
<!ENTITY val "42">
<!ATTLIST xyz attr CDATA "default">
]>
<!-- ordering of attributes -->
<foo ns:c = "3" ns:b = "2" ns:a = "1"
xmlns:ns="http://postgresql.org">
<!-- Normalization of whitespace in start and end tags -->
<!-- Elimination of superfluous namespace declarations,
as already declared in <foo> -->
<bar xmlns:ns="http://postgresql.org" >&val;</bar >
<!-- Empty element conversion to start-end tag pair -->
<empty/>
<!-- Effect of transcoding from a sample encoding to UTF-8 -->
<iso8859>©</iso8859>
<!-- Addition of default attribute -->
<!-- Whitespace inside tag preserved -->
<xyz> 321 </xyz>
</foo>
<!-- comment outside doc -->'::xml)
)
SELECT xmlserialize(DOCUMENT col AS text CANONICAL) FROM t;
xmlserialize
--------------------------------------------------------------------------------------------------------------------------------------------------------
<foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2"
ns:c="3"><bar>42</bar><empty></empty><iso8859>©</iso8859><xyz
attr="default"> 321 </xyz></foo>
(1 row)
-- using WITH COMMENTS
WITH t(col) AS (
VALUES
(' <foo ns:c = "3" ns:b = "2" ns:a = "1"
xmlns:ns="http://postgresql.org">
<!-- very important comment -->
<xyz> 321 </xyz>
</foo>'::xml)
)
SELECT xmlserialize(DOCUMENT col AS text CANONICAL WITH COMMENTS) FROM t;
xmlserialize
------------------------------------------------------------------------------------------------------------------------
<foo xmlns:ns="http://postgresql.org" ns:a="1" ns:b="2" ns:c="3"><!--
very important comment --><xyz> 321 </xyz></foo>
(1 row)
Another option would be to simply create a new function, e.g.
xmlcanonical(doc xml, keep_comments boolean), but I'm not sure if this
would be the right approach.
Attached a very short draft. What do you think?
Best, Jim
Attachment | Content-Type | Size |
---|---|---|
Add-CANONICAL-Option-to-xmlserialize-draft.patch | text/x-patch | 8.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Dagfinn Ilmari Mannsåker | 2023-02-27 13:22:53 | Adding argument names to aggregate functions |
Previous Message | Juan José Santamaría Flecha | 2023-02-27 13:11:34 | Re: buildfarm + meson |