Re: Strange output of XML attribute values

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: radist-hack(at)yandex(dot)ru
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Strange output of XML attribute values
Date: 2020-09-17 03:49:14
Message-ID: CAFj8pRBJiQnykgF71JwYdyMci74XU4Dk2iA7wPM=aYU1yKEdEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

st 16. 9. 2020 v 22:23 odesílatel Andrew Marynchuk (Андрей Маринчук) <
radist(dot)nt(at)gmail(dot)com> napsal:

>
>
> ср, 16 сент. 2020 г. в 15:51, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
>
>>
>>
>> st 16. 9. 2020 v 14:11 odesílatel Andrew Marynchuk (Андрей Маринчук) <
>> radist(dot)nt(at)gmail(dot)com> napsal:
>>
>>> This problem is quite old, but it leads to the inability to use XML
>>> generation functions in PostgreSQL database for some cases, or at least
>>> requires to perform subsequent parsing and regenerating XML by an external
>>> utility. It reproduces in PostgreSQL 12.4, compiled by Visual C++ build
>>> 1914, 64-bit (windows 10), but I've seen the same problem in 9.6 build from
>>> CentOS yum package.
>>>
>>> *How to reproduce*:
>>> Just execute the query (actually the xmlelement call is enough to
>>> reproduce the proble):
>>> select xmlserialize(document xmlroot(xmlelement(name
>>> "ЭлементВКириллице", xmlattributes('ЗначениеВКириллице' as
>>> "АтрибутВКириллице"), 'ТекстВКириллице'), version '1.0', standalone yes) as
>>> text);
>>>
>>> *Expected result*:
>>> <?xml version="1.0" standalone="yes"?><ЭлементВКириллице
>>> АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>
>>>
>>> *Actual result*:
>>> <?xml version="1.0" standalone="yes"?><ЭлементВКириллице
>>> АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>
>>>
>>> This example uses cyrillic letters, but it could be any non-ASCII
>>> character.
>>> According to the discussion
>>> <https://www.sql.ru/forum/775061/russkiy-yazyk-v-xml?hl=libxml>, this
>>> problem arises because PostgreSQL does not provides libxml2 an information
>>> of document encoding due to the lack of xmlTextWriterStartDocument call, so
>>> libxml2 has no idea that encoding is UTF-8 and non-ASCII characters could
>>> be written without converting to &#x...;-sequences.
>>>
>>
>> I don't think it is true. The url encoding is done only in attributes,
>> and only when the output encoding will be utf8. When you try to use 8bit
>> encoding with Azbuka support, it will be ok.
>>
>
> I've made some investigation on libxml2 sources
> <https://gitlab.gnome.org/GNOME/libxml2>.
> Seems like urlencoding is done in xmlBufAttrSerializeTxtContent function
> from xmlsave.c after the condition
> <https://gitlab.gnome.org/GNOME/libxml2/-/blob/00a86d414ba9a9e1cd588182b87518e4e3af9466/xmlsave.c#L2035>
> is met. The key part is (doc == NULL) || (doc->encoding == NULL).
> The xmlTextWriterWriteString is the only function calls
> xmlBufAttrSerializeTxtContent in xmlwriter.c and only for
> XML_TEXTWRITER_ATTRIBUTE state
> <https://gitlab.gnome.org/GNOME/libxml2/-/blob/00a86d414ba9a9e1cd588182b87518e4e3af9466/xmlwriter.c#L1504>,
> so urlencoding is really done only for attributes, but it could be turned
> off by providing the xmlwriter with encoding.
> But there is no api function except xmlTextWriterStartDocument to modify
> the (xmlTextWriterPtr argument) -> doc -> encoding field.
>
> In other words, the libxml2 library does not know the codepage and assumes
> it as ASCII, so it turns on the urlencoding of all attributes. I don't have
> an idea why all other textual document parts are passed as is.Seems like
> PostgreSQL never specifies the codepage (I haven't seen a encoding
> declaration in document prolog for documents generated in PostgreSQL). It's
> reasonable because charset is not defined for varchar type and xml
> converted to varchar is some abstract text which could be converted to any
> codepage.
>

xml is usually dumped in client side encoding

<--><-->xmlTextWriterStartElement(writer, (xmlChar *) xexpr->name);

<--><-->forboth(arg, named_arg_strings, narg, xexpr->arg_names)
<--><-->{
<--><--><-->char<--> *str = (char *) lfirst(arg);
<--><--><-->char<--> *argname = strVal(lfirst(narg));

<--><--><-->if (str)
<--><--><--><-->xmlTextWriterWriteAttribute(writer,
<--><--><--><--><--><--><--><--><--><--><-->(xmlChar *) argname,
<--><--><--><--><--><--><--><--><--><--><-->(xmlChar *) str);
<--><-->}

<--><-->foreach(arg, arg_strings)
<--><-->{
<--><--><-->char<--> *str = (char *) lfirst(arg);

<--><--><-->xmlTextWriterWriteRaw(writer, (xmlChar *) str);
<--><-->}

<--><-->xmlTextWriterEndElement(writer);

There is a code from xml.c. Is there some possibility how to specify
encoding?

>
>
>>
>>>
>>> In the modern world, UTF-8 encoding is used everywhere and such
>>> unnecessary character converting looks strange. Current workaround is
>>> passing generated content to the pl/python function which parses and writes
>>> back the xml (xml.dom.minidom.parseString(...).toxml()).
>>>
>>
>> If I remember some discussion about this topic, the problem is XML
>> standard, that requires url encoding in attribute values.
>>
>> I reported this issue 10 (maybe 15) years ago to libxml2 developers, and
>> It was rejected. Maybe libxml2 supports too old XML standards. I don't know
>> - this library is years in frozen state, but there is no replacement.
>>
>
> I traced the description of attributes from the current specification
> through previous versions till the draft version published in 1996, but
> didn't find such a requirement. Attribute value defined as '"' ([^<&"] |
> Reference <https://www.w3.org/TR/2004/REC-xml-20040204/#NT-Reference>)*
> '"' | "'" ([^<&'] | Reference
> <https://www.w3.org/TR/2004/REC-xml-20040204/#NT-Reference>)* "'", so
> only three characters are disallowed in attributes.
>

I tried to find this topic, but without success. Maybe my memory doesn't
serve well, but I think I solved a similar issue, and I had to generate a
document explicitly with latin2 encoding to fix this issue.

But it is in agreement with your description - if we generate xml without
specified encoding (we expect UTF8), then url encoding is active.

If you can play with it - all related codes for postgres side are in xml.c
source code file.

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2020-09-17 05:53:27 Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch
Previous Message Tom Lane 2020-09-17 01:19:54 Re: BUG #16620: Autovacuum does not process certain databases after migration from postgresql 10