From: | Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> |
---|---|
To: | radist-hack(at)yandex(dot)ru |
Cc: | PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Strange output of XML attribute values |
Date: | 2020-09-17 03:49:14 |
Message-ID: | CAFj8pRBJiQnykgF71JwYdyMci74XU4Dk2iA7wPM=aYU1yKEdEA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
st 16. 9. 2020 v 22:23 odesílatel Andrew Marynchuk (Андрей Маринчук) <
radist(dot)nt(at)gmail(dot)com> napsal:
>
>
> ср, 16 сент. 2020 г. в 15:51, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
>
>>
>>
>> st 16. 9. 2020 v 14:11 odesílatel Andrew Marynchuk (Андрей Маринчук) <
>> radist(dot)nt(at)gmail(dot)com> napsal:
>>
>>> This problem is quite old, but it leads to the inability to use XML
>>> generation functions in PostgreSQL database for some cases, or at least
>>> requires to perform subsequent parsing and regenerating XML by an external
>>> utility. It reproduces in PostgreSQL 12.4, compiled by Visual C++ build
>>> 1914, 64-bit (windows 10), but I've seen the same problem in 9.6 build from
>>> CentOS yum package.
>>>
>>> *How to reproduce*:
>>> Just execute the query (actually the xmlelement call is enough to
>>> reproduce the proble):
>>> select xmlserialize(document xmlroot(xmlelement(name
>>> "ЭлементВКириллице", xmlattributes('ЗначениеВКириллице' as
>>> "АтрибутВКириллице"), 'ТекстВКириллице'), version '1.0', standalone yes) as
>>> text);
>>>
>>> *Expected result*:
>>> <?xml version="1.0" standalone="yes"?><ЭлементВКириллице
>>> АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>
>>>
>>> *Actual result*:
>>> <?xml version="1.0" standalone="yes"?><ЭлементВКириллице
>>> АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>
>>>
>>> This example uses cyrillic letters, but it could be any non-ASCII
>>> character.
>>> According to the discussion
>>> <https://www.sql.ru/forum/775061/russkiy-yazyk-v-xml?hl=libxml>, this
>>> problem arises because PostgreSQL does not provides libxml2 an information
>>> of document encoding due to the lack of xmlTextWriterStartDocument call, so
>>> libxml2 has no idea that encoding is UTF-8 and non-ASCII characters could
>>> be written without converting to &#x...;-sequences.
>>>
>>
>> I don't think it is true. The url encoding is done only in attributes,
>> and only when the output encoding will be utf8. When you try to use 8bit
>> encoding with Azbuka support, it will be ok.
>>
>
> I've made some investigation on libxml2 sources
> <https://gitlab.gnome.org/GNOME/libxml2>.
> Seems like urlencoding is done in xmlBufAttrSerializeTxtContent function
> from xmlsave.c after the condition
> <https://gitlab.gnome.org/GNOME/libxml2/-/blob/00a86d414ba9a9e1cd588182b87518e4e3af9466/xmlsave.c#L2035>
> is met. The key part is (doc == NULL) || (doc->encoding == NULL).
> The xmlTextWriterWriteString is the only function calls
> xmlBufAttrSerializeTxtContent in xmlwriter.c and only for
> XML_TEXTWRITER_ATTRIBUTE state
> <https://gitlab.gnome.org/GNOME/libxml2/-/blob/00a86d414ba9a9e1cd588182b87518e4e3af9466/xmlwriter.c#L1504>,
> so urlencoding is really done only for attributes, but it could be turned
> off by providing the xmlwriter with encoding.
> But there is no api function except xmlTextWriterStartDocument to modify
> the (xmlTextWriterPtr argument) -> doc -> encoding field.
>
> In other words, the libxml2 library does not know the codepage and assumes
> it as ASCII, so it turns on the urlencoding of all attributes. I don't have
> an idea why all other textual document parts are passed as is.Seems like
> PostgreSQL never specifies the codepage (I haven't seen a encoding
> declaration in document prolog for documents generated in PostgreSQL). It's
> reasonable because charset is not defined for varchar type and xml
> converted to varchar is some abstract text which could be converted to any
> codepage.
>
xml is usually dumped in client side encoding
<--><-->xmlTextWriterStartElement(writer, (xmlChar *) xexpr->name);
<--><-->forboth(arg, named_arg_strings, narg, xexpr->arg_names)
<--><-->{
<--><--><-->char<--> *str = (char *) lfirst(arg);
<--><--><-->char<--> *argname = strVal(lfirst(narg));
<--><--><-->if (str)
<--><--><--><-->xmlTextWriterWriteAttribute(writer,
<--><--><--><--><--><--><--><--><--><--><-->(xmlChar *) argname,
<--><--><--><--><--><--><--><--><--><--><-->(xmlChar *) str);
<--><-->}
<--><-->foreach(arg, arg_strings)
<--><-->{
<--><--><-->char<--> *str = (char *) lfirst(arg);
<--><--><-->xmlTextWriterWriteRaw(writer, (xmlChar *) str);
<--><-->}
<--><-->xmlTextWriterEndElement(writer);
There is a code from xml.c. Is there some possibility how to specify
encoding?
>
>
>>
>>>
>>> In the modern world, UTF-8 encoding is used everywhere and such
>>> unnecessary character converting looks strange. Current workaround is
>>> passing generated content to the pl/python function which parses and writes
>>> back the xml (xml.dom.minidom.parseString(...).toxml()).
>>>
>>
>> If I remember some discussion about this topic, the problem is XML
>> standard, that requires url encoding in attribute values.
>>
>> I reported this issue 10 (maybe 15) years ago to libxml2 developers, and
>> It was rejected. Maybe libxml2 supports too old XML standards. I don't know
>> - this library is years in frozen state, but there is no replacement.
>>
>
> I traced the description of attributes from the current specification
> through previous versions till the draft version published in 1996, but
> didn't find such a requirement. Attribute value defined as '"' ([^<&"] |
> Reference <https://www.w3.org/TR/2004/REC-xml-20040204/#NT-Reference>)*
> '"' | "'" ([^<&'] | Reference
> <https://www.w3.org/TR/2004/REC-xml-20040204/#NT-Reference>)* "'", so
> only three characters are disallowed in attributes.
>
I tried to find this topic, but without success. Maybe my memory doesn't
serve well, but I think I solved a similar issue, and I had to generate a
document explicitly with latin2 encoding to fix this issue.
But it is in agreement with your description - if we generate xml without
specified encoding (we expect UTF8), then url encoding is active.
If you can play with it - all related codes for postgres side are in xml.c
source code file.
From | Date | Subject | |
---|---|---|---|
Next Message | Michael Paquier | 2020-09-17 05:53:27 | Re: Since '2001-09-09 01:46:40'::timestamp microseconds are lost when extracting epoch |
Previous Message | Tom Lane | 2020-09-17 01:19:54 | Re: BUG #16620: Autovacuum does not process certain databases after migration from postgresql 10 |