Re: Strange output of XML attribute values

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: radist-hack(at)yandex(dot)ru
Cc: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: Strange output of XML attribute values
Date: 2020-09-16 13:40:20
Message-ID: CAFj8pRAvUqUmGFjDYhNt32Udd87dL_0e2muerxwCFw9C08qx8w@mail.gmail.com
Lists: pgsql-bugs

On Wed, 16 Sep 2020 at 14:50, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
wrote:

>
>
> On Wed, 16 Sep 2020 at 14:11, Andrew Marynchuk (Андрей Маринчук) <
> radist(dot)nt(at)gmail(dot)com> wrote:
>
>> This problem is quite old, but it makes the XML generation functions in
>> PostgreSQL unusable in some cases, or at least requires parsing and
>> regenerating the XML afterwards with an external utility. It reproduces in
>> PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit (Windows 10),
>> but I've seen the same problem in a 9.6 build from the CentOS yum package.
>>
>> *How to reproduce*:
>> Just execute the query (actually the xmlelement call alone is enough to
>> reproduce the problem):
>> select xmlserialize(document xmlroot(xmlelement(name "ЭлементВКириллице",
>> xmlattributes('ЗначениеВКириллице' as "АтрибутВКириллице"),
>> 'ТекстВКириллице'), version '1.0', standalone yes) as text);
>>
>> *Expected result*:
>> <?xml version="1.0" standalone="yes"?><ЭлементВКириллице
>> АтрибутВКириллице="ЗначениеВКириллице">ТекстВКириллице</ЭлементВКириллице>
>>
>> *Actual result*:
>> <?xml version="1.0" standalone="yes"?><ЭлементВКириллице
>> АтрибутВКириллице="&#x417;&#x43D;&#x430;&#x447;&#x435;&#x43D;&#x438;&#x435;&#x412;&#x41A;&#x438;&#x440;&#x438;&#x43B;&#x43B;&#x438;&#x446;&#x435;">ТекстВКириллице</ЭлементВКириллице>
>>
>> This example uses Cyrillic letters, but it could be any non-ASCII
>> character.
>> According to the discussion
>> <https://www.sql.ru/forum/775061/russkiy-yazyk-v-xml?hl=libxml>, this
>> problem arises because PostgreSQL does not tell libxml2 the document
>> encoding: there is no xmlTextWriterStartDocument call, so libxml2 does not
>> know that the encoding is UTF-8 and that non-ASCII characters could be
>> written without converting them to &#x...; sequences.
>>
>
> I don't think that is true. The character-reference escaping is done only
> in attributes, and only when the output encoding is UTF-8. When you use an
> 8-bit encoding with Cyrillic support, it is OK.
>
>
>>
>> In the modern world, UTF-8 encoding is used everywhere, and such
>> unnecessary character conversion looks strange. The current workaround is
>> to pass the generated content to a PL/Python function that parses the XML
>> and writes it back (xml.dom.minidom.parseString(...).toxml()).
>>
>
> If I remember the discussion about this topic correctly, the problem is
> the XML standard, which requires this escaping in attribute values.
>
> I reported this issue 10 (maybe 15) years ago to the libxml2 developers,
> and it was rejected. Maybe libxml2 supports XML standards that are too old.
> I don't know. This library has been in a frozen state for years, but there
> is no replacement.
>
> It is a libxml2 problem, and it should be reported and fixed there. It is
> not possible to fix this issue on the Postgres side.
>

xmlTextWriterWriteAttribute does this unwanted encoding;
xmlTextWriterWriteRaw doesn't.

postgres=# select xmlelement(name "aho", xmlattributes('žlutý kůň' as x),
'žlutý kǔň');
NOTICE: >>>>>žlutý kůň
NOTICE: **** >>>>>žlutý kǔň
┌───────────────────────────────────────────────────────────┐
│                        xmlelement                         │
╞═══════════════════════════════════════════════════════════╡
│ <aho x="&#x17E;lut&#xFD; k&#x16F;&#x148;">žlutý kǔň</aho> │
└───────────────────────────────────────────────────────────┘
(1 row)

So the information needed to understand this issue should be somewhere in
the difference between these two functions.
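
For reference, here is a minimal Python sketch of the workaround mentioned in
the original report (xml.dom.minidom.parseString(...).toxml()), applied to
the xmlelement output shown above. The helper name reserialize is only an
assumption for illustration, and this is plain Python rather than an actual
PL/Python function. It also shows that the escaped attribute value and the
literal one are the same string to an XML parser, so the output is valid
XML, just harder to read:

```python
from xml.dom import minidom

# Output of xmlelement() as shown above: the attribute value is written with
# numeric character references, the text node with literal characters.
pg_output = '<aho x="&#x17E;lut&#xFD; k&#x16F;&#x148;">žlutý kǔň</aho>'


def reserialize(xml_text: str) -> str:
    """Parse the XML and write it back (hypothetical helper name)."""
    return minidom.parseString(xml_text).toxml()


# The parser resolves the character references while reading the attribute.
doc = minidom.parseString(pg_output)
attr_value = doc.documentElement.getAttribute("x")
print(attr_value)

# minidom writes non-ASCII characters literally when serializing to a str.
print(reserialize(pg_output))
```

Note that toxml() on a document also prepends its own XML declaration, so the
result is not byte-for-byte what xmlserialize would produce.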

Regards

Pavel

>
> Regards
>
> Pavel
>
>
