Re: xPath in a database with LATIN1 encoding

From: Holger Jakobs <holger(at)jakobs(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: xPath in a database with LATIN1 encoding
Date: 2021-04-25 08:44:42
Message-ID: caa976c4-6070-7eb8-1570-dee06b876a58@jakobs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin


Am 24.04.21 um 20:20 schrieb Jorge Silva:
> Thanks for the quick reply, Tom. I am trying something simpler. I am
> trying to find a way to run the xPath function with a xml file type,
> which has latin characters, such as:
> *SELECT*xpath(‘//xml_test/text()’,
> convert_from(convert_to(‘<xml_test>çã</xml_test>','utf-8’),'utf-8')*::*xml)
>
> This line returns the same error, as follows:
> [Code: 0, SQL State: 2200M]  ERROR: could not parse XML document
>   Detail: line 1: Input is not proper UTF-8, indicate encoding !
> Bytes: 0xE7 0xE3 0x3C 0x2F
> <xml_test>çã</xml_test>
>           ^
> This happens because the “text”, which is the output from
> convert_from() function is encoded with LATIN1, the database encoding
> set, and not UTF-8.  Is there a way that a text variable is not
> encoded as the database encoding set, but some other encoding set,

> such as UTF-8? From what I’ve searched for, it seems that it would be
> something similar to nvarchar that only exists in Microsoft SQL Server.
>
>
>
>> On 24 Apr 2021, at 13:49, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us
>> <mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us>> wrote:
>>
>> Jorge Silva <jorge(dot)silva93(at)gmail(dot)com
>> <mailto:jorge(dot)silva93(at)gmail(dot)com>> writes:
>>> The characters that it is not recognizing are both “ç” and “ã”
>>> because I think they are encoded differently in latin1 and utf-8. Is
>>> it possible to somehow use the xPath function with special
>>> characters in the XML and in a database which is not encoded with
>>> utf-8?
>>
>> I don't have a lot of expertise in this area, but I think you need
>> an explicit encoding indicator in the xml header, a la
>>
>>    <?xml encoding="latin1"?> ...
>>
>> On the whole, the xml type is definitely easier to use with database
>> encoding set to utf8.  I think you'll be paying for encoding conversion
>> every time we interact with libxml, for instance.
>>
>> regards, tom lane
>
The point is that the XML standard dictates that every document without
an explicit encoding, must be encodied in UTF-8. So, whenever you want
to use a different encoding, this has to be stated in the XML header, as
indicated in Tom Lane's answer.

So the encoding of XML columns isn't actually the one of the database
(although UTF-8 is best in any case), but simply is the XML standard.
Your client encoding may vary, though, but this would limit the
characters which can be transferred.

Regards,

Holger

--
Holger Jakobs, Bergisch Gladbach, Tel. +49-178-9759012

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message adnan salam 2021-04-25 12:56:22 Facing issue while upgrading from 11 ro 13.2
Previous Message Jorge Silva 2021-04-24 18:20:43 Re: xPath in a database with LATIN1 encoding