From: | Florian Pflug <fgp(at)phlo(dot)org> |
---|---|
To: | PG Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | XPATH vs. server_encoding != UTF-8 |
Date: | 2011-07-23 15:49:37 |
Message-ID: | AA288A66-0451-4AC4-9DBD-1AFC7892F25D@phlo.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi
The current thread about JSON and the ensuing discussion about the
XML types' behaviour in non-UTF8 databases made me try out how well
XPATH() copes with that situation. The code, at least, looks
suspicious - XPATH neither verifies that the server encoding is UTF-8,
not does it pass the server encoding on to libxml's xpath functions.
So I created a database with encoding ISO-8859-1 (LATIN1), and did
(which aclient encoding matching my terminal's settings)
CREATE TABLE X (d XML);
INSERT INTO X VALUES ('<r a="ä"/>');
i.e, I inserted the XML document <r a="ä"/>, but without using
an entity reference for the german Umlaut-A. Then I attempted to extract
the length of r's attribute "a" with the XPATH /r/@a, both with the XPath
function string-length (which works now! yay!) and with postgres'
LENGTH() function.
SELECT
(XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
FROM X;
The XPATH() function itself doesn't complain, but libxml does - it expects
UTF-8 encoded data, and screams bloody murder when it encounters the
ISO-8859-1-encoded Umlaut-A
ERROR: could not parse XML document
DETAIL: line 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE4 0x22 0x2F 0x3E
<r a="ä"/>
That might seem fine on the surface - we did, after all, error out instead
of producing potentially non-sensical results. However, libxml's ability to
detect this error relies on it's ability to distinguish between UTF-8 and
non-UTF-8 encoded strings. Which, of course, doesn't work in the general case.
So for my next try, I deliberately set client_encoding to ISO-8859-1, even
though my terminal uses UTF-8, removed all data from table X, and did
INSERT INTO X VALUES ('<r a="ä"/>');
again. The effect is that is that X now contains ISO-8859-1 encoded data
which *happens* to look like valid UTF-8. After changing the client_encoding
back to UTF-8, the value we just inserted looks like that
<r a="ä"/>
Now I invoked the XPATH query from above again.
SELECT
(XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
FROM X;
As predicted, it doesn't raise an error this time, since libxml is unable
to distinguish the ISO-8859-1 string '<r a="ä"/' from valid UTF-8. But the
result is still wrongs, since the string-length() function counts 'ä' as just
one character, when it reality it are of course contains two.
xpath_length | pg_length
--------------+-----------
1 | 2
The easiest way to fix this would be to make XPATH() flat-out refuse to
do anything if the server encoding isn't UTF-8. But that seems a bit harsh -
things actually do work correctly as long as the XML document contains only
ASCII characters, and existing applications might depend on that.
So what I think we should do is tell libxml that the encoding is ASCII
if the server encoding isn't UTF-8. With that change, the query above
produces
ERROR: could not parse XML document
DETAIL: encoder error
which seems sane. Replacing the data in X with ASCII-only data makes the
error go away, and the result is then correct also.
DELETE FROM X;
INSERT INTO X VALUES ('<r a="a"/>');
SELECT
(XPATH('string-length(/r/@a)', d))[1] AS xpath_length,
LENGTH((XPATH('/r/@a', d))[1]::text) AS pg_length
FROM X;
gives
xpath_length | pg_length
--------------+-----------
1 | 1
Proof-of-concept patch attached, but doesn't yet include documentation
updates.
Comments? Thoughts? Suggestions?
best regards,
Florian Pflug
Attachment | Content-Type | Size |
---|---|---|
xpath_nonutf8.patch | application/octet-stream | 1.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Florian Pflug | 2011-07-23 16:46:32 | Re: XPATH vs. server_encoding != UTF-8 |
Previous Message | Tom Lane | 2011-07-23 15:08:52 | Re: [GENERAL] Dropping extensions |