Re: possible encoding issues with libxml2 functions

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: possible encoding issues with libxml2 functions
Date: 2017-08-18 21:43:19
Message-ID: CAFj8pRBx5rqUvssw+TvhTLoqeTxQtL1KV7OaMt1joHHp1pOwqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi

2017-08-08 2:10 GMT+02:00 Noah Misch <noah(at)leadboat(dot)com>:

> On Wed, Apr 05, 2017 at 10:53:39PM +0200, Pavel Stehule wrote:
> > 2017-03-17 4:23 GMT+01:00 Noah Misch <noah(at)leadboat(dot)com>:
> > > On Sun, Mar 12, 2017 at 10:26:33PM +0100, Pavel Stehule wrote:
> > > > 2017-03-12 21:57 GMT+01:00 Noah Misch <noah(at)leadboat(dot)com>:
> > > > > On Sun, Mar 12, 2017 at 08:36:58PM +0100, Pavel Stehule wrote:
> > > > > > 2017-03-12 0:56 GMT+01:00 Noah Misch <noah(at)leadboat(dot)com>:
> > > > > Please add a test case.
> >
> > I am sending test case.
> >
> > I am afraid so this cannot be part of regress tests due strong dependency
> > on locale win1250.
>
> Right. Based on that, I've distilled this portable test case:
>
> -- Round-trip non-ASCII data through xpath().
> DO $$
> DECLARE
> xml_declaration text := '<?xml version="1.0"
> encoding="ISO-8859-1"?>';
> degree_symbol text;
> res xml[];
> BEGIN
> -- Per the documentation, xpath() doesn't work on non-ASCII data
> when
> -- the server encoding is not UTF8. The EXCEPTION block below,
> -- currently dead code, will be relevant if we remove this
> limitation.
> IF current_setting('server_encoding') <> 'UTF8' THEN
> RAISE LOG 'skip: encoding % unsupported for xml',
> current_setting('server_encoding');
> RETURN;
> END IF;
>
> degree_symbol := convert_from('\xc2b0', 'UTF8');
> res := xpath('text()', (xml_declaration ||
> '<x>' || degree_symbol || '</x>')::xml);
> IF degree_symbol <> res[1]::text THEN
> RAISE 'expected % (%), got % (%)',
> degree_symbol, convert_to(degree_symbol, 'UTF8'),
> res[1], convert_to(res[1]::text, 'UTF8');
> END IF;
> EXCEPTION
> -- character with byte sequence 0xc2 0xb0 in encoding "UTF8" has
> no equivalent in encoding "LATIN8"
> WHEN untranslatable_character THEN RAISE LOG 'skip: %', SQLERRM;
> -- default conversion function for encoding "UTF8" to
> "MULE_INTERNAL" does not exist
> WHEN undefined_function THEN RAISE LOG 'skip: %', SQLERRM;
> END
> $$
>

yes, probably libXML2 try to do check from utf8 encoding to header
specified encoding.

> > > > > Why not use xml_parse() instead of calling xmlCtxtReadMemory()
> > > > > directly? The answer is probably in the archives, because someone
> > > > > understood the problem enough to document "Some XML-related
> functions
> > > > > may not work at all on non-ASCII data when the server encoding is
> not
> > > > > UTF-8. This is known to be an issue for xpath() in particular."
> > > >
> > > > Probably there are two possible issues
> > >
> > > Would you research in the archives to confirm?
> > >
> > > > 1. what I touched - recv function does encoding to database encoding
> -
> > > > but document encoding is not updated.
> > >
> > > Using xml_parse() would fix that, right?
> > >
> >
> > It should to help, because it cuts a header - but it does little bit more
> > work, than we would - it check if xml is doc or not too.
>
> That's true. xpath() currently works on both DOCUMENT and CONTENT xml
> values.
> If xpath() used xml_parse(), this would stop working:
>
> SELECT xpath('text()', XMLPARSE (DOCUMENT '<!DOCTYPE foo
> []><x>bar</x>'));
>
> > One possible fix - and similar technique is used more times in xml.c code
> > .. xmlroot
>
> > + /* remove header */
> > + parse_xml_decl(string, &header_len, NULL, NULL, NULL);
>
> > - doc = xmlCtxtReadMemory(ctxt, (char *) string, len, NULL, NULL,
> 0);
> > + doc = xmlCtxtReadMemory(ctxt, (char *) string + header_len, len -
>
> > another possibility is using xml_out_internal - that is used in XMLTABLE
> > function - what was initial fix.
> >
> > xml_out_internal uses parse_xml_decl too, but it is little bit more
> > expensive due call print_xml_decl that has not any effect in this case
> > (where only encoding is interesting) - but it can has sense, when server
> > encoding is not UTF8, because in this case, xmlstr is not encoded to
> UTF8 -
> > so now, I am think the correct solution should be using xml_out_internal
> -
> > because it appends a header with server encoding to XML doc
>
> As you may have been implying here, your patch never adds an xml
> declaration
> that bears an encoding. (That is because it passes targetencoding=0 to
> xml_out_internal().) If we were going to fix xpath() to support non-ASCII
> characters in non-UTF8 databases, we would not do that by adding xml
> declarations. We would do our own conversion to UTF8, like xml_parse()
> does.
> In that light, I like this parse_xml_decl() approach better. Would you
> update
> your patch to use it and to add the test case I give above?
>
>
I need to do some recapitulation (my analyses was wrong):

a) all values created by xml_in iterface are in database encoding - input
string is stored without any change. xml_parse is called only due
validation.

b) inside xml_parse, the input is converted to UTF8, and document is read
by xmlCtxtReadDoc with explicitly specified "UTF-8" encoding or
by xmlParseBalancedChunkMemory with explicitly specified encoding "UTF8"
and removed decl section.

So for "xml_parse" based functions (xml_in, texttoxml, xml_is_document,
wellformated_xml) the database encoding is not important

c) xml_recv function does validation by xml_parse and translation to
database encoding.

Now I don't see a difference between @b and @c - so my hypotheses about
necessity to use recv interface was wrong.

So Looks like libXML2 behave - when we push document with encoding decl,
then this library expects document in this encoding

So test case is simple

-- should to work
select xpath('/enprimeur/vino/id'/,
'<enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');

-- should to fail
select xpath('/enprimeur/vino/id'/, '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');

I didn't expect this error on recv API, because we do implicit encoding to
database encoding - and I expected implicit removing of encoding
declaration. The problematic char is ok, the issue is different length

Other functions is working - so I have a expectation so xpath should to work

postgres=# select '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'::xml;
┌────────────────────────────────────────────────────────────────────┐
│ xml │
╞════════════════════════════════════════════════════════════════════╡
│ <enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur> │
└────────────────────────────────────────────────────────────────────┘
(1 row)

postgres=# select '<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>'
is document
postgres-# ;
┌──────────┐
│ ?column? │
╞══════════╡
│ t │
└──────────┘
(1 row)
postgres=# select xml_is_well_formed_document('<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
┌─────────────────────────────┐
│ xml_is_well_formed_document │
╞═════════════════════════════╡
│ t │
└─────────────────────────────┘
(1 row)

postgres=# select xml_is_well_formed_content('<?xml version="1.0"
encoding="windows-1250"?><enprimeur><vino><id>909</id><remark>ň</remark></vino></enprimeur>');
┌────────────────────────────┐
│ xml_is_well_formed_content │
╞════════════════════════════╡
│ t │
└────────────────────────────┘
(1 row)

d) XMLEXISTS doesn't work too .. because it share code wit xpath function

This change can make things worse in a non-UTF8 database. The following
> happens to work today in a LATIN1 database, but it will cease to work:
>
> SELECT xpath('string-length()', ('<?xml version="1.0"
> encoding="ISO-8859-1"?>' ||
> '<x>' || chr(176) || '</x>')::xml);
>
> However, extracting actual text already gives wrong answers, because we
> treat
> the UTF8 response from libxml2 as though it were LATIN1:
>
> SELECT xpath('string()', ('<?xml version="1.0" encoding="ISO-8859-1"?>' ||
> '<x>' || chr(176) || '</x>')::xml);
>
> I plan to back-patch this. The documentation says "Some XML-related
> functions
> may not work at all on non-ASCII data when the server encoding is not
> UTF-8. This is known to be an issue for xpath() in particular." Your patch
> fixes a case where even a UTF8 database mishandles non-ASCII data.
> (Sometimes
> it mishandles the data silently. Other times, it raises an error.) It's
> worth further breaking a case where we already disclaim support to repair a
> supported case. Other opinions?
>

Isn't the most correct solution to call xml_parse function?

Regards

Pavel

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Steele 2017-08-18 22:46:12 Re: Update low-level backup documentation to match actual behavior
Previous Message Douglas Doole 2017-08-18 20:08:28 Re: [PATCH] Push limit to sort through a subquery