Re: XML with invalid chars

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: XML with invalid chars
Date: 2011-05-11 22:17:07
Message-ID: 4DCB0AE3.5020503@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 05/09/2011 11:25 PM, Noah Misch wrote:
>
> I see you've gone with doing it unconditionally. I'd lean toward testing the
> library in pg_xml_init and setting a flag indicating whether we need the extra
> pass. However, a later patch can always optimize that.
>

I wasn't terribly keen on the idea, but we can look at it again later.

>>
>> Please review and try to break.
> Here are the test cases I tried:
>
> -- caught successfully
> SELECT E'\x01'::xml;
> SELECT xmlcomment(E'\x01');
> SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
> SELECT xmlelement(name foo, NULL, E'\x01');
> SELECT xmlforest(E'\x01' AS foo);
> SELECT xmlpi(name foo, E'\x01');
> SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');
>
> -- not caught
> SELECT xmlroot('<root/>', version E'\x01');

That's an easy fix.

> SELECT xmlcomment(E'\ufffe');

That's a bit harder. Do we want to extend these checks to cover
surrogates and end of plane characters, which are the remaining
forbidden chars? It certainly seems likely to be a somewhat slower test
since I think we'd need to process the input strings a Unicode char at a
time, but we do that in other places and it seems acceptable. What do
people think?

> -- not directly related, but also wrongly accepted
> SELECT xmlroot('<root/>', version ' ');
> SELECT xmlroot('<root/>', version 'foo');
>
> Offhand, I don't find libxml2's handling of XML declarations particularly
> consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
> document) accepts '<?xml version="foo"?>' but rejects'<?xml version=" "?>'.
> Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
> accepts anything, even control characters. The XML 1.0 standard is stricter:
> the version must match ^1\.[0-9]+$. We might want to tighten this at the same
> time.

We can add some stuff to check the version strings. Doesn't seem
terribly difficult.

> libxml2's error message for this case is "PCDATA invalid Char value 1"
> (assuming \x01). Mentioning PCDATA seems redundant, since no other context
> offers greater freedom. How about:
>
> ereport(ERROR,
> (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> errmsg("invalid XML 1.0 Char \\U%08x", char_val)));
>
>

That would also mean processing the string a unicode char at a time. So
maybe that's what we need to do.

Thanks for the input.

cheers

andrew

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2011-05-11 22:21:23 Re: performance-test farm
Previous Message Alvaro Herrera 2011-05-11 22:05:27 Re: Prefered Types