From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Noah Misch <noah(at)leadboat(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: XML with invalid chars |
Date: | 2011-05-11 22:17:07 |
Message-ID: | 4DCB0AE3.5020503@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 05/09/2011 11:25 PM, Noah Misch wrote:
>
> I see you've gone with doing it unconditionally. I'd lean toward testing the
> library in pg_xml_init and setting a flag indicating whether we need the extra
> pass. However, a later patch can always optimize that.
>
I wasn't terribly keen on the idea, but we can look at it again later.
>>
>> Please review and try to break.
> Here are the test cases I tried:
>
> -- caught successfully
> SELECT E'\x01'::xml;
> SELECT xmlcomment(E'\x01');
> SELECT xmlelement(name foo, xmlattributes(E'\x01' AS bar), '');
> SELECT xmlelement(name foo, NULL, E'\x01');
> SELECT xmlforest(E'\x01' AS foo);
> SELECT xmlpi(name foo, E'\x01');
> SELECT query_to_xml($$SELECT E'\x01'$$, true, false, '');
>
> -- not caught
> SELECT xmlroot('<root/>', version E'\x01');
That's an easy fix.
> SELECT xmlcomment(E'\ufffe');
That's a bit harder. Do we want to extend these checks to cover
surrogates and end of plane characters, which are the remaining
forbidden chars? It certainly seems likely to be a somewhat slower test
since I think we'd need to process the input strings a Unicode char at a
time, but we do that in other places and it seems acceptable. What do
people think?
> -- not directly related, but also wrongly accepted
> SELECT xmlroot('<root/>', version ' ');
> SELECT xmlroot('<root/>', version 'foo');
>
> Offhand, I don't find libxml2's handling of XML declarations particularly
> consistent. My copy's xmlCtxtReadDoc() API (used by xml_in when xmloption =
> document) accepts '<?xml version="foo"?>' but rejects'<?xml version=" "?>'.
> Its xmlParseBalancedChunkMemory() API (used by xml_in when xmloption = content)
> accepts anything, even control characters. The XML 1.0 standard is stricter:
> the version must match ^1\.[0-9]+$. We might want to tighten this at the same
> time.
We can add some stuff to check the version strings. Doesn't seem
terribly difficult.
> libxml2's error message for this case is "PCDATA invalid Char value 1"
> (assuming \x01). Mentioning PCDATA seems redundant, since no other context
> offers greater freedom. How about:
>
> ereport(ERROR,
> (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
> errmsg("invalid XML 1.0 Char \\U%08x", char_val)));
>
>
That would also mean processing the string a unicode char at a time. So
maybe that's what we need to do.
Thanks for the input.
cheers
andrew
From | Date | Subject | |
---|---|---|---|
Next Message | Kevin Grittner | 2011-05-11 22:21:23 | Re: performance-test farm |
Previous Message | Alvaro Herrera | 2011-05-11 22:05:27 | Re: Prefered Types |