From: | Ryan Lambert <ryan(at)rustprooflabs(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Fix XML handling with DOCTYPE |
Date: | 2019-03-16 20:10:56 |
Message-ID: | CAN-V+g-6JqUQEQZ55Q3toXEN6d5Ez5uvzL4VR+8KtvJKj31taw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi all,
I'm investigating the issue I reported here:
https://www.postgresql.org/message-id/flat/153478795159.1302.9617586466368699403%40wrigleys.postgresql.org
As Tom Lane mentioned there, the docs (8.13) indicate xmloption = CONTENT
should accept all valid XML. At this time, XML with a DOCTYPE declaration
is not accepted with this setting even though it is considered valid XML.
I'd like to work on a patch to address this issue and make it work as
advertised.
I traced the source of the error to line ~1500 in
/src/backend/utils/adt/xml.c
res_code = xmlParseBalancedChunkMemory(doc, NULL, NULL, 0, utf8string +
count, NULL);
It looks like it is xmlParseBalancedChunkMemory from libxml that doesn't
work when there's a DOCTYPE in the XML data. My assumption is the DOCTYPE
element makes the XML not well-balanced. From:
http://xmlsoft.org/html/libxml-parser.html#xmlParseBalancedChunkMemory
This function returns:
> 0 if the chunk is well balanced, -1 in case of args problem and the parser
> error code otherwise
I see xmlParseBalancedChunkMemoryRecover that might provide the
functionality needed. That function returns:
0 if the chunk is well balanced, -1 in case of args problem and the parser
> error code otherwise In case recover is set to 1, the nodelist will not be
> empty even if the parsed chunk is not well balanced, assuming the parsing
> succeeded to some extent.
I haven't tested yet to see if this parses the data w/ DOCTYPE successfully
yet. If it does, I don't think it would be difficult to update the check
on res_code to not fail. I'm making another assumption that there is a
distinct code from libxml to differentiate from other errors, but I
couldn't find those codes quickly. The current check is this:
if (res_code != 0 || xmlerrcxt->err_occurred)
Does this sound reasonable? Have I missed some major aspect? If this is
on the right track I can work on creating a patch to move this forward.
Thanks,
*Ryan Lambert*
RustProof Labs
www.rustprooflabs.com
From | Date | Subject | |
---|---|---|---|
Next Message | Chapman Flack | 2019-03-16 20:31:18 | Re: Fix XML handling with DOCTYPE |
Previous Message | Heikki Linnakangas | 2019-03-16 20:05:08 | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons |