Re: Fix XML handling with DOCTYPE

From: Ryan Lambert <ryan(at)rustprooflabs(dot)com>
To: Chapman Flack <chap(at)anastigmatix(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fix XML handling with DOCTYPE
Date: 2019-03-16 22:43:43
Message-ID: CAN-V+g884QQLJu+guDArhmNMejgb7e5f6b7i1mfTRgHdQFzSQQ@mail.gmail.com
Lists: pgsql-hackers

Thank you both! I had glanced at that item in the commitfest but didn't
notice it would fix this issue.
I'll try to test/review this before the end of the month; that is much
better than starting from scratch myself. At a quick glance, the patch
looks logical and should work for my use case.

Thanks,

Ryan Lambert

On Sat, Mar 16, 2019 at 4:33 PM Chapman Flack <chap(at)anastigmatix(dot)net> wrote:

> On 03/16/19 17:21, Tom Lane wrote:
> > Chapman Flack <chap(at)anastigmatix(dot)net> writes:
> >> On 03/16/19 16:55, Tom Lane wrote:
> >>> What do you think of the idea I just posted about parsing off the
> >>> DOCTYPE thing for ourselves, and not letting libxml see it?
> >
> >> The principled way of doing that would be to pre-parse to find a DOCTYPE,
> >> and if there is one, leave it there and parse the input as we do for
> >> 'document'. Per XML, if there is a DOCTYPE, the document must satisfy
> >> the 'document' syntax requirements, and per SQL/XML:2006-and-later,
> >> 'content' is a proper superset of 'document', so if we were asked for
> >> 'content' and can successfully parse it as 'document', we're good,
> >> and if we see a DOCTYPE and yet it incurs a parse error as 'document',
> >> well, that's what needed to happen.
> >
> > Hm, so, maybe just
> >
> > (1) always try to parse as document. If successful, we're done.
> >
> > (2) otherwise, if allowed by xmloption, try to parse using our
> > current logic for the CONTENT case.
>
> What I don't like about that is that (a) the input could be
> arbitrarily long and complex to parse (not that you can't imagine
> a database populated with lots of short little XML snippets, but
> at the same time, a query could quite plausibly deal in yooge ones),
> and (b), step (1) could fail at the last byte of the input, followed
> by total reparsing as (2).
>
> I think the safer structure is clearly that of the current patch,
> modulo whether the "has a DOCTYPE" test is done by libxml itself
> (with the assumptions you don't like) or by a pre-scan.
>
> So the current structure is:
>
> restart:
>   asked for document?
>     parse as document, or fail
>   else asked for content:
>     parse as content
>     failed?
>       because DOCTYPE? restart as if document
>       else fail
>
> and a pre-scan structure could be very similar:
>
> restart:
>   asked for document?
>     parse as document, or fail
>   else asked for content:
>     pre-scan finds DOCTYPE?
>       restart as if document
>     else parse as content, or fail
>
> The pre-scan is a simple linear search and will ordinarily say yes or no
> within a couple dozen characters--you could *have* an input with 20k of
> leading whitespace and comments, but it's hardly the norm. Just trying to
> parse as 'document' first could easily parse a large fraction of the input
> before discovering it's followed by something that can't follow a document
> element.
>
> Regards,
> -Chap
>
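
For illustration only, the pre-scan structure Chap outlines above could be
sketched roughly as follows. This is not the code from the patch and does not
use libxml; the names has_doctype and choose_parse_mode are hypothetical, and
the sketch assumes the only things that may precede a DOCTYPE are whitespace,
an XML declaration or other processing instructions, and comments.

```python
XML_WHITESPACE = " \t\r\n"


def has_doctype(text: str) -> bool:
    """Hypothetical linear pre-scan: report whether a DOCTYPE declaration
    appears before any element or character data."""
    i, n = 0, len(text)
    while i < n:
        if text[i] in XML_WHITESPACE:
            i += 1
        elif text.startswith("<!--", i):
            # Skip a comment; an unterminated one cannot hide a DOCTYPE.
            end = text.find("-->", i + 4)
            if end < 0:
                return False
            i = end + 3
        elif text.startswith("<?", i):
            # Skip the XML declaration or any processing instruction.
            end = text.find("?>", i + 2)
            if end < 0:
                return False
            i = end + 2
        else:
            break
    # The DOCTYPE keyword is case-sensitive in XML.
    return text.startswith("<!DOCTYPE", i)


def choose_parse_mode(text: str, xmloption: str) -> str:
    """Hypothetical dispatcher mirroring the pre-scan structure:
    asked for document -> document; asked for content -> document only
    when the pre-scan finds a DOCTYPE, otherwise content."""
    if xmloption == "document":
        return "document"
    return "document" if has_doctype(text) else "content"
```

The scan normally stops within the first few characters, at the first thing
that is not whitespace, a comment, or a processing instruction, so the cost of
deciding stays small even when the input itself is very large.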
