From: | Marcelo Zabani <mzabani(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Html parsing and inline elements |
Date: | 2016-04-13 15:57:19 |
Message-ID: | CACgY3QavK=P8G-KD6ZRR+M6+y25h+LjicQqp9HYfOiu22GdAFg@mail.gmail.com |
Lists: | pgsql-hackers |
Hi, Tom,
You're right, I don't think one can argue that the default parser should
know HTML.
How about your suggestion of a separate HTML parser: is it feasible? I
ask because I think a lot of people store HTML documents these
days, and although there probably isn't much HTML with words split
across multiple inline elements, it would certainly be nice to have a proper
parser for these use cases.
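As a rough illustration of the intended behavior (a sketch in Python that runs outside the database, not a proposal for how a PostgreSQL text search parser would be implemented), tags from a small, assumed list of inline elements are dropped without adding a separator, while every other tag is treated as a word boundary. The names INLINE_TAGS, InlineAwareTextExtractor and html_to_text are made up for this example.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of inline elements, chosen only for illustration.
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "small", "span",
               "strong", "sub", "sup", "u"}


class InlineAwareTextExtractor(HTMLParser):
    """Collects text, inserting a space only at non-inline tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _tag_boundary(self, tag):
        # Inline tags add nothing, so the text around them stays joined.
        if tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._tag_boundary(tag)

    def handle_endtag(self, tag):
        self._tag_boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of `html` with runs of whitespace collapsed."""
    parser = InlineAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(html_to_text(
    "Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>"
))
# prints: Hello neighbor , you are nice
```

The resulting plain text could then be passed to to_tsvector, so that "nice" reaches the default parser as a single word while "Hello" and "neighbor" stay separate.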
What do you think?
On Wed, Apr 13, 2016 at 11:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Marcelo Zabani <mzabani(at)gmail(dot)com> writes:
> > I was here wondering whether HTML parsing should separate tokens that are
> > not separated by spaces in the original text, but are separated by an
> > inline element. Let me show you an example:
>
> > SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are
> > <strong>n</strong>i<em>ce</em>')
> > Results: "'ce':7 'hello':1 'n':5 'neighbor':2"
>
> > "Hello" and "neighbor" should really be separated, because <p> is a block
> > element, but "nice" should be a single word there, since there is no visual
> > separation when rendered (<em> and <strong> are inline elements).
>
> I can't imagine that we want to_tsvector to know that much about HTML.
> It doesn't, really, even have license to assume that its input *is*
> HTML. So even if you see things that look like <foo> and </foo> in the
> string, it could easily be XML or SGML or some other SGML-like markup
> format with different semantics for the markup keywords.
>
> Perhaps it'd be sane to do something like this as long as the
> HTML-specific behavior was broken out into a separate function.
> (Or maybe it could be done within to_tsvector as a separate parser
> or separate dictionary?) But I don't think it should be part of
> the default behavior.
>
> regards, tom lane
>