Re: BUG #17691: Unexpected behaviour using ts_headline()

From: sebastian(dot)patino-lang(at)posteo(dot)net
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17691: Unexpected behaviour using ts_headline()
Date: 2022-11-22 06:34:21
Message-ID: 6702b4c1-457f-4e34-a844-bb9cfe4c7e6a@Spark
Lists: pgsql-bugs

Hi Tom,

thanks for the in-depth explanation. It makes a bit more sense now. I'm not sure what to do with the attached patch, though. Should I use it to compile my own version of PG? Since I use a managed instance, that's currently not an option for me.

Regards

Sebastian
On 20 Nov 2022, 20:30 +0100, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> > I experience unexpected behaviour when using ts_headline() in general, but
> > especially when changing MaxFragments. Given the data in
> > ts_headline_report.sql [1]
>
> Thanks for sending the sample data. After poking at this a bit, it
> seems the question boils down to what ts_headline should do when there
> simply aren't any short matches to the query. Part of the issue there
> is what qualifies as "short". The definition I installed in 78e73e875
> was
>
> + /*
> +  * We might eventually make max_cover a user-settable parameter, but for
> +  * now, just compute a reasonable value based on max_words and
> +  * max_fragments.
> +  */
> + max_cover = Max(max_words * 10, 100);
> + if (max_fragments > 0)
> +     max_cover *= max_fragments;
>
> Looking at this again, I think the dependency on max_fragments is just
> wrong. That is what is causing your different results for different
> settings of MaxFragments: larger values allow recognition of longer
> candidate covers ("cover" being jargon in this code for "a substring
> satisfying the query"). But there's no good reason for that. I'd misread
> the logic in mark_hl_fragments, which "breaks the cover into smaller
> fragments such that each fragment has at most max_words", to think that
> it needed to see longer covers that it could divide into fragments.
> But actually max_fragments limits the number of fragments it will return
> across all covers, and there doesn't seem to be a reason to connect that
> parameter to the acceptable length of any particular cover. So dropping
> those two lines is enough to make the strange behavior across differing
> MaxFragments go away.
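>
> In other words, with those two lines dropped, the computation reduces
> to just
>
> max_cover = Max(max_words * 10, 100);
>
> with no dependence on max_fragments at all.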
>
> The other issue is what to do after finding there's no match of max_cover
> or fewer tokens. As the code now stands, it'll just give up and return
> the initial bit of the text, likely with no query words. That's not
> great. I propose that we look for any match and return (highlight) just
> its first word.
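>
> As a standalone sketch of that fallback (invented names here, not the
> actual wparser_def.c code), the selection logic amounts to:
>
> typedef struct { int start; int len; } Cover; /* one query match span */
>
> static int
> choose_highlight_start(const Cover *covers, int ncovers, int max_cover)
> {
>     /* prefer the first cover that is acceptably short */
>     for (int i = 0; i < ncovers; i++)
>         if (covers[i].len <= max_cover)
>             return covers[i].start;
>
>     /* fallback: take any match at all, highlighting only the token at
>      * its start position (covers[] assumed ordered by start) */
>     if (ncovers > 0)
>         return covers[0].start;
>
>     return -1; /* the query never matches anywhere */
> }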
>
> > SELECT id,
> >        ts_headline('english', "texts"."fulltext",
> >                    to_tsquery('english', 'amazon & world'),
> >                    'StartSel=<<, StopSel=>>') AS "preview"
> > FROM texts;
>
> > id=1: Highlight word is the first one in the result. Expectation: highlight
> > word is somewhere in the middle.
>
> Meh --- I don't agree with that expectation. The fragment-based code does
> act that way, but the non-fragment path never has, so I think we'd make
> more people unhappy than happy if we change its behavior.
>
> > id=2: No highlight word at all.
>
> Right, because there are no short matches to "amazon & world" anywhere.
> With the attached proposed patch, we'll highlight whichever one of those
> words appears first.
>
> > id=3: Highlight words are the first and last one in the result. Not ideal
> > but ok-ish.
>
> As I said, that's how non-fragment highlighting has always worked. It
> will extend the quoted text, but not beyond min_words, which is a bit
> different from fragment highlighting.
>
> > SELECT id,
> >        ts_headline('english', "texts"."fulltext",
> >                    to_tsquery('english', 'amazon & world'),
> >                    'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
> > FROM texts;
>
> > id=1: The wrong number of fragments (2) with highlight words is returned.
>
> I don't agree with this expectation either. There's only one reasonably
> sane match to 'amazon & world' in that text, and it's near the beginning.
> As the code stands, increasing max_fragments eventually allows it to find
> some less-sane matches, but that doesn't seem like an improvement.
>
>
> In short, I suggest something about like the attached. For your id=2
> example, which has only very far-apart occurrences of 'amazon' and
> 'world', this produces
>
> <<world>> where fast runtimes make or break the popularity of research
> fields, this oversight has effectively
>
> without fragmenting, and
>
> researchers make the effort of understanding the inner workings of
> these convenient black-boxes. In a <<world>> where fast runtimes make
> or break the popularity of research fields, this oversight has
> effectively surrendered most
>
> with fragmenting, which is okay by me given their historically different
> behaviors. Otherwise, it gets rid of the strange changes in matching
> behavior for different MaxFragments values.
>
> regards, tom lane
>
