From: | daniel <dochtorek(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: ts_headline and query with hyphen |
Date: | 2012-12-05 04:42:26 |
Message-ID: | 50BED0B2.5070204@gmail.com |
Lists: | pgsql-general |
On 12/05/2012 04:49 AM, Tom Lane wrote:
> daniel <dochtorek(at)gmail(dot)com> writes:
>> I have a question about ts_headline, when the query includes word like
>> 'on-line' - only the 'line' part is highlighted, even though the whole
>> phrase is indexed too, some details below.
>
> Part of the reason is that "on" is a stop word (at least in the default
> english dictionary). That's why you get
>
>> select to_tsquery('play & on-line');
>> to_tsquery
>> ----------------------------
>> 'play' & 'on-lin' & 'line'
>
> and not "'play' & 'on-lin' & 'on' & 'line'". If you did get the latter
> then you'd get a headline result with both parts highlighted, similar to
> your "custom-built" case.
>
I understand the 'on' part, but 'on-lin' is still passed to
ts_headline, so I thought that match would be preferred over 'line' and
the compound would be highlighted as a whole.
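For anyone who wants to reproduce this, here is a minimal sketch. It
assumes the default 'english' configuration, and the document text is an
invented sample, not my real data. The first query shows which tokens
the parser emits for the compound and which lexemes each one maps to;
the second shows which part ts_headline highlights.

```sql
-- List the tokens produced for the hyphenated word and the lexemes
-- the 'english' configuration assigns to each of them.
SELECT alias, token, lexemes
FROM ts_debug('english', 'on-line');

-- Build a headline for an invented sample document using the query
-- from this thread, with the default headline options.
SELECT ts_headline('english',
                   'You can play the game on-line with friends',
                   to_tsquery('english', 'play & on-line'));
```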
Additionally, with a specific value of MaxWords I could see a dangling
"line" at the start of a headline ("on-" had been cut off), which is
kinda troubling, because it's not even an English document. It doesn't
seem to happen with queries like 'custom-built' - I can't see it being
split either at the beginning of a headline or at the end.
Just to be clear - the headline with the cut-off "on-" is OK (it has the
matched stuff somewhere in the middle, though with only 'line'
highlighted), it's just that the word 'on-line' is used multiple times in
the doc and it happened to appear at the beginning of a headline. The
cutting was not affected by the ShortWord setting, so I guess it's a
stop word thing again. If that's the case, then IMHO ts_headline should
treat a hyphenated word as a single word when creating the headline and
not cut it off like that. But maybe it was intended to work like that.
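A sketch of the kind of call I mean, in case someone wants to experiment
with the options. The text and the option values are arbitrary examples
picked for illustration; I am not claiming these exact values reproduce
the cut, only that MaxWords, MinWords and ShortWord are the knobs
involved.

```sql
-- Limit the headline length so the selected fragment has to start or
-- end somewhere inside the text; vary MaxWords to move the boundary.
SELECT ts_headline('english',
                   'Some people play on-line every day, while others play the same game off-line at home',
                   to_tsquery('english', 'play & on-line'),
                   'MaxWords=6, MinWords=3, ShortWord=0');
```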
>> But maybe ts_headline understands or operates on
>> single, not hyphenated words only?
>
> Dunno. It would seem reasonable to highlight the whole compound in
> these cases, but I have no idea how hard that is.
>
Right, although that latter case is easy to fix outside postgres and
still looks fine - I've included it just as an example. The former
causes a few problems in specific cases, and I have to fix them manually
now, word by word.
> Another thing that seems a bit odd here is that we seem to be stemming
> the compound word as a whole, but not the individual parts. Not sure
> how sane that combination of choices is ...
>
Good question, hope others will jump in.
thanks,
daniel