From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | obartunov(at)gmail(dot)com |
Cc: | Jean-Pierre Pelletier <jppelletier(at)e-djuster(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Subject: | Re: Should phraseto_tsquery('simple', 'blue blue') @@ to_tsvector('simple', 'blue') be true ? |
Date: | 2016-06-08 21:44:11 |
Message-ID: | 11252.1465422251@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Oleg Bartunov <obartunov(at)gmail(dot)com> writes:
> On Wed, Jun 8, 2016 at 1:05 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I concur that that seems like a rather useless behavior. If we have
>> "x <-> y" it is not possible to match at distance zero, while if we
>> have "x <-> x" it seems unlikely that the user is expecting us to
>> treat that identically to "x". So phrase search simply should not
>> consider distance-zero matches.
> what about a word with several infinitives
> select to_tsvector('en', 'leavings');
> to_tsvector
> ------------------------
> 'leave':1 'leavings':1
> (1 row)
> select to_tsvector('en', 'leavings') @@ 'leave <0> leavings'::tsquery;
> ?column?
> ----------
> t
> (1 row)
Hmm. I can grant that there might be some cases where you want to see
if two separate patterns match the same lexeme, but that seems like an
extremely specialized use-case that you would only invoke very
intentionally. It should not be built in as part of the default behavior
of every phrase search, because 99% of the time this would be an
unexpected and unwanted match. I'm not even convinced that the operator
for this should be spelled <0> --- that seems more like a hack than a
natural extension of phrase search. But if we do spell it like that,
then I think it should be called out as a special case that only applies
to <0>; that is, for any other value of N, the match has to be to separate
lexemes.
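
A minimal sketch of that special-casing, written in Python rather than taken from the PostgreSQL source. It assumes a toy model in which a tsvector is a dict from lexeme to a list of positions; the name `distance_match` is hypothetical. Only `<0>` may match two patterns at the same position, and any other N requires the right lexeme to sit at a strictly later position.

```python
# Toy model (an assumption, not PostgreSQL internals):
# a tsvector is a dict mapping lexeme -> list of positions.

def distance_match(vec, left, right, n):
    """Return True if `left <n> right` matches under the special-casing
    discussed above.

    n == 0: both lexemes must occur at the same position.
    n >= 1: `right` must follow `left` by 1..n positions, so the two
            patterns can never match the same position.
    """
    left_pos = vec.get(left, [])
    right_pos = vec.get(right, [])
    if n == 0:
        return bool(set(left_pos) & set(right_pos))
    return any(0 < r - l <= n for l in left_pos for r in right_pos)


# Mirrors to_tsvector('en', 'leavings') from the quoted example.
leavings = {"leave": [1], "leavings": [1]}
# Mirrors to_tsvector('simple', 'blue').
blue = {"blue": [1]}

print(distance_match(leavings, "leave", "leavings", 0))  # same position
print(distance_match(leavings, "leave", "leavings", 1))  # separate lexemes required
print(distance_match(blue, "blue", "blue", 1))           # 'blue <-> blue' vs. one 'blue'
```

Under this rule the `leave <0> leavings` query still matches, while `blue <-> blue` no longer matches a document containing a single "blue".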
This brings up something else that I am not very sold on: to wit,
do we really want the "less than or equal" distance behavior at all?
The documentation gives the example that
    phraseto_tsquery('cat ate some rats')
produces
    ( 'cat' <-> 'ate' ) <2> 'rat'
because "some" is a stopword. However, that pattern will also match
"cat ate rats", which seems surprising and unexpected to me; certainly
it would surprise a user who did not realize that "some" is a stopword.
So I think there's a reasonable case for decreeing that <N> should only
match lexemes *exactly* N apart. If we did that, we would no longer have
the misbehavior that Jean-Pierre is complaining about, and we'd not need
to argue about whether <0> needs to be treated specially.
Or maybe we need two operators, one for exactly-N-apart and one for
at-most-N-apart.
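
To make the two candidate semantics concrete, here is a small Python sketch; it is a model of the idea, not PostgreSQL's implementation. The dict-of-positions representation and the names `at_most` and `exactly` are assumptions made for illustration, and only the `'ate' <2> 'rat'` part of the query is modeled.

```python
# Toy model (an assumption): a tsvector is a dict mapping
# lexeme -> list of positions. Stopwords are dropped but still
# consume a position, as in to_tsvector.

def at_most(vec, left, right, n):
    """`left <n> right` matches if the distance is anywhere in 0..n."""
    return any(0 <= r - l <= n
               for l in vec.get(left, []) for r in vec.get(right, []))


def exactly(vec, left, right, n):
    """`left <n> right` matches only if the distance is exactly n."""
    return any(r - l == n
               for l in vec.get(left, []) for r in vec.get(right, []))


with_stopword = {"cat": [1], "ate": [2], "rat": [4]}   # 'cat ate some rats'
without_stopword = {"cat": [1], "ate": [2], "rat": [3]}  # 'cat ate rats'
blue = {"blue": [1]}                                    # 'blue'

for name, match in (("at-most-N", at_most), ("exactly-N", exactly)):
    print(name,
          match(with_stopword, "ate", "rat", 2),
          match(without_stopword, "ate", "rat", 2),
          match(blue, "blue", "blue", 1))
```

In this model the at-most-N rule accepts all three cases, including "cat ate rats" and the single "blue", while the exactly-N rule accepts only the document that actually contains the stopword gap.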
regards, tom lane