Re: tsearch2: plainto_tsquery() with OR?

From: "Mike Rylander" <mrylander(at)gmail(dot)com>
To: cluster <skrald(at)amossen(dot)dk>
Cc: "Oleg Bartunov" <oleg(at)sai(dot)msu(dot)su>, tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-general(at)postgresql(dot)org
Subject: Re: tsearch2: plainto_tsquery() with OR?
Date: 2007-08-20 01:05:17
Message-ID: b918cf3d0708191805t7a7b96d7n6d81fd63c948d37d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 8/9/07, cluster <skrald(at)amossen(dot)dk> wrote:
> Thanks for your response! Let me try to elaborate what I meant with my
> original post.
>
> If R is the set of words in the tsvector for a given table row and S is
> the set of keywords to search for (entered by e.g. a website user) I
> would like to receive all rows for which the intersection between R and
> S is nonempty. That is: The row should be return if just there is SOME
> match. S does not necessarily need to be a subset of R.
>

You could just wrap up a simple query in an SQL function called
plainto_or_tsquery or the like.

CREATE OR REPLACE FUNCTION plainto_or_tsquery (TEXT) RETURNS tsquery AS $$
SELECT to_tsquery( regexp_replace( $1, E'[\\s\'|:&()!]+','|','g') );
$$ LANGUAGE SQL STRICT IMMUTABLE;

Paste this into a PG database that has tsearch2 loaded (after creating
the above function, of course):

select
rank_cd(to_tsvector('hi . there web 2.0'), plainto_or_tsquery('hello
. web 2.0')),
to_tsvector('hi . there web 2.0') @@ plainto_or_tsquery('hello . web
2.0') as matches;

> Furthermore I would like a measure for how "nonempty" the intersection
> is (we would call this measure "the rank").
> Example:
> For R = "three big houses" and S = "three small houses" the rank should
> be higher than for R = "three big houses" and S = "four small houses" as
> the first case has two words in common while the second case has only one.

Both rank() and rank_cd() work fine for all-ORed queries, full match
or otherwise. The more "matchy", the better the rank.

>
> A version of plainto_tsquery() with a simple OR operator instead of AND
> would solve this problem somewhat elegant:
> 1) I can now use the conventional "tsvector @@ tsquery" syntax in my
> WHERE clause as the "@@" operator will return true and thus include the
> row in the result. Example:
> select to_tsvector('simple', 'three small houses')
> @@ 'four|big|houses'::tsquery;
> would return "true".
>

Um... it does.

forge=# select to_tsvector('simple', 'three small houses') @@
'four|big|houses'::tsquery;
?column?
----------
t
(1 row)

> 2) The rank() of the @@ operator is automatically higher when there is a
> good match.
>

Again, that's already the case.

forge=# select rank(to_tsvector('hi . there web 2.0'),
plainto_or_tsquery('hello . web 2.0')), rank(to_tsvector('hi . there
web 2.0'), plainto_or_tsquery('hi . web 2.0'));
rank | rank
-----------+-----------
0.0405285 | 0.0607927
(1 row)

The second is a better match; "hi" vs "hello" in the queries.

>
> An example where this OR-version of plainto_tsquery() could be useful is
> for websites using tags. Each website entry is associated with some tags
> and each user has defined some "tags of interest". The search should
> then return all website entries where there is a match (not necessarily
> complete) with the users tags of interest. Of course the best matching
> entries should be displayed top most.
>

See above. Though, again, you'd need to put in a little work to make
sure everything is completely protected. Probably less time than it's
take you to discuss this so far, though. And you'd want to create a
2-param version that could accept the correct tsearch2 config.

>
> I find it important that this function is a part of tsearch2 itself as:
> 1) The user can input arbitrary data. Also potentially harmful data if
> they are not escaped right.

That's not tsearch2's problem in particular. You should be using
parameterized queries in your app (or applying the correct quoting
functions) for all data, not just directly user supplied strings. All
data is user supplied at some level.

> 2) Special characters should be stripped in just the same way as
> to_tsvector() does it. E.g. stripping the dot in "Hi . there" but
> keeping it in "web 2.0". Only tsearch2 can do that in a clean consistent
> way - it would be fairly messy if some thirdparty or especially some
> website-developer-homecooked stripping functionality is used for this.
>

The simple example above uses to_tsquery to do that. Try it out, and
if you improve it please feel free to share.

On a more general note, IMO this should not be getting in the way of
integrating tsearch2 into core. The example above shows how trivial
it is to do the "simple" thing, but that's probably not going to be
the "right" thing. Any time you find yourself forcing your user to is
"x OR y OR z" to get a result is a time you should probably be
augmenting your tsvectors (or tsquerys now, with query rewriting) with
thesauri.

BTW, Google /does/ just AND everything. They just don't tell you
about everything they're ANDing.

All this is, of course, disregarding the actual utility of an all-OR
query (not much in practice, IME) and the speed of such queries on
non-trivial datasets (not good -- essentially a scan per ORed
component)).

Anyway, I hope that helps...

--miker

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2007-08-20 03:30:24 Re: Searching for Duplicates and Hosed the System
Previous Message Gregory Stark 2007-08-19 22:08:56 Re: POSTGRE CRASH AND CURRVAL PROBLEM HELP!