From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Ivan Sergio Borgonovo <mail(at)webthatworks(dot)it> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: FTS uses "tsquery" directly in the query |
Date: | 2010-01-25 20:35:12 |
Message-ID: | Pine.LNX.4.64.1001252334290.16860@sn.sai.msu.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Do you guys wanted something like:
arxiv=# select and2or(to_tsquery('1 & 2 & 3'));
and2or
---------------------
( '1' | '2' ) | '3'
(1 row)
Oleg
On Mon, 25 Jan 2010, Ivan Sergio Borgonovo wrote:
> On Mon, 25 Jan 2010 07:19:59 -0800 (PST)
> xu fei <autofei(at)yahoo(dot)com> wrote:
>
> > Hi, Oleg Bartunov:
> > First thanks for your quick replay.=C2=A0Could you explain it a little
> > more on "it's general limitation/feature"? I just confuse that
> > to_tsquery('item') function will return a tsquery type which is
> > same as 'item'::tsquery, to my understanding. Let me explain what
> > I want:First Step: extract top K tokensI have a table with a
> > column as tsvector type. Some records in this column are too big,
> > which contain hundreds tokens. I just want the top K tokens based
> > on the frequency, for example top 5. I am not sure there is a
> > direct way to get such kind top K tokens. I just read them out in
> > Java and count frequency for each token and sort them. Second
> > Step: generate queryNow I will use these tokens to construct a
> > query to search other vectors in the same table. I can not
> > directly use to_tsquery() due to two reasons: 1) The default logic
> > operator in to_tsquery() is "&" but what I need it "|". 2) Since
> > the tokens are from tsvector, they are already normalized. If I
> > use to_tsquery() again, they will be normalized again! For
> > example, =E2=80=9Ccourse=E2=80=9D -> =E2=80=9Ccours=E2=80=9D -> =E2=80=9C=
> cour=E2=80=9D. So I just concatenate the
> > top K tokens with =E2=80=9C|=E2=80=9D and directly use "::tsquery ".
> > Unfortunately, as you say "it's general limitation/feature=E2=80=9D, I can
> > not do that. I checked your manual =E2=80=9CFull-Text Search
> > in=C2=A0PostgreSQL=C2=A0A Gentle Introduction=E2=80=9D, but could not fig=
> ure out
> > how. So is it possible to implement what I want in FTS? If so,
> > how? Thank! Xu --- On Sun, 1/24/10, Oleg Bartunov
>
> You're trying to solve a similar problems than mine.
> I'd like to build up a n=C3=A4ive similar text search.
> I don't have the "length" problem still I'd like to avoid to
> tokenize/lexize a text twice to build up a tsquery.
> I've weighted tsvectors stored in a column and once I pick up one
> I'd like to look for similar ones in the same column.
>
> There are thousands way to measure text similarity (and Oleg pointed
> me to some), still ts_rank should be "good enough for me".
>
> I've very short text so I can't use & on the whole tsvector
> otherwise there will be very high chances to find just one match.
>
> As you suggested I could pick up a subset of "important"[1] lexemes
> in the tsvector and build up an "&"ed tsquery with them.
>
> Still at least in my case, since I'm dealing with very short texts,
> this still looks too risky (just 1 match). Considering that I'm
> using weighted tsvectors it seems that "|" and picking up the ones
> with the best rank could be a way to go.
>
> But as you've noted there is no function that turns a tsvector in a
> tsquery (including weight possibly) and give you the choice to use
> "|".
>
> Well... I'm trying to write a couple of helper functions in C.
> But I'm pretty new to postgres internals and well I miss a reference
> of functions/macro with some examples... and this is a side project
> and I haven't been using C for quite a while.
>
> Once I'll have that function I'll have to solve how to return few
> rows (since I'll have to use | I expect a lot of returned rows) to
> make efficient use of the gin index and avoid to compute ts_rank for
> too many rows.
>
> Don't hold your breath waiting... but let me know if you're
> interested so I don't have to be the only one posting newbies
> questions on pgsql-hackers ;)
>
> [1] ts_stat could give you some hints about what lexemes may be
> important... but well deciding what's important is another can of
> worms... and as anticipated ts_rank should be "good enough for me".
>
> --=20
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
>
>
> --=20
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
From | Date | Subject | |
---|---|---|---|
Next Message | Ivan Sergio Borgonovo | 2010-01-25 21:33:27 | Re: FTS uses "tsquery" directly in the query |
Previous Message | Richard Broersma | 2010-01-25 18:53:55 | Scale 8X Exhibition Booth Attendants |