Re: Search for fixed set of keywords

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Jörg Kiegeland <kiegeland(at)ikv(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Search for fixed set of keywords
Date: 2008-01-09 10:55:06
Message-ID: Pine.LNX.4.64.0801091351440.26876@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Did you try integer arrays with GIN (inverted index) ?

Oleg
On Wed, 9 Jan 2008, J?rg Kiegeland wrote:

> Hello,
>
> I have an interesting generic search task, for which I have done different
> performance tests and I would like to share and discuss my results on this
> newsgroup.
>
> So I begin to describe the search task:
>
> =========
> You have a set of N unique IDs. Every ID is associated with an integer
> scoring value. Every ID is also associated with up to K different keywords
> (there are totally K different keywords K1 ... Kn). Now find the first Z
> best-scored IDs which are associated with a given set of keywords in one of
> two ways:
>
> (C1) The ID must be associated with all keywords of the given set of
> keywords.
> (C2) The ID must be associated with at least one keyword of the given set of
> keywords.
> =========
>
>
> My tests showed that only a Multiple-Column-approach resulted in a acceptable
> query response time. I also tried out an int-array approach using gist, a
> sub-string approach, a bit-column approach, and even a sub-string approach
> using Solr.
> Actually, the int-array approach was 20% faster for Z=infinity, but it became
> linear for the test case [Z=1000 and *all* IDs matches the search condition].
> (To be not misunderstood, "acceptable time" means: having a fixed Z, a fixed
> set of keywords K, a fixed query, and an increasing N, results in constant up
> to logarithmic response time; linear or worser-than-linear time is not
> accepted)
>
> In the Multiple-Column-approach, there is one table. The table has a boolean
> column for each keyword. It has also a column for the ID and for the scoring.
> Now, for each keyword column and for the scoring column a separate index is
> created.
> C1 is implemented by an AND-query on the keyword columns, C2 by and OR query,
> and the result is sorted for the scoring column, cutting of after the first Z
> results.
>
> However our requirements for the search task have changed and I not yet
> managed to find a search approach with acceptable response time for following
> variation:
> Namely that one uses C2 and do not sort for a scoring column but use as
> scoring value the number of matched keywords for a given ID.
> The difficulty in this query type is that the scoring is dependent on the
> query itself..
>
> So has anyone an idea how to solve this query type with acceptable response
> time, or can anybody tell/prove, that this is theoretically not possible?
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Adrian Moisey 2008-01-09 12:53:09 Re: big database performance
Previous Message Jörg Kiegeland 2008-01-09 10:43:42 Search for fixed set of keywords