From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Jörg Kiegeland <kiegeland(at)ikv(dot)de> |
Cc: | pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Search for fixed set of keywords |
Date: | 2008-01-09 10:55:06 |
Message-ID: | Pine.LNX.4.64.0801091351440.26876@sn.sai.msu.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Did you try integer arrays with GIN (inverted index) ?
Oleg
On Wed, 9 Jan 2008, J?rg Kiegeland wrote:
> Hello,
>
> I have an interesting generic search task, for which I have done different
> performance tests and I would like to share and discuss my results on this
> newsgroup.
>
> So I begin to describe the search task:
>
> =========
> You have a set of N unique IDs. Every ID is associated with an integer
> scoring value. Every ID is also associated with up to K different keywords
> (there are totally K different keywords K1 ... Kn). Now find the first Z
> best-scored IDs which are associated with a given set of keywords in one of
> two ways:
>
> (C1) The ID must be associated with all keywords of the given set of
> keywords.
> (C2) The ID must be associated with at least one keyword of the given set of
> keywords.
> =========
>
>
> My tests showed that only a Multiple-Column-approach resulted in a acceptable
> query response time. I also tried out an int-array approach using gist, a
> sub-string approach, a bit-column approach, and even a sub-string approach
> using Solr.
> Actually, the int-array approach was 20% faster for Z=infinity, but it became
> linear for the test case [Z=1000 and *all* IDs matches the search condition].
> (To be not misunderstood, "acceptable time" means: having a fixed Z, a fixed
> set of keywords K, a fixed query, and an increasing N, results in constant up
> to logarithmic response time; linear or worser-than-linear time is not
> accepted)
>
> In the Multiple-Column-approach, there is one table. The table has a boolean
> column for each keyword. It has also a column for the ID and for the scoring.
> Now, for each keyword column and for the scoring column a separate index is
> created.
> C1 is implemented by an AND-query on the keyword columns, C2 by and OR query,
> and the result is sorted for the scoring column, cutting of after the first Z
> results.
>
> However our requirements for the search task have changed and I not yet
> managed to find a search approach with acceptable response time for following
> variation:
> Namely that one uses C2 and do not sort for a scoring column but use as
> scoring value the number of matched keywords for a given ID.
> The difficulty in this query type is that the scoring is dependent on the
> query itself..
>
> So has anyone an idea how to solve this query type with acceptable response
> time, or can anybody tell/prove, that this is theoretically not possible?
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
From | Date | Subject | |
---|---|---|---|
Next Message | Adrian Moisey | 2008-01-09 12:53:09 | Re: big database performance |
Previous Message | Jörg Kiegeland | 2008-01-09 10:43:42 | Search for fixed set of keywords |