Quick Links

Re: Proposal: q-gram GIN and GiST indexes

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Proposal: q-gram GIN and GiST indexes
Date:	2011-04-05 11:56:07
Message-ID:	BANLkTi=dmsdDzBEZhVtNnaY11cuvVQ-h-Q@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Tue, Apr 5, 2011 at 4:52 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
>> <aekorotkov(at)gmail(dot)com> wrote:
>> > relatively small when q <= 5. Accordingly, I think we should expect
>> > indexes
>> > to be usable with at least with q = 5.
>>
>> I defer to your opinion on this, since you know more about it than I
>> do. But I think it would still be worthwhile to write a quick Perl
>> script and calculate the number q-grams in various sample texts for
>> various values of q. The worst case is surely exponential in q, so
>> it'd be nice to have some evidence of what the real-world behavior is.
>
> Here is distribution of numbers of different q-grams count in various
> datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for
> example, lowercased) should have lower counts.
> q ds1 ds2 ds3 ds4
> 2 2313 3461 1625   1288
> 3 15146   25094   14090 10728
> 4 58510 105908   69127 47499
> 5 161801 298466 182680 110929
> 6 351175 633750 331090 176336
> 7 613299 1049088 496426 234730
> 8 921962 1450715 657965 283698
> 9 1248339 1793158 802188 321261
> 10 1556838 2066775 926043 348058
> ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes
> ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes
> ds3 - set of person first and last names, 2142298 bytes
> ds4 - english dictionary, 931708 bytes
> Sure, q-grams count grows with q increasing. At low q we can see
> approximately exponential grow. At high q grow is slowing and it is
> approximately linear.
> In the worst case count of q-grams is exponential in q if we think data
> volume to be much higher then number of possible q-grams. But with high q
> real limitation is total number of q-grams extracted from dataset. In worst
> case each extracted q-gram is unique. This means that entries pages number
> would be comparable with data pages number. In this case index size with
> high q would be few times greater that index size with low q.

So with q=5, the index will be approximately 10x larger than with q=3.
Maybe that's OK, I'm not sure. But it is a big difference.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Proposal: q-gram GIN and GiST indexes at 2011-04-05 08:52:59 from Alexander Korotkov

Responses

Re: Proposal: q-gram GIN and GiST indexes at 2011-04-05 12:09:49 from Alexander Korotkov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Alexander Korotkov	2011-04-05 12:09:49	Re: Proposal: q-gram GIN and GiST indexes
Previous Message	Robert Haas	2011-04-05 11:52:26	Re: Re: synchronous_commit and synchronous_replication Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.