From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Proposal: q-gram GIN and GiST indexes |
Date: | 2011-04-05 08:52:59 |
Message-ID: | BANLkTinVYEVBq1=Lcs-7zzApCmax-wTXsw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > relatively small when q <= 5. Accordingly, I think we should expect
> indexes
> > to be usable with at least with q = 5.
>
> I defer to your opinion on this, since you know more about it than I
> do. But I think it would still be worthwhile to write a quick Perl
> script and calculate the number q-grams in various sample texts for
> various values of q. The worst case is surely exponential in q, so
> it'd be nice to have some evidence of what the real-world behavior is.
Here is distribution of numbers of different q-grams count in various
datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for
example, lowercased) should have lower counts.
q ds1 ds2 ds3 ds4
2 2313 3461 1625 1288
3 15146 25094 14090 10728
4 58510 105908 69127 47499
5 161801 298466 182680 110929
6 351175 633750 331090 176336
7 613299 1049088 496426 234730
8 921962 1450715 657965 283698
9 1248339 1793158 802188 321261
10 1556838 2066775 926043 348058
ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes
ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes
ds3 - set of person first and last names, 2142298 bytes
ds4 - english dictionary, 931708 bytes
Sure, q-grams count grows with q increasing. At low q we can see
approximately exponential grow. At high q grow is slowing and it is
approximately linear.
In the worst case count of q-grams is exponential in q if we think data
volume to be much higher then number of possible q-grams. But with high q
real limitation is total number of q-grams extracted from dataset. In worst
case each extracted q-gram is unique. This means that entries pages number
would be comparable with data pages number. In this case index size with
high q would be few times greater that index size with low q.
----
With best regards,
Alexander Korotkov.
From | Date | Subject | |
---|---|---|---|
Next Message | Shigeru HANADA | 2011-04-05 10:03:07 | Re: Re: [COMMITTERS] pgsql: Support comments on FOREIGN DATA WRAPPER and SERVER objects. |
Previous Message | Dimitri Fontaine | 2011-04-05 08:44:09 | Re: Re: synchronous_commit and synchronous_replication Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication. |