From: | Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Teodor Sigaev <teodor(at)sigaev(dot)ru>, Patches <pgsql-patches(at)postgresql(dot)org> |
Subject: | Re: Partial match in GIN (next vesrion) |
Date: | 2008-05-16 18:37:27 |
Message-ID: | Pine.LNX.4.64.0805162226440.21547@sn.sai.msu.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-patches |
Wildspeed was designed as an example application of the GIN's partial
match and as a useful extension for *short* strings. It's also good
standalone demonstration of GIN API. We tried to stay away from full text
search, parser, word delimiters and etc.
>From that point of view it might be
useful contrib, but I agree we have to think better to let it more
usable.
Oleg
On Fri, 16 May 2008, Tom Lane wrote:
> Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> http://www.sigaev.ru/misc/partial_match_gin-0.10.gz
>> http://www.sigaev.ru/misc/tsearch_prefix-0.9.gz
>> http://www.sigaev.ru/misc/wildspeed-0.12.tgz
>
> I've applied the first two of these with minor editorialization (mostly
> fixing documentation). However, I'm having a hard time convincing myself
> that anyone will find wildspeed useful in its current form. I did a
> simple experiment using a table of titles of database papers:
>
> contrib_regression=# select count(*), avg(length(title)) from pub;
> count | avg
> --------+---------------------
> 236984 | 64.7647520507713601
> (1 row)
>
> This takes about 22MB on disk as a Postgres table. I was expecting the
> wildspeed index to be about 65 times as large, which is bad enough
> already, but actually it weighed in at 2165MB or nearly 100X bigger.
> Plus it took forever to build: 35 minutes on a fairly fast machine
> with maintenance_work_mem set to 512MB.
>
> In comparison, building a conventional full-text-search index (GIN
> tsvector) took about 22 seconds including constructing the tsvector
> column, and the tsvectors plus index take about 54MB. The relative
> search performance is about what you'd expect from the difference in
> index sizes, ie, wildspeed loses.
>
> So I'm thinking wildspeed really needs to be redesigned if it's to be
> anything but a toy. I can't see putting it into contrib in this form.
>
> One idea that I had was to break the given string into words (splitting
> at spaces or punctuation) and store the rotations of individual words
> instead of the whole string. (Actually, maybe you only need suffixes
> not rotations, ie for 'abcd' store 'abcd', 'bcd', 'cd', 'd'.) Then
> similarly break the LIKE pattern apart at words to create word-fragment
> search keys. In this scheme the operator would always(?) require
> rechecking since any part of the pattern involving punctuation wouldn't
> be checkable by the index. The advantage is that the index bloat factor
> is governed by the average word length not the average whole-string
> length.
>
> There are probably other approaches that would help, too.
>
> regards, tom lane
>
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
From | Date | Subject | |
---|---|---|---|
Next Message | Merlin Moncure | 2008-05-16 19:49:36 | Re: libpq object hooks |
Previous Message | Andrew Chernow | 2008-05-16 18:34:20 | Re: libpq object hooks |