Re: GIN - Generalized Inverted iNdex.

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN - Generalized Inverted iNdex.
Date: 2006-04-07 13:52:00
Message-ID: Pine.GSO.4.63.0604071750480.23110@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

One addition - some results of Gin testing is available:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/GinTest

Oleg

On Fri, 7 Apr 2006, Teodor Sigaev wrote:

> We (me and Oleg) are glad to present GIN to PostgreSQL. If community will
> agree, we will commit it to HEAD branch.
>
> http://www.sigaev.ru/gin/gin.gz
> http://www.sigaev.ru/gin/README
>
> Install:
> % cd pgsql
> % zcat gin.gz | patch -p0
> make and initdb
>
> README:
>
> Gin for PostgreSQL
>
> Gin stands for Generalized Inverted iNdex and should be considered as a
> genie,
> not drink.
>
> Generalized means that index doesn't know what operation it accelerates, it
> works with strategies, defined for specific data type (Index Method
> Strategies).
> In that sense, gin is similar to GiST and differ from btree index,which has
> predefined comparison based operations.
>
> Inverted index is an index structure storing a (key,posting list) pairs,
> where
> posting list is a set of documents, which contain this key (document,
> usually,
> contains many keys). The primary goal of the Gin index is a scalable full
> text
> search in PostgreSQL.
>
> Gin is consists of a B-tree builded over entries (ET, entries tree), where
> entry is an element of indexed value ( element of array, lexeme for tsvector)
> and where each tuple in the leaf pages is either pointer to B-tree over item
> pointers (PT, posting tree), or list of item pointers (PL, posting list) if
> tuple is small enough.
>
> Notes: There is no delete operation for ET. The reason for that is that from
> our experience, a set of unique words of a whole collection changed very
> rare.
> This greatly simplify code and concurrency algorithm.
>
> Gin comes with built-in support for one dimensional arrays ( integer[],
> text[],
> no support for NULL elements) and following operations:
>
> * contains : value_array @ query_array
> * overlap : value_array && query_array
> * contained: value_array ~ query_array
>
> Synopsis
>
> =# create index txt_idx on aa using gin(a);
>
> Features
>
> * Concurrency
> * WAL-logging (recoverable)
> * user-defined opclass, the scheme is similar to GiST
> * optimized index creation (make use of maintenance_work_mem to
> accumulate
> postings in memory)
>
> Limitations
>
> * no support for multicolumn indices
> * Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples
> * Gin search entry only by equality matching, this may be improved in
> future
>
> Gin interface
>
> OpClass interface (pseudocode). Example for opclass is in ginarayproc.c.
>
> Datum* extractValue(Datum inputValue, uint32* nentries)
> Returns array of Datum of entries of value to be indexed, nentries should
> contains number of returning entries
> int compareEntry( Datum a, Datum b )
> Compares two entries (not the indexing values!)
> Datum* extractQuery(Datum query, uint32* nentries, StrategyNumber n)
> Returns array of Datum of entries of query to be searched, n contains
> Strategy number of operation.
> bool consistent( bool[] check, StrategyNumber n, Datum query)
> The size of check array is the same as sizeof of array returned by
> extractQuery. Each element of check array is true if indexed value has
> corresponding entry in the query, i.e. if check[i] == TRUE then i-th
> entry
> of query is presented in indexed value. Function should returns true if
> indexed value matches by StrategyNumber and query.
>
> Open items (We appreciate any comments, help, advice and recommendations).
>
> * teach optimizer/executor, that GIN is intrinsically clustered, i.e., it
> always returns ItemPointer in ascending order.
> * tweak gincostestimate
> * GIN stores several ItemPointer to heap tuple, so vacuum full produces
> warning message:
> WARNING: index "idx" contains 88395 row versions, but table contains
> 51812 row versions
> HINT: Rebuild the index with REINDEX.
>
> TODO
>
> Nearest future:
>
> * add tsearch2 opclass
> * create full scale test suite
>
> Distant future:
>
> * Replace entry btree to something like to GiST
> * Add multicolumn support
> * Optimize insert operation (use background index insertion)
>
>
> Authors:
> Teodor Sigaev <teodor(at)sigaev(dot)ru>
> Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru)
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Oleg Bartunov 2006-04-07 13:53:36 Re: GIN - Generalized Inverted iNdex.
Previous Message Nicolas Barbier 2006-04-07 13:41:35 Re: WAL Bypass for indexes