Quick Links

Re: [GENERAL] Creation of tsearch2 index is very slow

From:	"Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>, pgsql-performance(at)postgresql(dot)org, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject:	Re: [GENERAL] Creation of tsearch2 index is very slow
Date:	2006-01-21 01:30:17
Message-ID:	43D18EA9.8060409@modgraph-usa.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general pgsql-performance

Tom Lane wrote:
> Well, we're trying to split an index page that's gotten full into two
> index pages, preferably with approximately equal numbers of items in
> each new page (this isn't a hard requirement though). ... If that's
> correct, what you really want is to divide the values so that the unions
> of the two sets have minimal overlap ... which seems to me to have
> little to do with what the code does at present.

This problem has been studied extensively by chemists, and they haven't found any easy solutions.

The Jarvis Patrick clustering algorithm might give you hints about a fast approach. In theory it's K*O(N^2), but J-P is preferred for large datasets (millions of molecules) because the coefficient K can be made quite low. It starts with a "similarity metric" for two bit strings, the Tanimoto or Tversky coefficients:

http://www.daylight.com/dayhtml/doc/theory/theory.finger.html#RTFToC83

J-P Clustering is described here:

http://www.daylight.com/dayhtml/doc/cluster/cluster.a.html#cl33

J-P Clustering is probably not the best for this problem (see the illustrations in the link above to see why), but the general idea of computing N-nearest-neighbors, followed by a partitioning step, could be what's needed.

Craig

In response to

Re: [GENERAL] Creation of tsearch2 index is very slow at 2006-01-21 00:23:10 from Tom Lane

Browse pgsql-general by date

	From	Date	Subject
Next Message	Michael Fuhr	2006-01-21 02:28:28	Re: standard normal cumulative distribution function
Previous Message	Keary Suska	2006-01-21 01:20:22	Re: How to fetch rows with multiple values

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Rikard Pavelic	2006-01-21 07:59:28	Re: [PERFORMANCE] Stored Procedures
Previous Message	Steinar H. Gunderson	2006-01-21 00:36:46	Re: [GENERAL] Creation of tsearch2 index is very slow