Quick Links

Re: [GENERAL] Creation of tsearch2 index is very slow

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Stephan Vollmer <svollmer(at)gmx(dot)de>, pgsql-performance(at)postgreSQL(dot)org
Subject:	Re: [GENERAL] Creation of tsearch2 index is very slow
Date:	2006-01-20 21:50:17
Message-ID:	27264.1137793817@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general pgsql-performance

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...

Yeah, I just finished doing that and got about a 20x overall speedup
(30-some seconds to build the index instead of 10 minutes). However,
hemdistsign is *still* 70% of the runtime after doing that. The problem
seems to be that gtsvector_picksplit is calling it an inordinate number
of times:

0.53 30.02 1649/1649 FunctionCall2 <cycle 2> [19]
[20] 52.4 0.53 30.02 1649 gtsvector_picksplit [20]
29.74 0.00 23519673/33035195 hemdistsign [18]
0.14 0.00 22985469/22985469 hemdistcache [50]
0.12 0.00 268480/10030581 makesign [25]
0.02 0.00 270400/270400 fillcache [85]
0.00 0.00 9894/4077032 AllocSetAlloc [34]
0.00 0.00 9894/2787477 MemoryContextAlloc [69]

(hemdistcache calls hemdistsign --- I think gprof is doing something
funny with tail-calls here, and showing hemdistsign as directly called
from gtsvector_picksplit when control really arrives through hemdistcache.)

The bulk of the problem is clearly in this loop, which performs O(N^2)
comparisons to find the two entries that are furthest apart in hemdist
terms:

for (k = FirstOffsetNumber; k < maxoff; k = OffsetNumberNext(k))
{
for (j = OffsetNumberNext(k); j <= maxoff; j = OffsetNumberNext(j))
{
if (k == FirstOffsetNumber)
fillcache(&cache[j], GETENTRY(entryvec, j));

size_waste = hemdistcache(&(cache[j]), &(cache[k]));
if (size_waste > waste)
{
waste = size_waste;
seed_1 = k;
seed_2 = j;
}
}
}

I wonder if there is a way to improve on that.

regards, tom lane

In response to

Re: [GENERAL] Creation of tsearch2 index is very slow at 2006-01-20 21:37:54 from Martijn van Oosterhout

Responses

Re: [GENERAL] Creation of tsearch2 index is very slow at 2006-01-20 22:16:55 from Steinar H. Gunderson
Re: [GENERAL] Creation of tsearch2 index is very slow at 2006-01-20 22:33:27 from Martijn van Oosterhout
Re: [GENERAL] Creation of tsearch2 index is very slow at 2006-01-20 23:28:43 from Steinar H. Gunderson

Browse pgsql-general by date

	From	Date	Subject
Next Message	Jim C. Nasby	2006-01-20 22:00:40	Re: Page-Level Encryption
Previous Message	Rick Gigger	2006-01-20 21:49:12	panic on 7.3

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Jignesh K. Shah	2006-01-20 22:13:58	Re: Sudden slowdown of Pg server
Previous Message	Steinar H. Gunderson	2006-01-20 21:44:02	Re: [GENERAL] Creation of tsearch2 index is very slow