Re: WIP: Fast GiST index build

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: Fast GiST index build
Date: 2011-08-25 18:53:10
Message-ID: 4E569A16.8090405@enterprisedb.com
Lists: pgsql-hackers

On 24.08.2011 16:57, Alexander Korotkov wrote:
> I've added some testing results to the wiki page:
> http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011
> These are not yet all the results I planned for the first chunk, because it is
> taking more time than I expected.
> Some notes about it:
>
> Now I see two causes that accelerate the regular build of GiST indexes:
> 1) As noted before, regular index build of a fairly ordered dataset is fast.
> 2) I found that a worse index is faster to build; by "worse" I mean an index
> with more overlap. The gistchoose function selects the first index tuple with
> zero penalty, if there is one. Thus, with high overlap in the root page, only
> a few of its index tuples will ever be chosen for inserts. And, recursively,
> only a small part of the tree will be used for actual inserts, and that part
> of the tree fits into cache more easily. So high overlap makes inserts cheaper
> to the same degree that it makes searches more expensive.
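
To spell out the mechanism Alexander describes: gistchoose only replaces its
current candidate when it finds a strictly smaller penalty, so once a tuple
with zero penalty has been seen, no later tuple can displace it. Roughly like
this (a simplified sketch, not the actual gistchoose() source; penalty_for()
is a made-up stand-in for the real per-column penalty computation):

#include "postgres.h"
#include "access/gist_private.h"
#include "storage/bufpage.h"

/* hypothetical helper: penalty of inserting newtup under the subtree itup */
static float penalty_for(GISTSTATE *giststate, IndexTuple itup, IndexTuple newtup);

static OffsetNumber
gistchoose_sketch(Page page, IndexTuple newtup, GISTSTATE *giststate)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    OffsetNumber best = FirstOffsetNumber;
    float        best_penalty = -1.0f;  /* -1 means "nothing seen yet" */
    OffsetNumber off;

    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        IndexTuple  itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, off));
        float       penalty = penalty_for(giststate, itup, newtup);

        /* replace the candidate only on a strictly smaller penalty */
        if (best_penalty < 0 || penalty < best_penalty)
        {
            best = off;
            best_penalty = penalty;
        }
        /* a tie never displaces an earlier tuple, so the first
           zero-penalty tuple always wins */
    }
    return best;
}

With heavy overlap, many subtrees have zero penalty for a new tuple, and all
of those inserts funnel into the first of them, which is how the small,
cache-friendly working set arises.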

As an extreme case, a trivial penalty function that just always returns
0 will make index build fast - but the index will be useless for querying.
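
Something along these lines, using the standard GiST penalty C interface (the
function name is made up, purely to illustrate the degenerate case):

#include "postgres.h"
#include "fmgr.h"
#include "access/gist.h"

PG_FUNCTION_INFO_V1(always_zero_penalty);

/*
 * Degenerate penalty function: reports zero penalty for every subtree, so
 * gistchoose always descends into the first entry on each page. The build
 * touches only a tiny, cache-resident part of the tree, but the result has
 * terrible overlap and is useless for searching.
 */
Datum
always_zero_penalty(PG_FUNCTION_ARGS)
{
    /* args 0 and 1 are the existing and new GISTENTRYs; ignored here */
    float      *penalty = (float *) PG_GETARG_POINTER(2);

    *penalty = 0.0f;
    PG_RETURN_POINTER(penalty);
}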

> In the tests on the first version of the patch, I found the index quality of
> the regular build much better than that of the buffering build (without
> neighbor relocation). Now they are similar, but only because the index quality
> of the regular build has become worse. Consequently, in the current tests the
> regular index build is faster than in the previous ones. I see the following
> possible causes:
> 1) I didn't save the source random data, so this run uses new random data.
> 2) Some environment parameters of my test setup may have changed, though I
> doubt it.
> Despite these possible explanations, it still seems quite strange to me.

That's pretty surprising. Assuming the data is truly random, I wouldn't
expect a big difference in the index quality of one random data set over
another. If the index quality depends so much on, say, the distribution
of the first few tuples inserted into it, that's quite an interesting
find on its own, and merits some further research.

> In order to compare the index build methods on higher-quality indexes, I've
> tried building indexes with my double sorting split method (see:
> http://syrcose.ispras.ru/2011/files/SYRCoSE2011_Proceedings.pdf#page=36).
> On a uniform dataset, search is then about 10 times faster! And, as expected,
> the regular index build becomes much slower: it has been running for more
> than 60 hours and only about 50% of the index is complete (estimated from the
> file sizes).
>
> Also, automatic switching to buffering build shows better index quality in
> all the tests, though it's hard for me to explain why.

Hmm, it makes me a bit uneasy that we're testing with a modified page
splitting algorithm. But if the new algorithm is that good, could you
post that as a separate patch, please?

That said, I don't see any new evidence that the buffering build
algorithm would be significantly worse. There's the case of ordered data
that we already knew about, and will have to just accept for now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
