From: | Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, pgsql-performance(at)postgreSQL(dot)org |
Subject: | Re: Extremely slow intarray index creation and inserts. |
Date: | 2009-03-18 16:12:29 |
Message-ID: | 49C11D6D.8020604@cheapcomplexdevices.com |
Lists: | pgsql-performance |
Tom Lane wrote:
> Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com> writes:
>> vm=# create index "gist70000" on tmp_intarray_test using GIST (my_int_array gist__int_ops);
>> CREATE INDEX
>> Time: 2069836.856 ms
>
>> Is that expected, or does it sound like a bug to take over
>> half an hour to index 70000 rows of mostly 5 and 6-element
>> integer arrays?
>
> I poked at this example with oprofile. It's entirely CPU-bound AFAICT,
Oleg pointed out to me (off-list I now see) that it's not totally
unexpected behavior and I should have been using gist__intbig_ops,
since the "big" refers to the cardinality of the entire set (which
was large, in my case) and not the length of the arrays.
Oleg Bartunov wrote:
OB:> it's not about short or long arrays, it's about small or big
OB:> cardinality of the whole set (the number of unique elements)
I've re-read the docs and this still wasn't obvious to me. A
potential docs patch is attached below.
> and the CPU utilization is approximately
>
> 55% g_int_compress
> 35% memmove/memcpy (difficult to distinguish these)
> 1% pg_qsort
> <1% anything else
>
> Probably need to look at reducing the number of calls to g_int_compress
> ... it must be getting called a whole lot more than once per new index
> entry, and I wonder why that should need to be.
Perhaps that's a separate issue, but things are working
fine for us with gist__intbig_ops for the time being.
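As a rough illustration of the distinction above, the sketch below first
compares typical array length against the number of distinct elements
across all rows, and then builds the index with gist__intbig_ops. The
table and column names are the ones from the example earlier in this
thread; the index name "gist70000_big" is made up for illustration, and
unnest() and array_length() are assumed to be available (PostgreSQL 8.4
or later). No timings are claimed for it.

```sql
-- Array length vs. cardinality of the whole set: the arrays can be short
-- (5 or 6 elements each) while the number of unique elements is large.
SELECT avg(array_length(my_int_array, 1)) AS avg_array_length
  FROM tmp_intarray_test;

SELECT count(DISTINCT e) AS unique_elements
  FROM (SELECT unnest(my_int_array) AS e
          FROM tmp_intarray_test) AS s;

-- If unique_elements is large, name gist__intbig_ops explicitly instead
-- of relying on the default gist__int_ops.
-- "gist70000_big" is a hypothetical index name.
CREATE INDEX "gist70000_big" ON tmp_intarray_test
  USING GIST (my_int_array gist__intbig_ops);
```

The two SELECTs only read the table, so they are safe to run before
deciding which operator class to use.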
Here's a proposed docs patch that makes this more obvious.
*** a/doc/src/sgml/intarray.sgml
--- b/doc/src/sgml/intarray.sgml
***************
*** 239,245 ****
<literal>gist__int_ops</> (used by default) is suitable for
small and medium-size arrays, while
<literal>gist__intbig_ops</> uses a larger signature and is more
! suitable for indexing large arrays.
</para>
<para>
--- 239,247 ----
<literal>gist__int_ops</> (used by default) is suitable for
small and medium-size arrays, while
<literal>gist__intbig_ops</> uses a larger signature and is more
! suitable for indexing high-cardinality data sets - where there
! are a large number of unique elements across all rows being
! indexed.
</para>
<para>