From: | Rod Taylor <rod(dot)taylor(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: GIN improvements part2: fast scan |
Date: | 2013-11-15 15:17:29 |
Message-ID: | CAKddOFCqZw-t0e1aoz7i+de0itVU9x1=bNcn8kv=g_92TK1BWA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I tried again this morning using gin-packed-postinglists-16.patch and
gin-fast-scan.6.patch. No crashes during index building.
Pg was compiled with debug enabled in both cases. The data is a ~0.1%
random sample of production data (10,000,000 records for the test) with the
below structure.
Table "public.kp"
Column | Type | Modifiers
--------+---------+-----------
id | bigint | not null
string | text | not null
score1 | integer |
score2 | integer |
score3 | integer |
score4 | integer |
Indexes:
"kp_pkey" PRIMARY KEY, btree (id)
"kp_string_key" UNIQUE CONSTRAINT, btree (string)
"textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string))
WHERE score1 IS NOT NULL
All data is in Pg buffer cache for these timings. Words like "the" and
"and" are very common (~9% of entries, each) and a word like "hotel" is
much less common (~0.2% of entries). Below is the query tested:
SELECT id,string
FROM kp
WHERE score1 IS NOT NULL
AND to_tsvector('simple', string) @@ to_tsquery('simple', ?)
-- ? is substituted with the query strings
ORDER BY score1 DESC, score2 ASC
LIMIT 1000;
Limit (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032
rows=142 loops=1)
-> Sort (cost=56.04..56.04 rows=1 width=37) (actual
time=250.008..250.017 rows=142 loops=1)
Sort Key: score1, score2
Sort Method: quicksort Memory: 36kB
-> Bitmap Heap Scan on kp (cost=52.01..56.03 rows=1 width=37)
(actual time=249.711..249.945 rows=142 loops=1)
Recheck Cond: ((to_tsvector('simple'::regconfig, string) @@
'''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL))
-> Bitmap Index Scan on textsearch_gin_idx
(cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142
loops=1)
Index Cond: (to_tsvector('simple'::regconfig, string)
@@ '''hotel'' & ''and'' & ''the'''::tsquery)
Total runtime: 250.096 ms
Times are from \timing on.
MASTER
=======
the: 888.436 ms 926.609 ms 885.502 ms
and: 944.052 ms 937.732 ms 920.050 ms
hotel: 53.992 ms 57.039 ms 65.581 ms
and & the & hotel: 260.308 ms 248.275 ms 248.098 ms
These numbers roughly match what we get with Pg 9.2. The time savings
between 'the' and 'and & the & hotel' is mostly heap lookups for the score
and the final sort.
The size of the index on disk is about 2% smaller in the patched version.
PATCHED
=======
the: 1055.169 ms 1081.976 ms 1083.021 ms
and: 912.173 ms 949.364 ms 965.261 ms
hotel: 62.591 ms 64.341 ms 62.923 ms
and & the & hotel: 268.577 ms 259.293 ms 257.408 ms
hotel & and & the: 253.574 ms 258.071 ms 250.280 ms
I was hoping that the 'and & the & hotel' case would improve with this
patch to be closer to the 'hotel' search, as I thought that was the kind of
thing it targeted. Unfortunately, it did not. I actually applied the
patches, compiled, initdb/load data, and ran it again thinking I made a
mistake.
Reordering the terms 'hotel & and & the' doesn't change the result.
On Fri, Nov 15, 2013 at 1:51 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
> On Fri, Nov 15, 2013 at 3:25 AM, Rod Taylor <rbt(at)simple-knowledge(dot)com>wrote:
>
>> I checked out master and put together a test case using a small
>> percentage of production data for a known problem we have with Pg 9.2 and
>> text search scans.
>>
>> A small percentage in this case means 10 million records randomly
>> selected; has a few billion records.
>>
>>
>> Tests ran for master successfully and I recorded timings.
>>
>>
>>
>> Applied the patch included here to master along with
>> gin-packed-postinglists-14.patch.
>> Run make clean; ./configure; make; make install.
>> make check (All 141 tests passed.)
>>
>> initdb, import dump
>>
>>
>> The GIN index fails to build with a segfault.
>>
>
> Thanks for testing. See fixed version in thread about packed posting lists.
>
> ------
> With best regards,
> Alexander Korotkov.
>
From | Date | Subject | |
---|---|---|---|
Next Message | Kevin Grittner | 2013-11-15 15:18:24 | Re: strncpy is not a safe version of strcpy |
Previous Message | Tom Lane | 2013-11-15 15:14:43 | Re: Database disconnection and switch inside a single bgworker |