Re: gsoc, oprrest function for text search take 2

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To:
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc, oprrest function for text search take 2
Date: 2008-09-19 16:05:36
Message-ID: 48D3CDD0.9090105@students.mimuw.edu.pl
Lists: pgsql-hackers

ju219721(at)students(dot)mimuw(dot)edu(dot)pl wrote:
> Quoting Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
>
>> I wrote:
>>> ... One possibly
>>> performance-relevant point is to use DatumGetTextPP for detoasting;
>>> you've already paid the costs by using VARDATA_ANY etc, so you might
>>> as well get the benefit.
>>
>> Actually, wait a second. That code doesn't work at all on toasted data,
>> because it's trying to use VARSIZE_ANY_EXHDR() before detoasting.
>> That would give you the physical datum size (eg the size of the toast
>> pointer), not the number you need.
>>
>> However, this is actually not a problem because we know that the data
>> came from an array in pg_statistic, which means the individual members
>> *can't be toasted*. At least they can't be compressed or out-of-line.
>> We'd do that at the array level, it's not sensible to do it on an
>> individual array member.
>>
>> I think that right at the moment the array stuff doesn't permit short
>> headers either, but it would make sense to relax that someday. So I'd
>> recommend that your code allow either regular or short headers, but not
>> worry about compression or out-of-line storage.
>>
>> Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but
>> forget the "detoasting" step. Maybe put in
>> Assert(!VARATT_IS_COMPRESSED(datum) && !VARATT_IS_EXTERNAL(datum))
>> instead.

Well whaddya know. It turned out that my new company has a
'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a
full day to work on the patch.
Attached is a version that stores the minimal and maximal frequencies in
the Numbers array, has the aforementioned assertion and more nicely
ordered functions in ts_selfuncs.c.
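
To illustrate why keeping the minimal frequency at hand is useful, here is a minimal, self-contained sketch of a lookup over a presorted list of most common lexemes. It is not the code from the attached patch: the LexemeFreq and LexemeKey structs, the function names, and the "half of the minimal frequency" fallback for a lexeme that is not in the list are all assumptions made for the example, and it uses plain C strings with explicit lengths rather than PostgreSQL varlena datums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one most-common-lexeme entry (not the patch's struct). */
typedef struct
{
    const char *lexeme;     /* lexeme bytes, not necessarily NUL-terminated */
    size_t      len;        /* length of the lexeme in bytes */
    float       freq;       /* fraction of rows containing the lexeme */
} LexemeFreq;

/* Hypothetical search key: the lexeme taken from the query. */
typedef struct
{
    const char *lexeme;
    size_t      len;
} LexemeKey;

/*
 * Order by length first, then by bytes. Any total order works for a
 * binary search as long as the array was sorted with the same rule.
 */
static int
compare_lexeme(const void *key, const void *elem)
{
    const LexemeKey  *k = key;
    const LexemeFreq *e = elem;

    if (k->len != e->len)
        return (k->len > e->len) ? 1 : -1;
    return memcmp(k->lexeme, e->lexeme, k->len);
}

/*
 * Return the stored frequency if the lexeme is in the presorted list.
 * Otherwise fall back to half of the minimal stored frequency; this
 * fallback is an assumption of the sketch, chosen because a lexeme that
 * is absent from the list should be rarer than anything in it.
 */
static float
lexeme_selectivity(const LexemeFreq *entries, size_t nentries,
                   float min_freq, const char *lexeme, size_t len)
{
    LexemeKey         key = {lexeme, len};
    const LexemeFreq *hit;

    hit = bsearch(&key, entries, nentries, sizeof(LexemeFreq),
                  compare_lexeme);
    if (hit != NULL)
        return hit->freq;
    return min_freq / 2.0f;
}
```

With the minimal frequency available up front, the not-found case (such as 'foo' in the benchmark below) costs one binary search and no extra pass over the frequencies.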

I tested it with oprofile and
    pgbench -n -f tssel-bench.sql -t 1000 postgres
with tssel-bench.sql containing
    select * from manuals where tsvector @@ to_tsquery('foo');

"manuals" has ~700 rows and 'foo' does not appear in any of the lexemes.

The results are:
=== CVS HEAD ===
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 13.399584 (including connections establishing)
tps = 13.399972 (excluding connections establishing)

samples  %        symbol name
74069    34.7779  pglz_decompress
38560    18.1052  tsvectorout
7688     3.6098   pg_mblen
5366     2.5195   hash_search_with_hash_value
4833     2.2693   pg_utf_mblen
4718     2.2153   AllocSetAlloc
4041     1.8974   index_getnext
3100     1.4556   LWLockAcquire
3056     1.4349   hash_any
2843     1.3349   LWLockRelease
2611     1.2260   AllocSetFree
2126     0.9982   tsCompareString
2121     0.9959   _bt_compare
1830     0.8592   LockAcquire
1517     0.7123   toast_fetch_datum
1503     0.7057   .plt
1338     0.6282   _bt_checkkeys
1332     0.6254   FunctionCall2
1233     0.5789   ReadBuffer_common
1185     0.5564   slot_deform_tuple
1157     0.5433   TParserGet
1123     0.5273   LockRelease

=== PATCH ===
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 13.309346 (including connections establishing)
tps = 13.309761 (excluding connections establishing)

samples  %        symbol name
171514   35.0802  pglz_decompress
87231    17.8416  tsvectorout
17107    3.4989   pg_mblen
12514    2.5595   hash_search_with_hash_value
11124    2.2752   pg_utf_mblen
10739    2.1965   AllocSetAlloc
8534     1.7455   index_getnext
7460     1.5258   LWLockAcquire
6876     1.4064   LWLockRelease
6622     1.3544   hash_any
5773     1.1808   AllocSetFree
5210     1.0656   _bt_compare
4849     0.9918   tsCompareString
4043     0.8269   LockAcquire
3535     0.7230   .plt
3246     0.6639   _bt_checkkeys
3170     0.6484   toast_fetch_datum
3057     0.6253   FunctionCall2
2815     0.5758   ReadBuffer_common
2767     0.5659   TParserGet
2605     0.5328   slot_deform_tuple
2567     0.5250   MemoryContextAlloc

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

Attachment                      Content-Type  Size
tssel-oprrest-presorted.diff    text/plain    21.5 KB
