| From: | Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> | 
|---|---|
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | Re: gsoc, oprrest function for text search take 2 | 
| Date: | 2008-09-19 16:05:36 | 
| Message-ID: | 48D3CDD0.9090105@students.mimuw.edu.pl | 
| Lists: | pgsql-hackers | 
ju219721(at)students(dot)mimuw(dot)edu(dot)pl wrote:
> Quoting Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> 
>> I wrote:
>>> ...  One possibly
>>> performance-relevant point is to use DatumGetTextPP for detoasting;
>>> you've already paid the costs by using VARDATA_ANY etc, so you might
>>> as well get the benefit.
>>
>> Actually, wait a second.  That code doesn't work at all on toasted data,
>> because it's trying to use VARSIZE_ANY_EXHDR() before detoasting.
>> That would give you the physical datum size (eg the size of the toast
>> pointer), not the number you need.
>>
>> However, this is actually not a problem because we know that the data
>> came from an array in pg_statistic, which means the individual members
>> *can't be toasted*.  At least they can't be compressed or out-of-line.
>> We'd do that at the array level, it's not sensible to do it on an
>> individual array member.
>>
>> I think that right at the moment the array stuff doesn't permit short
>> headers either, but it would make sense to relax that someday.  So I'd
>> recommend that your code allow either regular or short headers, but not
>> worry about compression or out-of-line storage.
>>
>> Which boils down to: keep using VARSIZE_ANY_EXHDR/VARDATA_ANY, but
>> forget the "detoasting" step.  Maybe put in
>>     Assert(!VARATT_IS_COMPRESSED(datum) && !VARATT_IS_EXTERNAL(datum))
>> instead.
Well, whaddya know. It turned out that my new company has a 
'Fridays-are-for-any-opensource-hacking-you-like' policy, so I got a 
full day to work on the patch.
Attached is a version that stores the minimal and maximal frequencies in 
the Numbers array, has the aforementioned assertion and more nicely 
ordered functions in ts_selfuncs.c.
I tested it with oprofile and
    pgbench -n -f tssel-bench.sql -t 1000 postgres
with tssel-bench.sql containing
    select * from manuals where tsvector @@ to_tsquery('foo');
"manuals" has ~700 rows and 'foo' does not appear in any of the lexemes.
The results are:
=== CVS HEAD ===
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 13.399584 (including connections establishing)
tps = 13.399972 (excluding connections establishing)
74069    34.7779  pglz_decompress
38560    18.1052  tsvectorout
7688      3.6098  pg_mblen
5366      2.5195  hash_search_with_hash_value
4833      2.2693  pg_utf_mblen
4718      2.2153  AllocSetAlloc
4041      1.8974  index_getnext
3100      1.4556  LWLockAcquire
3056      1.4349  hash_any
2843      1.3349  LWLockRelease
2611      1.2260  AllocSetFree
2126      0.9982  tsCompareString
2121      0.9959  _bt_compare
1830      0.8592  LockAcquire
1517      0.7123  toast_fetch_datum
1503      0.7057  .plt
1338      0.6282  _bt_checkkeys
1332      0.6254  FunctionCall2
1233      0.5789  ReadBuffer_common
1185      0.5564  slot_deform_tuple
1157      0.5433  TParserGet
1123      0.5273  LockRelease
=== PATCH ===
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 1000
number of transactions actually processed: 1000/1000
tps = 13.309346 (including connections establishing)
tps = 13.309761 (excluding connections establishing)
171514   35.0802  pglz_decompress
87231    17.8416  tsvectorout
17107     3.4989  pg_mblen
12514     2.5595  hash_search_with_hash_value
11124     2.2752  pg_utf_mblen
10739     2.1965  AllocSetAlloc
8534      1.7455  index_getnext
7460      1.5258  LWLockAcquire
6876      1.4064  LWLockRelease
6622      1.3544  hash_any
5773      1.1808  AllocSetFree
5210      1.0656  _bt_compare
4849      0.9918  tsCompareString
4043      0.8269  LockAcquire
3535      0.7230  .plt
3246      0.6639  _bt_checkkeys
3170      0.6484  toast_fetch_datum
3057      0.6253  FunctionCall2
2815      0.5758  ReadBuffer_common
2767      0.5659  TParserGet
2605      0.5328  slot_deform_tuple
2567      0.5250  MemoryContextAlloc
Cheers,
Jan
-- 
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
| Attachment | Content-Type | Size | 
|---|---|---|
| tssel-oprrest-presorted.diff | text/plain | 21.5 KB | 