Re: procost for to_tsvector

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers\(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: procost for to_tsvector
Date: 2015-05-02 04:40:00
Message-ID: 55153.1430541600@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>>> In the OP, he suggested "on the order of 100". Maybe we could just
>>> go with 100.

> Tom> I'm OK with that in view of <87h9trs0zm(dot)fsf(at)news-spur(dot)riddles(dot)org(dot)uk>

> Note that the results from that post suggest 100 as a bare minimum,
> higher values would be quite reasonable.

I'm not entirely convinced that your experiments disentangled the CPU cost
of to_tsvector itself from the costs of detoasting its input, which is an
issue that we ought to address separately. In particular, comparing to
textlen() is unreliable for this purpose since in single-byte encodings
textlen() does not have to dereference a TOAST pointer at all.

It is possible to prove that to_tsvector() is much more expensive per-byte
than, say, md5():

regression=# select sum(length((repeat('xyzzy ', i)))) from generate_series(1,10000) i;
sum
-----------
300030000
(1 row)

Time: 360.423 ms
regression=# select sum(length(md5(repeat('xyzzy ', i)))) from generate_series(1,10000) i;
sum
--------
320000
(1 row)

Time: 1339.806 ms
regression=# select sum(length(to_tsvector(repeat('xyzzy ', i)))) from generate_series(1,10000) i;
sum
-------
10000
(1 row)

Time: 78564.333 ms

These numbers put md5() at about 3.3 nsec/input byte on my machine, and
to_tsvector() with the 'english' configuration at about 260 nsec/byte.
It's certainly possible that lots of repetitions of 'xyzzy ' isn't a very
representative sample of typical to_tsvector input; but at least this
test does not involve any toasted-value access. So, as I said, I'm okay
with costing to_tsvector() at 100x the cost of md5(). I'm not convinced
that any factor above that is to_tsvector's fault.

> Tom> and some experiments of my own, but I wonder why we are only
> Tom> thinking of to_tsvector. Isn't to_tsquery, for example, just
> Tom> about as expensive? What of other text search functions?

> Making the same change for to_tsquery and plainto_tsquery would be
> reasonable; that would help with the seqscan cost for cases like
> to_tsvector('config',col) @@ to_tsquery('blah') where the non-immutable
> form of to_tsquery is used.

Works for me.

> I don't recall seeing cases of any of the other functions figuring into
> planner decisions.

It's not so much "are they popular" as "do they involve parsing raw
text". Once you've got the tsvector or tsquery, later steps are
(I think) much more efficient.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kohei KaiGai 2015-05-02 08:15:41 Re: feature freeze and beta schedule
Previous Message Peter Eisentraut 2015-05-02 03:04:49 Re: Auditing extension for PostgreSQL (Take 2)