Re: tsearch2, large data and indexes

From: Ivan Voras <ivoras(at)freebsd(dot)org>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: tsearch2, large data and indexes
Date: 2014-04-22 07:57:14
Message-ID: CAF-QHFWbee3Anp+KKc5DEe2wufHtWYZ8zc1ysqXu2-0SW+v-UA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On 22 April 2014 08:40, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> On 04/20/2014 02:15 AM, Ivan Voras wrote:
>> More details: after thinking about it some more, it might have
>> something to do with tsearch2 and indexes: the large data in this case
>> is a tsvector, indexed with GIN, and the query plan involves a
>> re-check condition.
>>
>> The query is of the form:
>> SELECT simple_fields FROM table WHERE fts @@ to_tsquery('...').
>>
>> Does the "re-check condition" mean that the original tsvector data is
>> always read from the table in addition to the index?
>
>
> Yes, if the re-check condition involves the fts column. I don't see why you
> would have a re-check condition with a query like that, though. Are there
> some other WHERE-conditions that you didn't show us?

Yes, I've read about tsearch2 and GIN indexes and there shouldn't be a
recheck condition - but there is.
This is the query:

SELECT documents.id, title, raw_data, q, ts_rank(fts_data, q, 4) AS
rank, html_filename
FROM documents, to_tsquery('document') AS q
WHERE fts_data @@ q
ORDER BY rank DESC LIMIT 25;

And here is the explain analyze: http://explain.depesz.com/s/4xm
It clearly shows a bitmap index scan operation is immediately followed
by a recheck operation AND that the recheck operation actually does
something, because it reduces the number of records from 61 to 58
(!!!).

This is the table structure:

nn=# \d documents
Table "public.documents"
Column | Type | Modifiers
---------------+----------+--------------------------------------------------------
id | integer | not null default
nextval('documents_id_seq'::regclass)
ctime | integer | not null default unix_ts(now())
dtime | integer | not null
title | text | not null
html_filename | text | not null
raw_data | text |
fts_data | tsvector | not null
tags | text[] |
dtype | integer | not null default 0
flags | integer | not null default 0
Indexes:
"documents_pkey" PRIMARY KEY, btree (id)
"documents_html_filename" UNIQUE, btree (html_filename)
"documents_dtime" btree (dtime)
"documents_fts_data" gin (fts_data)
"documents_tags" gin (tags)

> The large fields are stored in the toast table. You can check if the toast
> table is accessed with a query like this:
>
> select * from pg_stat_all_tables where relid = (select reltoastrelid from
> pg_class where relname='table');
>
> Run that before and after your query, and see if the numbers change.

Before:

relid|schemaname|relname|seq_scan|seq_tup_read|idx_scan|idx_tup_fetch|n_tup_ins|n_tup_upd|n_tup_del|n_tup_hot_upd|n_live_tup|n_dead_tup|last_vacuum|last_autovacuum|last_analyze|last_autoanalyze|vacuum_count|autovacuum_count|analyze_count|autoanalyze_count
27290|pg_toast|pg_toast_27283|3|0|2481289|10631453|993194|0|266306|0|147931|2514||2014-04-18
00:49:11.066443+02|||0|11|0|0

After:

relid|schemaname|relname|seq_scan|seq_tup_read|idx_scan|idx_tup_fetch|n_tup_ins|n_tup_upd|n_tup_del|n_tup_hot_upd|n_live_tup|n_dead_tup|last_vacuum|last_autovacuum|last_analyze|last_autoanalyze|vacuum_count|autovacuum_count|analyze_count|autoanalyze_count
27290|pg_toast|pg_toast_27283|3|0|2481347|10632814|993194|0|266306|0|147931|2514||2014-04-18
00:49:11.066443+02|||0|11|0|0

idx_scan has changed from 2481289 to 2481347 (58)
idx_tup_fetch has changed from 10631453 to 10632814 (1361)

Number 58 corresponds to the number of rows found by the index, seen
in the EXPLAIN output, I don't know where 1361 comes from.

I'm also surprised by the amount of memory used for sorting (23 kB),
since the actually returned data from my query (all the tuples from
all the 58 rows) amount to around 2 kB - but this is not an actual
problem.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Souquieres Adam 2014-04-22 10:14:00 Query on partitioned table not using index
Previous Message Oleg Bartunov 2014-04-22 07:15:02 Re: Workaround: Planner preference for tsquery filter vs. GIN index in fast text search