Re: speeding up a query on a large table

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Kevin Murphy <murphy(at)genome(dot)chop(dot)edu>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: speeding up a query on a large table
Date: 2005-08-17 20:28:49
Message-ID: nc77g1tfn1m5urhqphjtkatudd3o1nll6n@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, 25 Jul 2005 17:50:55 -0400, Kevin Murphy
<murphy(at)genome(dot)chop(dot)edu> wrote:
> and because the number of possible search terms is so large, it
>would be nice if the entire index could somehow be preloaded into memory
>and encouraged to stay there.

Postgres does not have such a feature and I wouldn't recommend to mess
around inside Postgres. You could try to copy the relevant index
file(s) to /dev/null to populate the OS cache ...

>There are 1.3M unique values in that column.

That'd mean that each value occours 10 times on average. In your
tests the planner consistently estimates 81, and the real numbers are
even higher. Can this be explained by the nature of the data
distribution?

>and from postgresql.conf:
>shared_buffers = 15000
>work_mem = 32768
>Everything else in postgresql.conf is default.

Setting effective_cache_size to a sane value wouldn't hurt. I don't
know about OS X; does it, like Linux, automatically tune its disk
cache or do you have to configure it somewhere?

>tagged_genes table:
>13,982,464 rows
>422,028 pages (although about half of that is the experimental tsvector
>column, though!)
>The index of the query column (mention) is 226,229 pages (= 1.7 GB?).

The average tuples per page ratio seems a bit low, both for the heap
(~33) and for the index (~62). If the planner's tuple size estimation
of 67 bytes is approximately right, there's a lot of free space in
your relations. Try VACUUM FULL and REINDEX or CLUSTER to shrink
these files.

>create table tagged_genes (
> id bigint NOT NULL PRIMARY KEY, -- artificial primary key
> mention text, -- a gene name or description
> pmid bigint, -- identifies the document that
>the mention occurs in
> create_date timestamp NOT NULL,
> vector tsvector -- experimental tsearch2 index of
>mention column
>);
>create index tg_mention_idx on tagged_genes(mention);
>create index tg_pmid_idx on tagged_genes(pmid);
>create index tg_vector_idx on tagged_genes(vector);

If mention is long (which is not implied by your examples, but an int
is still smaller than any nonempty text) and there are many
duplicates, it might pay off to put them in their own table:

CREATE TABLE mentions (
id SERIAL PRIMARY KEY,
mention text UNIQUE,
vector tsvector -- experimental tsearch2 index
) WITHOUT oids;
create index me_vector_idx on mentions(vector);

and reference them from tagged_genes:

create table tagged_genes (
id bigint NOT NULL PRIMARY KEY,
mentionid int REFERENCES mentions,
pmid bigint, -- identifies the document that
-- the mention occurs in
create_date timestamp NOT NULL
) WITHOUT oids;

Unless my math is wrong, this would result in a heap size of ~120K
pages and an index size of ~52K pages, plus some 10% slack for updated
and deleted tuples, if you VACUUM regularly.

Servus
Manfred

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bruno Wolff III 2005-08-17 20:29:02 Re: Finding nearest numeric value
Previous Message Tom Lane 2005-08-17 20:25:48 Re: How to "ping" the database