table clustering brings joy

From: Kevin Murphy <murphy(at)genome(dot)chop(dot)edu>
To: PostgreSQL general <pgsql-general(at)postgresql(dot)org>
Subject: table clustering brings joy
Date: 2005-08-16 16:47:13
Message-ID: 43021891.5040500@genome.chop.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

This is just an FYI for other people out there with large tables: table
clustering sped up my queries from 10-100 times, which I am very happy
about. I'm posting this in case it's ever useful to anybody. If
someone reading this feels that I did something wrong, let me know.

I conducted tests on a table with 13,982,464 rows that looks like this:

create table clustered_tagged_genes (
pmid integer,
mention text
);

The table is indexed on pmid and mention, and clustered on mention. The
statistics target was set to 1000 for each index, and the table was
vacuum analyzed. I haven't yet methodically compared varying the
statistics target yet.

FYI, there are 1,348,398 distinct mention values and 4,162,269 distinct
pmid values in the table.

For comparison, I had a non-clustered version of the same table:

create table tagged_genes (
pmid integer,
mention text
);

This table is indexed on pmid and mention, but not clustered. (Again,
statistics targets set to 1000, and table vacuum analyzed).

I tried two different equivalent queries on these tables:

"group-by-query"

SELECT pmid
FROM {tablename}
WHERE mention IN ({term1} [, ...])
GROUP BY pmid
HAVING count(distinct mention) = {num_terms};

and

"intersect-query"

SELECT pmid FROM {table_name} WHERE mention = {term1}
[ INTERSECT
SELECT pmid FROM {table_name} WHERE mention = {term2}
...
]

Out of curiosity, I also created a denormalized version of the table,
which ganged associated pmids together into an array column type:

create table array_tagged_genes (
pmids integer[],
mention text
);

The table is indexed on mention, statistics target set to 1000, and
table vacuum analyzed.

For the array_tagged_genes query, I used this:

"array-query":

SELECT * FROM array_explode((select pmids FROM array_tagged_genes
WHERE mention = {term1}
[ INTERSECT
SELECT * FROM array_explode((select pmids FROM array_tagged_genes
WHERE mention = {term2}
...
]

where array_explode is a plpgsql function to unpack the pmids from their
array and return them on separate rows.

The following test results were created by a perl program that dragged a
file larger than memory through memory, restarted the postgresql server,
ran explain analyze on the next query and averaged the results across
three trials of each query. (Every individual trial hopefully performed
a query without the benefit of any operating system or postgresql
caching.) The computer was an otherwise unloaded dual 1.8GHz PowerMac
G5 running Mac OS X 10.4.2 with 1GB RAM, running off a SATA drive (7200
rpm, 9ms seek, if you care).

Rank Query type/table Search terms Rows Msecs Msecs/row Relative time
1 intersect-query/clustered mycn 479 49.84 0.10 1
2 array-query/array mycn 479 79.18 0.17 1.59
3 group-by-query/clustered mycn 479 163.43 0.34 3.28
4 group-by-query/nonclustered mycn 479 2238.79 4.67 44.92







1 group-by-query/clustered mycn,trka 20 109.08 5.45 1
2 intersect-query/clustered mycn,trka 20 138.64 6.93 1.27
3 array-query/array mycn,trka 20 264.25 13.21 2.42
4 group-by-query/nonclustered mycn,trka 20 10076.79 503.84 92.38







1 intersect-query/clustered lcat 1341 68.11 0.05 1
2 group-by-query/clustered lcat 1341 86.61 0.06 1.27
3 array-query/array lcat 1341 222.2 0.17 3.26
4 group-by-query/nonclustered lcat 1341 7325.77 5.46 107.55







1 intersect-query/clustered eps15 148 61.05 0.41 1
2 group-by-query/clustered eps15 148 67.75 0.46 1.11
3 array-query/array eps15 148 80.45 0.54 1.32
4 group-by-query/nonclustered eps15 148 929.84 6.28 15.23







1 intersect-query/clustered brca1 2822 105.74 0.04 1
2 group-by-query/clustered brca1 2822 161.95 0.06 1.53
3 array-query/array brca1 2822 198.72 0.07 1.88
4 group-by-query/nonclustered brca1 2822 12113.8 4.29 114.56







1 intersect-query/clustered eps15,tax4,irs4 1 102.9 102.90 1
2 group-by-query/clustered eps15,tax4,irs4 1 112.58 112.58 1.09
3 array-query/array eps15,tax4,irs4 1 124.78 124.78 1.21
4 group-by-query/nonclustered eps15,tax4,irs4 1 1024.09 1024.09 9.95

-Kevin Murphy

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Richard Huxton 2005-08-16 17:05:23 Re: lock problem
Previous Message marcelo Cortez 2005-08-16 16:40:39 lock problem