From: | Michal Fapso <michal(dot)fapso(at)gmail(dot)com> |
---|---|
To: | pgsql-performance(at)postgresql(dot)org |
Subject: | big data - slow select (speech search) |
Date: | 2010-07-01 22:34:53 |
Message-ID: | AANLkTilAMfcbnEpk_7eAr-ikWzFptS4VtvIT0s5RvWmH@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Hi,
I have quite a simple query but a lot of data and the SELECT query is
too slow. I will be really grateful for any advice on this.
--------------------------------------------------
The background info:
I work on a speech search engine which differs from text search in
having more words (hypotheses) on the same position and each
hypothesis has some weight (probability) of occurrence.
When a word 'hello' appears in a document 'lecture_1', there is a row
in the table hyps (see below) which contains an array of all positions
of word 'hello' in the document 'lecture_1' and for each position it
contains a weight as well.
I need the positions to be able to search for phrases. However, here I
simplified the query as much as I could without a significant
reduction in speed.
I know there is tsearch extension which could be more appropriate for
this but I didn't try that yet. The size of my data will be the same
which seems to be the issue in my case. But maybe I am wrong and with
tsearch it will be much faster. What do you think?
--------------------------------------------------
Preconditions:
First I cleared the disk cache:
sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
Then run the postgresql deamon and with psql client I connected to my
database. The first thing I did then was executing the SELECT query
described below. It took about 4.5 seconds. If I rerun it, it takes
less than 2 miliseconds, but it is because of the cache. I need to
optimize the first-run.
--------------------------------------------------
Hardware:
laptop ASUS, CPU dual core T2300 1.66GHz, 1.5G RAM
--------------------------------------------------
Version:
PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
4.4.1-4ubuntu9) 4.4.1, 32-bit
compiled from sources, only --prefix=... argument given to ./configure
--------------------------------------------------
Schema:
CREATE TABLE hyps (
docid INT,
wordid INT,
positions INT[],
weights REAL[],
length INT,
total_weight REAL
);
COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
CREATE INDEX hyps_docid_index ON hyps USING hash (docid);
shared_buffers = 300MB ...this is the only thing I changed in the config
I tried that also with btree indices instead of hash and surprisingly
the SELECT query was a bit faster. I would expect hash to be faster.
The index on 'docid' column is there because I need to be able to
search also in a particular document or in a set of documents.
--------------------------------------------------
Table info:
- rows = 5490156
- average length of positions vectors = 19.5
- total number of items in positions vectors = 107444304
- positions and weights in one row have the same number of items, but
for each row the number may differ.
- table data are loaded only once (using COPY) and are not modified anymore
- there are 369 various docid and 161460 various wordid
- VACUUM was executed after COPY of data
--------------------------------------------------
Query:
EXPLAIN ANALYZE SELECT h1.docid
FROM hyps AS h1
WHERE h1.wordid=65658;
Bitmap Heap Scan on hyps h1 (cost=10.97..677.09 rows=171 width=4)
(actual time=62.106..4416.864 rows=343 loops=1)
Recheck Cond: (wordid = 65658)
-> Bitmap Index Scan on hyps_wordid_index (cost=0.00..10.92
rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
Index Cond: (wordid = 65658)
Total runtime: 4432.015 ms
The result has 343 rows and there are 9294 items in positions vectors in total.
--------------------------------------------------
Comparison with Lucene:
If I run the same query in Lucene search engine, it takes 0.105
seconds on the same data which is quite a huge difference.
--------------------------------------------------
Synthetic data set:
If you want to try it yourself, here is a script which generates the
data for COPY command. I don't know whether it is possible to send
attachments here, so I put the script inline. Just save it as
create_synthetic_data.pl and run it by 'perl
create_synthetic_data.pl'. With these synthetic data the SELECT query
times are around 2.5 seconds. You can try the SELECT query with
'wordid' equal 1, 2, 3, ...10000.
#!/usr/bin/perl
# Create synthetic data for PostgreSQL COPY.
$rows = 5490156;
$docs = 369;
$words = 161460;
$docid = 0;
$wordid = 0;
for ($row=0; $row<$rows; $row++) {
my $sep = "";
my $positions = "";
my $weights = "";
my $total_weight = 0;
my $items = int(rand(39))+1;
if ($row % int($rows/$docs) == 0) {
$docid++;
$wordid = 0;
}
$wordid++;
for ($i=0; $i<$items; $i++) {
$position = int(rand(20000));
$weight = rand(1);
$positions .= $sep.$position;
$weights .= $sep.sprintf("%.3f", $weight);
$total_weight += $weight;
$sep = ",";
}
print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
}
If you need any other info, I will gladly provide it.
Thank You for Your time.
Miso Fapso
From | Date | Subject | |
---|---|---|---|
Next Message | Eliot Gable | 2010-07-02 00:46:13 | Highly Efficient Custom Sorting |
Previous Message | Rajesh Kumar Mallah | 2010-07-01 17:59:20 | Re: how to (temporarily) disable/minimize benefits of disk block cache or postgresql shared buffer |