Re: big data - slow select (speech search)

From: Michal Fapso <michal(dot)fapso(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: big data - slow select (speech search)
Date: 2010-07-02 05:23:21
Message-ID: AANLkTimWfY3KJTqBKdFy9MAPKpWAJqnthwRFHG2NiAyS@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I forgot to mention one thing. If you want to generate data using the
perl script, do this:

perl create_synthetic_data.pl > synthetic_data.sqlcopy

and then after you create the 'hyps' table, use the COPY command with
the generated file:

COPY hyps FROM '/the/full/path/synthetic_data.sqlcopy';

Best regards,
Miso Fapso

On 2 July 2010 00:34, Michal Fapso <michal(dot)fapso(at)gmail(dot)com> wrote:
> Hi,
>
> I have quite a simple query but a lot of data and the SELECT query is
> too slow. I will be really grateful for any advice on this.
>
> --------------------------------------------------
> The background info:
>
> I work on a speech search engine which differs from text search in
> having more words (hypotheses) on the same position and each
> hypothesis has some weight (probability) of occurrence.
>
> When a word 'hello' appears in a document 'lecture_1', there is a row
> in the table hyps (see below) which contains an array of all positions
> of word 'hello' in the document 'lecture_1' and for each position it
> contains a weight as well.
>
> I need the positions to be able to search for phrases. However, here I
> simplified the query as much as I could without a significant
> reduction in speed.
>
> I know there is tsearch extension which could be more appropriate for
> this but I didn't try that yet. The size of my data will be the same
> which seems to be the issue in my case. But maybe I am wrong and with
> tsearch it will be much faster. What do you think?
>
> --------------------------------------------------
> Preconditions:
>
> First I cleared the disk cache:
>  sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
>
> Then run the postgresql deamon and with psql client I connected to my
> database. The first thing I did then was executing the SELECT query
> described below. It took about 4.5 seconds. If I rerun it, it takes
> less than 2 miliseconds, but it is because of the cache. I need to
> optimize the first-run.
>
> --------------------------------------------------
> Hardware:
>
> laptop ASUS, CPU dual core T2300 1.66GHz, 1.5G RAM
>
> --------------------------------------------------
> Version:
>
> PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
> 4.4.1-4ubuntu9) 4.4.1, 32-bit
>
> compiled from sources, only --prefix=... argument given to ./configure
>
> --------------------------------------------------
> Schema:
>
> CREATE TABLE hyps (
>        docid INT,
>        wordid INT,
>        positions INT[],
>        weights REAL[],
>        length INT,
>        total_weight REAL
> );
> COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
> CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
> CREATE INDEX hyps_docid_index ON hyps USING hash (docid);
>
> shared_buffers = 300MB ...this is the only thing I changed in the config
>
> I tried that also with btree indices instead of hash and surprisingly
> the SELECT query was a bit faster. I would expect hash to be faster.
>
> The index on 'docid' column is there because I need to be able to
> search also in a particular document or in a set of documents.
> --------------------------------------------------
> Table info:
>
> - rows = 5490156
> - average length of positions vectors = 19.5
> - total number of items in positions vectors = 107444304
> - positions and weights in one row have the same number of items, but
> for each row the number may differ.
> - table data are loaded only once (using COPY) and are not modified anymore
> - there are 369 various docid and 161460 various wordid
> - VACUUM was executed after COPY of data
>
> --------------------------------------------------
> Query:
>
> EXPLAIN ANALYZE SELECT h1.docid
> FROM hyps AS h1
> WHERE h1.wordid=65658;
>
>  Bitmap Heap Scan on hyps h1  (cost=10.97..677.09 rows=171 width=4)
> (actual time=62.106..4416.864 rows=343 loops=1)
>   Recheck Cond: (wordid = 65658)
>   ->  Bitmap Index Scan on hyps_wordid_index  (cost=0.00..10.92
> rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
>         Index Cond: (wordid = 65658)
>  Total runtime: 4432.015 ms
>
> The result has 343 rows and there are 9294 items in positions vectors in total.
>
> --------------------------------------------------
> Comparison with Lucene:
>
> If I run the same query in Lucene search engine, it takes 0.105
> seconds on the same data which is quite a huge difference.
>
> --------------------------------------------------
> Synthetic data set:
>
> If you want to try it yourself, here is a script which generates the
> data for COPY command. I don't know whether it is possible to send
> attachments here, so I put the script inline. Just save it as
> create_synthetic_data.pl and run it by 'perl
> create_synthetic_data.pl'. With these synthetic data the SELECT query
> times are around 2.5 seconds. You can try the SELECT query with
> 'wordid' equal 1, 2, 3, ...10000.
>
>
> #!/usr/bin/perl
> # Create synthetic data for PostgreSQL COPY.
>
> $rows = 5490156;
> $docs = 369;
> $words = 161460;
> $docid = 0;
> $wordid = 0;
>
> for ($row=0; $row<$rows; $row++) {
>
>        my $sep          = "";
>        my $positions    = "";
>        my $weights      = "";
>        my $total_weight = 0;
>        my $items        = int(rand(39))+1;
>
>        if ($row % int($rows/$docs) == 0) {
>                $docid++;
>                $wordid = 0;
>        }
>        $wordid++;
>
>        for ($i=0; $i<$items; $i++) {
>                $position      = int(rand(20000));
>                $weight        = rand(1);
>                $positions    .= $sep.$position;
>                $weights      .= $sep.sprintf("%.3f", $weight);
>                $total_weight += $weight;
>                $sep           = ",";
>        }
>        print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
> }
>
>
> If you need any other info, I will gladly provide it.
>
> Thank You for Your time.
> Miso Fapso
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Sachin Kumar 2010-07-02 05:40:13 Performance issues with postgresql-8.4.0: Query gets stuck sometimes
Previous Message Tom Lane 2010-07-02 04:08:50 Re: Highly Efficient Custom Sorting