Re: testing/predicting optimization using indexes

From: PFC <lists(at)boutiquenumerique(dot)com>
To: tjo(at)acm(dot)org
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: testing/predicting optimization using indexes
Date: 2005-01-26 23:07:02
Message-ID: opsk8ft0qzth1vuj@musicbox
Lists: pgsql-sql


> Finally, I built the table with all the additional columns created
> during the initial creation of the table. The original speed was
> obtained!

Quite strange!
Did you VACUUM FULL? ANALYZE? Did you set a default value for the
columns? Hmm... maybe it's not the act of adding the columns, but the
act of filling them with values, which leaves dead rows that vacuum
can't reclaim if your fsm setting is too small?
Try VACUUM VERBOSE; good luck parsing the results ;)
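
For example, a minimal sequence to rule out bloat (on your structure
table; note VACUUM FULL rewrites the table, so it takes a while and
holds a lock):

    -- reclaim dead space left by the column updates, then refresh stats
    VACUUM FULL structure;
    ANALYZE structure;
    -- prints per-table dead tuple and free space map information
    VACUUM VERBOSE structure;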

> The secondary issue was one of using an index on the additional columns.
> This greatly speeds up the overall search, by limiting the number of
> rows needing to use oe_matches. I am currently working on optimizing the
> number and nature of these extra columns. However, my initial question
> still remains. Once I find a good set of columns to use as an index,
> will I then get even greater speed by defining a new data type and an
> index method equivalent to my multi-column index?

You'll know that by counting the rows matched by the pre-filter (your
columns) and counting the rows actually matched; the difference is the
number of oe_match calls you saved. Then look at the mean time of an
oe_match call...
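
Something like this, where the pre-filter on hypothetical columns a and
b stands in for whatever columns you end up with:

    -- rows that survive the cheap column pre-filter
    SELECT count(*) FROM structure WHERE a > 5 AND b > 2;

    -- rows that really match; the gap between the two counts is the
    -- number of oe_matches calls the pre-filter saved you
    SELECT count(*) FROM structure
     WHERE a > 5 AND b > 2
       AND oe_matches(smiles, 'c1ccccc1CC(=O)NC');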

> SELECT count(*) FROM structure
> 237597
>
> SELECT avg(length(smiles)) FROM structure
> 37.6528912402092619

Well, each of your rows has about a 26-byte header, then roughly 45
bytes of TEXT (4 bytes of varlena overhead plus the ~38-character
string), and 4 bytes per integer column... I don't think the bytes
spent on your extra columns are significant. They could have been,
had your smiles strings been much shorter.

> EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM
> structure
> Seq Scan on structure (cost=0.00..7573.96 rows=237597 width=41) (actual
> time=17.443..15025.974 rows=237597 loops=1)
> Total runtime: 16786.542 ms
>
> EXPLAIN ANALYZE SELECT smiles FROM structure
> Seq Scan on structure (cost=0.00..6979.97 rows=237597 width=41) (actual
> time=0.067..735.884 rows=237597 loops=1)
> Total runtime: 1200.661 ms

OK, so it takes 1.2 secs to actually read the data and 16.8 secs to
read it and run oe_match: (16.8 - 1.2) s / 237597 rows is about 65
microseconds per call... Note that this time could depend a lot on the
smiles column, and also on the query string!

What you need now is to estimate the selectivity of your pre-filtering
columns, so you can pick the best possible set: for various smiles
queries, compute the row count that gets past the filter and the row
count that actually matches oe_match. Ideally you want the first to be
as close as possible to the second, but for your test query, since you
return 0.5% of the table, even an inefficient pre-filter letting 10% of
the rows through would yield a 10x speed improvement. You'd want to get
below the 2-3% bar so that postgres will use an index scan, which will
be even faster. Don't forget the sanity check that every row matching
your smiles query also matches your columns filter!
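
That check could look like this (again using the hypothetical a/b
pre-filter); it must return 0, otherwise the pre-filter is throwing
away real matches:

    -- real matches that the pre-filter would wrongly exclude
    SELECT count(*) FROM structure
     WHERE oe_matches(smiles, 'c1ccccc1CC(=O)NC')
       AND NOT (a > 5 AND b > 2);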

Also, using several columns (say a, b, c, d) is not optimal. Say a, b,
c, d each contain integers between 0 and 10, uniformly distributed;
then a query starting with 'a>=0' will automatically match more than
90% of the data and won't use the index. You'll get a seq scan. So,
either you can always make your first column very selective, or you'll
have to use a gist index and integer arrays.
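
With the contrib/intarray module that would look something like this,
assuming you pack your descriptors into a hypothetical props int[]
column:

    -- GiST index over the integer array
    CREATE INDEX structure_props_idx ON structure
     USING gist (props gist__int_ops);

    -- intarray's @ operator means "contains": rows whose props array
    -- contains all the listed values, and it can use the GiST index
    SELECT count(*) FROM structure WHERE props @ '{3,7}';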

If you get times that you like, then you're done; else there may be
another path for optimization, getting your hands dirty in the code,
but not to the point of creating index types:

You'll have noticed that the 'c1ccccc1CC(=O)NC' string gets reparsed
for every processed row. You should benchmark how much time is lost in
this parsing. You probably won't be able to do that from within
postgres (maybe by matching 'c1ccccc1CC(=O)NC' against an empty smiles
string?), so you may have to call the C++ functions directly.
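
A rough in-database attempt might be the following; it assumes
oe_matches is not marked IMMUTABLE (otherwise the planner would fold
the constant call), and whether it truly isolates parse time depends on
how oe_matches handles an empty molecule, so treat the number with
suspicion:

    -- same 237597 calls, but no molecule to match against: whatever
    -- time remains over the plain scan is roughly parsing overhead
    EXPLAIN ANALYZE SELECT oe_matches('', 'c1ccccc1CC(=O)NC')
      FROM structure;
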
If this time is significant, you might want to create a datatype which
holds a precompiled query string. You'll have to write a few C
functions for that (don't ask me) but it should be a lot simpler than
coding a new index type. Then you'd create a special version of
oe_match which takes a precompiled query string. Depending on the time
needed for parsing, it may be a win.
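
The SQL side of it might look like this; every name here is
hypothetical ($libdir/oe_chem standing for your shared library), and
the real work is in the C functions behind the declarations:

    -- declare a shell type, then its C-language I/O functions
    CREATE TYPE qsmarts;
    CREATE FUNCTION qsmarts_in(cstring) RETURNS qsmarts
      AS '$libdir/oe_chem', 'qsmarts_in' LANGUAGE C STRICT;
    CREATE FUNCTION qsmarts_out(qsmarts) RETURNS cstring
      AS '$libdir/oe_chem', 'qsmarts_out' LANGUAGE C STRICT;
    CREATE TYPE qsmarts (
      INPUT = qsmarts_in,
      OUTPUT = qsmarts_out,
      INTERNALLENGTH = VARIABLE
    );

    -- oe_matches overload that skips reparsing the pattern each row
    CREATE FUNCTION oe_matches(text, qsmarts) RETURNS boolean
      AS '$libdir/oe_chem', 'oe_matches_precompiled'
      LANGUAGE C STRICT;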
