From: | "TJ O'Donnell" <tjo(at)acm(dot)org> |
---|---|
To: | <pgsql-general(at)postgresql(dot)org> |
Cc: | <DCorbit(at)connx(dot)com> |
Subject: | Re: visualizing B-tree index coverage |
Date: | 2005-01-25 23:50:05 |
Message-ID: | 1540.209.223.166.104.1106697005.squirrel@gnova.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Since I'm using a multi-column index, I can greatly influence
the nature of the index created, depending on which columns I use
and how many. I'm searching for an optimal set
of columns that creates an index that, for sure does not have
every value the same, nor only two values. Instead, I want to see
how well I've spread the index out over the data (if that phrasing makes sense).
More specifically, I have character data representing molecular structures.
I've written (rather slow) search functions. I can create any number of
columns that "fingerprint" each structure, e.g. # Carbon atoms, # N atoms,
# single bonds, etc. I expect my fingerprints will not be unique (fingerprint may
be a poor analogy), but rather will classify similar structures together.
I create a multi-column index on these counts and
get about 2-3 times speedup using 13 columns right now.
For example:
select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC') about 15 sec.
select count(smiles) from structure where
(_c, _n, _o, _s, _p, _halo,
_arom_c, _arom_n, _arom_o, _arom_s,
_atoms, _single_bonds, _other_bonds) >=
( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
and oe_matches(smiles,'c1ccccc1CC(=O)NC') about 6 seconds
when the (_c, etc.) is a multi-column index.
The data isn't inherently structured in any way that invites some particular number of columns
for indexing. I don't want to use too many, nor too few columns. I also
want to optimize the nature(which atom types, bond types, etc.)
of the count columns. While I could do this
and use the speedup as the measure of success, I think
that if my B-tree were "covering" the data well, I would get the best results.
Covering means finding that optimal situation where there is not one index for all rows
and also not a unique index for every row - something inbetween would be ideal,
or is that basically a wrong idea?
TJ
> Useful explanation of PostgreSQL index format:
> http://www.faqs.org/docs/ppbook/c13329.htm
>
> I think you are aiming for the wrong thing.
> The worst possible index is one with every value the same.
> The second worst (still basically useless) is one with only two values. The greater the
> differentiation of the data, the more workload is
> reduced on a search.
>
> Since it isn't a straight binary tree, I don't think that having highly dissimilar data in the
> index should be a problem.
>
> Do you have data or experience that shows otherwise?
>
> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org
> [mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of TJ O'Donnell Sent: Tuesday, January 25,
> 2005 2:19 PM
> To: pgsql-general(at)postgresql(dot)org
> Cc: tjo(at)acm(dot)org
> Subject: [GENERAL] visualizing B-tree index coverage
>
> Does anyone know of a tools that allows one to visualize
> the tree created by a multi-column B-tree index?
> A picture of a tree with branches, showing how "branchy" the
> tree is would be great.
> I'm wondering how well I've "clustered" the data in my table
> using the multi-column index. In other words, do my
> multi-columns sufficiently but not overly discriminate rows from each other?
> Do I have too many with the same index? (not enough branches)
> Do I have a unique index for each row? (way too many branches)
>
> Thanks,
> TJ
>
>
>
> ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to
> increase your free space map settings
From | Date | Subject | |
---|---|---|---|
Next Message | Jim C. Nasby | 2005-01-25 23:52:47 | Re: difficult JOIN |
Previous Message | John Gray | 2005-01-25 23:48:46 | Re: xpath_list() question for contrib/xml2 |