From: | Hannu Krosing <hannu(at)tm(dot)ee> |
---|---|
To: | Troels Arvin <troels(at)arvin(dot)dk> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Adding a suffix array index |
Date: | 2004-11-19 12:38:20 |
Message-ID: | 1100867900.3919.15.camel@fuji.krosing.net |
Lists: | pgsql-hackers |
On Fri, 2004-11-19 at 12:42, Troels Arvin wrote:
> The basic parts of the type are pretty much done. Those interested may
> have a look at http://troels.arvin.dk/svn-snap/postgresql-dnaseq/ (the
> code organization needs some clean-up). The basic type implementation
> should be improved by adding more string functions and by implementing a
> set of specialized selectivity functions.
I can't answer your immediate questions, just rant on general issues ;)
> Part of my current code concerns packing DNA characters: As the alphabet
> of DNA strings is very small (four characters), it seems like a
> straightforward optimization to store each character in two bits.
My advice would be to get it to work first, optimize later.
Thus I guess you could start by storing DNA sequences as character
strings.
> A warning: This is my first C project, so please
> don't laugh too much (publicly) if you find strange constructs in my code...
Then even more so - get the novel and generally interesting part
(indexing huge arrays) right first, and optimise for space (packing
4 characters into 1 byte) later.
You could do this 4->1 compression when storing the string into the tuple,
but I strongly recommend doing the actual work (indexing/searching) at a
level that C directly supports (i.e. bytes/characters).
This enables you to get the basics right first without being distracted by
all the bit-shifting inside bytes. A well-tuned algorithm will almost
certainly offset the 4x space disadvantage.
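To make the split concrete, here is a minimal C sketch of packing only at
the storage boundary and unpacking back to plain characters before any
real work is done. Everything in it is an assumption for illustration:
the function names (dna_code, dna_packed_size, dna_pack, dna_unpack) are
hypothetical and not taken from Troels' code, the alphabet is strictly
A, C, G, T, and the bit layout (first base in the low two bits) is an
arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map a base to its 2-bit code; returns -1 for anything outside ACGT. */
int dna_code(char base)
{
    switch (base) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Number of bytes needed to hold n bases at 2 bits each. */
size_t dna_packed_size(size_t n)
{
    return (n + 3) / 4;
}

/*
 * Pack n bases from src into dst, which must hold dna_packed_size(n) bytes.
 * Returns 0 on success, -1 if src contains a character outside ACGT.
 */
int dna_pack(const char *src, size_t n, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    memset(dst, 0, dna_packed_size(n));
    for (size_t i = 0; i < n; i++) {
        int code = dna_code(src[i]);
        if (code < 0)
            return -1;
        /* Base i lives in byte i/4, at bit offset 2*(i%4). */
        dst[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
    return 0;
}

/* Unpack n bases from src into dst, which must have room for n + 1 chars. */
void dna_unpack(const uint8_t *src, size_t n, char *dst)
{
    static const char bases[4] = {'A', 'C', 'G', 'T'};

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = bases[(src[i / 4] >> (2 * (i % 4))) & 3];
    dst[n] = '\0';
}
```

With something like this, the indexing and searching code only ever sees
ordinary char strings; the packed form is needed only on the way into and
out of the tuple. Note that the base count n has to be stored alongside
the packed bytes, since the last byte may be only partly used.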
...
> My first and most immediate goal is to support efficient answering of a
> question like "which rows contain the sequence TTGACCACTTG in column foo?".
If you store your sequences as strings, you could try using trigrams (or
modify the code to use 4-, 5-, 6- or 7-grams ;) to get a feel for how that
works. The trigram module is in contrib/pg_trgm.
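
To give a rough idea of why n-grams help with a "which rows contain this
sequence" query, here is a small C sketch of an n-gram filter over plain
character strings. It is not how contrib/pg_trgm is implemented; it is
only an illustration of the general idea, and all names (QGRAM_LEN,
qgram_set, qgram_set_build, qgram_set_subset, dna_contains) are
hypothetical. It assumes the alphabet A, C, G, T and uses 4-grams, so
each sequence is summarised by a 256-bit presence map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QGRAM_LEN 4                          /* n-gram length, arbitrary choice */
#define QGRAM_COUNT (1u << (2 * QGRAM_LEN))  /* 4^QGRAM_LEN possible n-grams */

typedef struct {
    uint8_t bits[QGRAM_COUNT / 8];           /* one presence bit per n-gram */
} qgram_set;

/* Map a base to its 2-bit code; returns -1 for anything outside ACGT. */
static int base_code(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
    }
}

/* Record every n-gram of seq; windows with a non-ACGT character are skipped. */
void qgram_set_build(qgram_set *set, const char *seq)
{
    size_t n = strlen(seq);

    memset(set, 0, sizeof(*set));
    for (size_t i = 0; i + QGRAM_LEN <= n; i++) {
        unsigned key = 0;
        int ok = 1;

        for (size_t j = 0; j < QGRAM_LEN; j++) {
            int code = base_code(seq[i + j]);
            if (code < 0) {
                ok = 0;
                break;
            }
            key = (key << 2) | (unsigned)code;
        }
        if (ok)
            set->bits[key / 8] |= (uint8_t)(1u << (key % 8));
    }
}

/* Returns 1 if every n-gram recorded in a is also recorded in b. */
int qgram_set_subset(const qgram_set *a, const qgram_set *b)
{
    for (size_t i = 0; i < sizeof(a->bits); i++)
        if (a->bits[i] & ~b->bits[i])
            return 0;
    return 1;
}

/* Does row contain pattern? Filter on n-grams first, then verify exactly. */
int dna_contains(const char *row, const char *pattern)
{
    qgram_set row_set, pat_set;

    qgram_set_build(&row_set, row);
    qgram_set_build(&pat_set, pattern);
    if (!qgram_set_subset(&pat_set, &row_set))
        return 0;                     /* an n-gram is missing: cannot match */
    return strstr(row, pattern) != NULL;
}
```

The filter can only rule rows out: a row may contain all the n-grams of
the pattern without containing the pattern itself, so the exact check
with strstr() is still needed. In a real index the per-row sets would be
computed once and stored, not rebuilt for every query as this sketch does.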
------------
Hannu