From: | Richard Huxton <dev(at)archonet(dot)com> |
---|---|
To: | Scott Cain <cain(at)cshl(dot)org>, PgSQL Performance ML <pgsql-performance(at)postgresql(dot)org> |
Cc: | pgsql-sql(at)postgresql(dot)org |
Subject: | Re: [SQL] EXTERNAL storage and substring on long strings |
Date: | 2003-08-04 15:55:48 |
Message-ID: | 200308041655.48151.dev@archonet.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance pgsql-sql |
On Monday 04 August 2003 16:25, Scott Cain wrote:
[snip]
> In an effort to find the best way to do this operation, I decided to
> look at what is my "worst case" scenario: the DNA sequence for human
> chromosome 1, which is about 250 million characters long (previous
> strings where about 20 million characters long). I wrote a perl script
> to do several substring operations over this very long string, with
> substring lengths varying between 1000 and 40,000 characters spread out
> over various locations along the string. While EXTENDED storage won in
> this case, it was a hollow victory: 38 seconds per operation versus 40
> seconds, both of which are way too long to for an interactive
> application.
>
> Time for a new method. A suggestion from my boss was to "shred" the DNA
> into smallish chunks and a column giving offsets from the beginning of
> the string, so that it can be reassembled when needed. Here is the test
> table:
>
> string=> \d dna
> Table "public.dna"
> Column | Type | Modifiers
> ---------+---------+-----------
> foffset | integer |
> pdna | text |
> Indexes: foffset_idx btree (foffset)
[snipped plpgsql function which stitches chunks together and then substrings]
> So here's the question: I've never written a plpgsql function before, so
> I don't have much experience with it; is there anything obviously wrong
> with this function, or are there things that could be done better? At
> least this appears to work and is much faster, completing substring
> operations like above in about 0.27 secs (that's about two orders of
> magnitude improvement!)
You might want some checks to make sure that smin < smax, otherwise looks like
it does the job in a good clean fashion.
Glad to hear it's going to solve your problems. Two things you might want to
bear in mind:
1. There's probably a "sweet spot" where the chunk size interacts well with
your data, usage patterns and PGs backend to give you peak performance.
You'll have to test.
2. If you want to search for a sequence you'll need to deal with the case
where it starts in one chunk and ends in another.
--
Richard Huxton
Archonet Ltd
From | Date | Subject | |
---|---|---|---|
Next Message | scott.marlowe | 2003-08-04 16:07:01 | Re: [PERFORM] OSDL Database Test Suite 3 is available on PostgreSQL |
Previous Message | Tom Lane | 2003-08-04 15:53:43 | Re: [SQL] EXTERNAL storage and substring on long strings |
From | Date | Subject | |
---|---|---|---|
Next Message | Scott Cain | 2003-08-04 16:14:06 | Re: [SQL] EXTERNAL storage and substring on long strings |
Previous Message | Tom Lane | 2003-08-04 15:53:43 | Re: [SQL] EXTERNAL storage and substring on long strings |