Re: [SQL] EXTERNAL storage and substring on long strings

From: Richard Huxton <dev(at)archonet(dot)com>
To: Scott Cain <cain(at)cshl(dot)org>, PgSQL Performance ML <pgsql-performance(at)postgresql(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: [SQL] EXTERNAL storage and substring on long strings
Date: 2003-08-04 15:55:48
Message-ID: 200308041655.48151.dev@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance pgsql-sql

On Monday 04 August 2003 16:25, Scott Cain wrote:
[snip]
> In an effort to find the best way to do this operation, I decided to
> look at what is my "worst case" scenario: the DNA sequence for human
> chromosome 1, which is about 250 million characters long (previous
> strings where about 20 million characters long). I wrote a perl script
> to do several substring operations over this very long string, with
> substring lengths varying between 1000 and 40,000 characters spread out
> over various locations along the string. While EXTENDED storage won in
> this case, it was a hollow victory: 38 seconds per operation versus 40
> seconds, both of which are way too long to for an interactive
> application.
>
> Time for a new method. A suggestion from my boss was to "shred" the DNA
> into smallish chunks and a column giving offsets from the beginning of
> the string, so that it can be reassembled when needed. Here is the test
> table:
>
> string=> \d dna
> Table "public.dna"
> Column | Type | Modifiers
> ---------+---------+-----------
> foffset | integer |
> pdna | text |
> Indexes: foffset_idx btree (foffset)

[snipped plpgsql function which stitches chunks together and then substrings]

> So here's the question: I've never written a plpgsql function before, so
> I don't have much experience with it; is there anything obviously wrong
> with this function, or are there things that could be done better? At
> least this appears to work and is much faster, completing substring
> operations like above in about 0.27 secs (that's about two orders of
> magnitude improvement!)

You might want some checks to make sure that smin < smax, otherwise looks like
it does the job in a good clean fashion.

Glad to hear it's going to solve your problems. Two things you might want to
bear in mind:
1. There's probably a "sweet spot" where the chunk size interacts well with
your data, usage patterns and PGs backend to give you peak performance.
You'll have to test.
2. If you want to search for a sequence you'll need to deal with the case
where it starts in one chunk and ends in another.

--
Richard Huxton
Archonet Ltd

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message scott.marlowe 2003-08-04 16:07:01 Re: [PERFORM] OSDL Database Test Suite 3 is available on PostgreSQL
Previous Message Tom Lane 2003-08-04 15:53:43 Re: [SQL] EXTERNAL storage and substring on long strings

Browse pgsql-sql by date

  From Date Subject
Next Message Scott Cain 2003-08-04 16:14:06 Re: [SQL] EXTERNAL storage and substring on long strings
Previous Message Tom Lane 2003-08-04 15:53:43 Re: [SQL] EXTERNAL storage and substring on long strings