From: | Antonios Christofides <anthony(at)itia(dot)ntua(dot)gr> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Trading off large objects (arrays, large strings, large tables) for timeseries |
Date: | 2005-02-15 14:33:19 |
Message-ID: | 20050215143319.GA3258@itia.ntua.gr |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
My questions briefly:
(1) I made experiments with large (millions of rows/elements) arrays
of text (text[], each element is 20-30 characters). On 7.4 (Debian
Sarge prepackaged), inserting such an array takes forever (10
thousand elements per minute), but accessing, or writing an
element, or appending an element, is done instantly. On 8.0.1
(compiled by me), inserting is very fast (two million per minute,
on the same machine of course), but selecting any element takes
more than a second, and updating any row, or appending, takes 25
seconds.
Why 25 seconds for appending an element? Does it rewrite the
entire array?
(2) I also tried using a large (80M) text instead (i.e. instead of
storing an array of lines, I store a huge plain text file). What
surprised me is that I can get the 'tail' of the file (using
substring) in only around one second, although it is transparently
compressed (to 17M). It doesn't decompress the entire string, does
it? Does it store it somehow chunked?
It also takes 25 seconds to append something (update ... set
string=string||'...'). Is there any way to do clever, faster
appending, like I can in a text file?
What I'm trying to do is find a good way to store timeseries. A
timeseries is essentially a series of (date, value) pairs, and more
specifically it is an array of records, each record consisting of
three items: date TIMESTAMP, value DOUBLE PRECISION, flags TEXT. The
flags is null in more than 99% of the records, but occasionally it
contains flags or short comments. My above experiments are with
comma-separated values in plain ascii; I haven't experimented with an
array of composite type yet.
The most important operations are:
(1) Retrieving or inserting/replacing the entire timeseries
(2) Selecting the last record
(3) Appending a record
I'm not interested in selecting part of a timeseries, or updating a
record in the middle. Such operations do happen, but you can do them
by selecting/replacing the entire timeseries instead. But (2) and (3)
will be very frequent, and I don't want to do them through (1).
I'm also considering a table, of course, where each timeseries record
will be one row. I have experimented only a little with that (id
integer, date timestamp, value double precision, flags text, primary
key(id, date)). It appears to be fast alright, a problem being that it
appears to consume much disk space (about 4 times more than an array
of text, which is about 4 times more than a single huge text).
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2005-02-15 14:47:14 | Re: database encoding "WIN" -- Western or Cyrillic? |
Previous Message | Scott Marlowe | 2005-02-15 14:29:08 | Re: Lost rows/data corruption? |