From: | P Kishor <punk(dot)kish(at)gmail(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | optimizing daily data storage in Pg |
Date: | 2010-07-22 14:41:12 |
Message-ID: | AANLkTimfy5zntg4mJbK184AT-bpBSoqovtI8BvNRFQ17@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
I have been struggling with this for a while now, have even gone down
a few paths but struck out, so I turn now to the community for ideas.
First, the problem: Store six daily variables for ~ 25 years for cells
in a grid.
* Number of vars = 6
* Number of cells ~ 13 million
* Number of days ~ 9125 (25 * 365)
Optimize the store for two different kinds of queries:
Query one: Retrieve the value of a single var for all or a portion of
the cells for a single day. This is analogous to an image where every
pixel is the value of a single var.
Query two: Retrieve values for all the days or a duration of days for
a single var for a single cell. This is like grabbing a column out of
a table in which each row holds all the vars for a single day.
So, I set about designing the db. The "grid" is in a table with 13 million rows
CREATE TABLE cells (
cell_id INTEGER,
other_data ..
)
WITH (
OIDS=FALSE
)
A single table *where every row is one day's values for one cell* looks like so
CREATE TABLE d (
yr SMALLINT,
yday SMALLINT,
a SMALLINT,
b SMALLINT,
d SMALLINT,
e SMALLINT,
f SMALLINT,
g SMALLINT,
cell_id INTEGER
)
WITH (
OIDS=FALSE
)
The data would look like so
yr yday a b c d e f g cell_id
----------------------------------------------------
1980 1 x x x x x x x 1
..
1980 365 x x x x x x x 1
...
1981 1 x x x x x x x 1
..
1981 365 x x x x x x x 1
...
...
2005 1 x x x x x x x 1
..
2005 365 x x x x x x x 1
......
1980 1 x x x x x x x 2
..
1980 365 x x x x x x x 2
...
I could now (theoretically) conduct my queries like so:
Query 1a: Retrieve the value of a single var for all the cells for a
single day. This is analogous to an image where every pixel is the
value of a single var.
SELECT <var> FROM d WHERE yr = ? AND yday = ?;
I assuming I would need an index on yr and yday, or perhaps even a
compound index on (yr, yday).
Query 1b: Retrieve the value of a single var for a portion of the
cells for a single day. This is analogous to an image where every
pixel is the value of a single var.
SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN (?,?,?...);
I assuming I would need an index on yr and yday, or perhaps even a
compound index on (yr, yday) AND an index on cell_id.
Query 2: Retrieve values for all the days or a duration of days for a
single var for a single cell. This is like grabbing a column out of a
table in which each row holds all the vars for a single day.
SELECT <var> FROM d WHERE cell_id = ?;
SELECT <var> FROM d WHERE cell_id IN (?,?,?...);
Once again, an index on cell_id would assist in the above.
The problem: The above table would have 13 M * 9125 rows ~ 118 billion
rows. Huge indexes, slow queries, etc. In fact, major issues loading
the data in the first place. Since I am loading data in batches, I
drop the indexes (takes time), COPY data into the table (takes time),
build the indexes (takes time), experiment with improving the
performance (takes time), fail, rinse, lather, repeat. I actually
tried the above with a subset of data (around 100 M rows) and
experienced all of the above. I don't remember the query times, but
they were awful.
So, I partitioned the table into years like so
CREATE TABLE d_<yyyy> (
CHECK ( yr = <yyyy> )
) INHERITS (d)
Hmmm... still no satisfaction. I ended up with 1 master table + 25
inherited tables. Each of the year tables now had ~ 4.75 billion rows
(13 M * 365), and the queries were still very slow.
So, I partitioned it all by years and days like so
CREATE TABLE d_<yyyy>_<yday> (
CHECK ( yr = <yyyy> AND yday = <yday> )
) INHERITS (d)
Each table now has 13 million rows, and is reasonably fast (although
still not satisfactorily fast), but now I have 9K tables. That has its
own problems. I can't query the master table anymore as Pg tries to
lock all the tables and runs out of memory. Additionally, I can't
anymore conduct query two above. I could do something like
SELECT a FROM d_1980_1 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_2 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_3 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_4 WHERE cell_id = 1
UNION
...
But the above is hardly optimal.
Any suggestions, ideas, brainstorms would be appreciated. Perhaps Pg,
or even a RDBMS, is not the right tool for this problem, in which
case, suggestion for alternatives would be welcome as well.
Right now I am testing this on a dual Xeon dual core 3 GHz Xserve with
12 GB RAM. The PGDATA directory is located on an attached RAID that is
configured as RAID5. Reasonable time for a query would be under 500
ms, although ultimately I would love to have the query be done under
250 ms, perhaps with RAID10, and a machine with more RAM. I have
access to a machine with dual Xeon quad core 3 GHz Xserve with 32 GB
RAM, and an internal RAID, but before I try to move the data, I want
to actually conclusively prove that Pg is the best solution (or not).
Perhaps flat files are better, perhaps blobs are better, or perhaps
Pg's array column type. Any ideas/suggestions welcome.
Oh, one more thing. Once the data are loaded and everything is
working, the data are readonly. They are historical, so they don't
need to be changed.
--
Puneet Kishor
From | Date | Subject | |
---|---|---|---|
Next Message | Matthew Wilson | 2010-07-22 14:45:45 | How to improve performance in reporting database? |
Previous Message | Peter Hunsberger | 2010-07-22 14:35:08 | Re: Bitmask trickiness |