From: Andy Colson <andy(at)squeakycode(dot)net>
To: P Kishor <punk(dot)kish(at)gmail(dot)com>
Cc: PostgreSQL <pgsql-general(at)postgresql(dot)org>
Subject: Re: optimizing daily data storage in Pg
Date: 2010-07-23 13:48:11
Message-ID: 4C499D9B.3050101@squeakycode.net
Lists: pgsql-general
On 7/23/2010 12:39 AM, P Kishor wrote:
> On Thu, Jul 22, 2010 at 4:56 PM, Andy Colson <andy(at)squeakycode(dot)net> wrote:
>> On 7/22/2010 9:41 AM, P Kishor wrote:
>>>
>>> I have been struggling with this for a while now, have even gone down
>>> a few paths but struck out, so I turn now to the community for ideas.
>>> First, the problem: Store six daily variables for ~ 25 years for cells
>>> in a grid.
>>>
>>> * Number of vars = 6
>>> * Number of cells ~ 13 million
>>> * Number of days ~ 9125 (25 * 365)
>>>
>>> Optimize the store for two different kinds of queries:
>>>
>>> Query one: Retrieve the value of a single var for all or a portion of
>>> the cells for a single day. This is analogous to an image where every
>>> pixel is the value of a single var.
>>
>>> SELECT <var> FROM d WHERE yr = ? AND yday = ?;
>>> SELECT <var> FROM d WHERE yr = ? AND yday = ? AND cell_id IN
>>> (?,?,?...);
>>
>>
>>>
>>> Query two: Retrieve values for all the days or a duration of days for
>>> a single var for a single cell. This is like grabbing a column out of
>>> a table in which each row holds all the vars for a single day.
>>
>>> SELECT <var> FROM d WHERE cell_id = ?;
>>> SELECT <var> FROM d WHERE cell_id IN (?,?,?...);
>>
>>
>>
>> First, I must admit to not reading your entire email.
>
> I am not sure how to respond to your feedback given that you haven't
> read the entire email. Nevertheless, thanks for writing...
>
Heh, sorry. My point was that you put a lot of information into your
email, and I was only going to use one bit of it: the row counts.
>>
>> Second, Query 1 should be fast, regardless of how you lay out the tables.
>
> It is not fast. Right now I have data for about 250,000 cells loaded.
> That comes to circa 92 million rows per year. Performance is pretty
> sucky.
>
>
This query should return one record, correct? That should be very fast:
PG should be able to find the record in the index within about five
seeks, and then fetch the data in one more. Can you post an 'explain
analyze' for this query? (Or it could be that I have totally
misunderstood your data.)
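
For what it's worth, here is roughly what I would expect to work (I am
guessing at your table and column names from the earlier mail, and the
index name and literal values are just placeholders, so adjust to
taste):

   -- composite index so the per-day lookup is one index range scan
   CREATE INDEX d_day_cell_idx ON d (yr, yday, cell_id);

   -- then time the actual query
   EXPLAIN ANALYZE
   SELECT <var> FROM d WHERE yr = 2009 AND yday = 180 AND cell_id = 12345;

If the plan shows an index scan on something like that and it is still
slow, then the problem is somewhere other than the table layout.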
>>
>> Third, Query 2 will return 13M rows? I don't think it matters how you
>> lay out the tables; returning 13M rows is always going to be slow.
>>
>
> Yes, I understand that. In reality I will never get 13 M rows. For
> display purposes, I will probably get around 10,000 rows to 50,000
> rows. When more rows are needed, it will be to feed a model, so that
> can be offline (without an impatient human being waiting on the other
> end).
>
> Right now, my main problem is that I have either too many rows (~4 B
> rows) in a manageable number of tables (25 tables), or a manageable
> number of rows (~13 M rows) in too many tables (~9000 tables).
>
Searching by just cell_id is not going to be very selective, and with
large result sets I can see this one being slow. As Scott talked about
in his response, this one will come down to hardware. Have you
dd-tested your hardware?
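
If it helps, here is a sketch of what I would try for the per-cell
query (again, the index name, column names, and values are guesses
based on your earlier mail):

   -- keyed on cell_id first, so all the days for one cell sit in one
   -- contiguous index range
   CREATE INDEX d_cell_day_idx ON d (cell_id, yr, yday);

   EXPLAIN ANALYZE
   SELECT <var> FROM d WHERE cell_id = 12345 AND yr = 2009;

That only helps PG find the rows quickly; actually hauling tens of
thousands of them off disk is where the hardware comes in.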
-Andy