Re: Growth planning

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Growth planning
Date: 2021-10-04 17:22:08
Message-ID: 8583b994-130a-71a9-c4a6-73cf953ac7f6@gmail.com
Lists: pgsql-general

On 10/4/21 11:09 AM, Israel Brewster wrote:
>> On Oct 4, 2021, at 8:46 AM, Rob Sargent <robjsargent(at)gmail(dot)com
>> <mailto:robjsargent(at)gmail(dot)com>> wrote:
>>
>>> On Oct 4, 2021, at 10:22 AM, Israel Brewster <ijbrewster(at)alaska(dot)edu
>>> <mailto:ijbrewster(at)alaska(dot)edu>> wrote:
>> Guessing the “sd” is "standard deviation”?  Any chance those stddevs
>> are easily calculable from base data?  Could cut your table size in
>> half (and put those 20 cores to work on the reporting).
>
> Possible - I’d have to dig into that with the script author. I was
> just handed an R script (I don’t work with R…) and told here’s the
> data it needs, here’s the output we need stored in the DB. I then
> spent just enough time with the script to figure out how to hook up
> the I/O. The schema is pretty much just a raw dump of the output - I
> haven’t really spent any resources figuring out what, exactly, the
> data is. Maybe I should :-)
>
>>  And I wonder if the last three indices are strictly necessary? They
>> take disc space too.
>
> Not sure. Here’s the output from pg_stat_all_indexes:
>
> volcano_seismology=# select * from pg_stat_all_indexes where
> relname='data';
>  relid | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
> -------+------------+------------+---------+---------------------------+----------+--------------+---------------
>  19847 |      19869 | public     | data    | data_pkey                 |        0 |            0 |             0
>  19847 |      19873 | public     | data    | date_station_channel_idx  |   811884 |  12031143199 |    1192412952
>  19847 |      19875 | public     | data    | station_channel_epoch_idx |        8 |       318506 |        318044
>  19847 |      19876 | public     | data    | station_data_idx          |     9072 |         9734 |          1235
>  19847 |      19877 | public     | data    | station_date_idx          |   721616 |  10927533403 |   10908912092
>  19847 |      20479 | public     | data    | data_station_channel_idx  |    47293 | 194422257262 |    6338753379
> (6 rows)
>
> so they *have* been used (although not the station_data_idx so much),
> but this doesn’t tell me when each was last used, so some of those may
> reflect queries I was experimenting with to see what was fastest but
> that are no longer in use. Maybe I should keep an eye on this for a
> while and see which values are increasing.
>
>>
>> But my bet is you’re headed for partitioning on datetime or perhaps
>> station.
>
> While datetime partitioning seems to be the most common, I’m not clear
> on how that would help here, as the most intensive queries need *all*
> the datetimes for a given station, and even the smaller queries would
> be getting an arbitrary time range potentially spanning several, if
> not all, partitions. Now, partitioning on station seems to make sense -
> there are over 100 of those, and pretty much any query will only deal
> with a single station at a time. Perhaps, if more partitions would be
> better, partition by both station and channel? The queries that need to
> be fastest will only be looking at a single channel of a single station.
>
> I’ll look into this a bit more, maybe try some experimenting while I
> still have *relatively* little data. My main hesitation here is that
> in the brief look I’ve given partitioning so far, it looks to be a
> royal pain to get set up. Any tips for making that easier?
>
>
If no queries address multiple stations, you could do a table per
station.  Doesn't smell good, but you have a lot of data and, well,
speed kills.
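
As for tips on setup: declarative list partitioning may be less painful
than it looks.  A minimal sketch - the column names (station, channel,
datetime) here are guesses, not taken from your actual schema:

```sql
-- Sketch only: column names are assumptions; adapt to the real schema.
CREATE TABLE data_part (
    station  text        NOT NULL,
    channel  text        NOT NULL,
    datetime timestamptz NOT NULL,
    -- ... remaining measurement columns ...
    PRIMARY KEY (station, channel, datetime)
) PARTITION BY LIST (station);

-- One partition per station; with 100+ stations you would generate
-- these statements from SELECT DISTINCT station FROM data.
CREATE TABLE data_part_ak01 PARTITION OF data_part
    FOR VALUES IN ('AK01');
```

Loading is then just an INSERT ... SELECT from the old table, and
single-station queries touch only one partition.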

I think the date-station-channel index could "take over" for
station-date.  Naturally the latter is chosen if you give just those two
fields, but I would be curious to see how well the former performs given
just its first two fields (when station-date doesn't exist).
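
One way to try that without committing to anything - this assumes the
index really is on (date, station) as its leading columns, and the
column names are again guesses:

```sql
-- DDL is transactional in Postgres, so the DROP can be rolled back.
-- Caveat: DROP INDEX takes an ACCESS EXCLUSIVE lock on the table until
-- the transaction ends, so do this on a quiet system or a copy.
BEGIN;
DROP INDEX station_date_idx;
EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM data
  WHERE station = 'AK01'
    AND datetime >= '2021-09-01' AND datetime < '2021-10-01';
ROLLBACK;  -- the index comes back; nothing is committed
```

If the plan still shows an index scan on date_station_channel_idx with
comparable timing, station_date_idx is a candidate for dropping.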
