From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Growth planning
Date: 2021-10-04 17:22:08
Message-ID: 8583b994-130a-71a9-c4a6-73cf953ac7f6@gmail.com
Lists: pgsql-general
On 10/4/21 11:09 AM, Israel Brewster wrote:
>> On Oct 4, 2021, at 8:46 AM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:
>>
>>> On Oct 4, 2021, at 10:22 AM, Israel Brewster <ijbrewster(at)alaska(dot)edu> wrote:
>> Guessing the "sd" is "standard deviation"? Any chance those stddevs
>> are easily calculable from base data? Could cut your table size in
>> half (and put those 20 cores to work on the reporting).
>
> Possible - I’d have to dig into that with the script author. I was
> just handed an R script (I don’t work with R…) and told here’s the
> data it needs, here’s the output we need stored in the DB. I then
> spent just enough time with the script to figure out how to hook up
> the I/O. The schema is pretty much just a raw dump of the output - I
> haven’t really spent any resources figuring out what, exactly, the
> data is. Maybe I should :-)
>
>> And I wonder if the last three indices are strictly necessary? They
>> take disc space too.
>
> Not sure. Here’s the output from pg_stat_all_indexes:
>
> volcano_seismology=# select * from pg_stat_all_indexes where relname='data';
>  relid | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
> -------+------------+------------+---------+---------------------------+----------+--------------+---------------
>  19847 |      19869 | public     | data    | data_pkey                 |        0 |            0 |             0
>  19847 |      19873 | public     | data    | date_station_channel_idx  |   811884 |  12031143199 |    1192412952
>  19847 |      19875 | public     | data    | station_channel_epoch_idx |        8 |       318506 |        318044
>  19847 |      19876 | public     | data    | station_data_idx          |     9072 |         9734 |          1235
>  19847 |      19877 | public     | data    | station_date_idx          |   721616 |  10927533403 |   10908912092
>  19847 |      20479 | public     | data    | data_station_channel_idx  |    47293 | 194422257262 |    6338753379
> (6 rows)
>
> so they *have* been used (although not the station_data_idx so much),
> but this doesn’t tell me when it was last used, so some of those may
> be queries I was experimenting with to see what was fastest, but are
> no longer in use. Maybe I should keep an eye on this for a while, see
> which values are increasing.
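The counters in pg_stat_all_indexes are cumulative and carry no timestamp, so one way to "keep an eye on this" is to snapshot them periodically and look for indexes whose scan count never moves. A sketch; the snapshot table name is made up:

```sql
-- Hypothetical snapshot table (same shape as the interesting columns):
CREATE TABLE IF NOT EXISTS index_usage_snap AS
    SELECT now() AS snap_time, indexrelname, idx_scan
    FROM   pg_stat_all_indexes
    WHERE  relname = 'data'
    LIMIT  0;

-- Run periodically (e.g. from cron) to record current counters:
INSERT INTO index_usage_snap
    SELECT now(), indexrelname, idx_scan
    FROM   pg_stat_all_indexes
    WHERE  relname = 'data';

-- After a while: indexes whose idx_scan has not changed across snapshots
-- are candidates for dropping.
SELECT indexrelname
FROM   index_usage_snap
GROUP  BY indexrelname
HAVING max(idx_scan) = min(idx_scan);
```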
>
>>
>> But my bet is you’re headed for partitioning on datetime or perhaps
>> station.
>
> While datetime partitioning seems to be the most common, I’m not clear
> on how that would help here, as the most intensive queries need *all*
> the datetimes for a given station, and even the smaller queries would
> be getting an arbitrary time range potentially spanning several, if
> not all, partitions. Partitioning on station, however, seems to make
> sense - there are over 100 of those, and pretty much any query will
> only deal with a single station at a time. Perhaps, if finer
> partitioning would be better, partition by both station and channel?
> The queries that need to be fastest will only be looking at a single
> channel of a single station.
>
> I’ll look into this a bit more, maybe try some experimenting while I
> still have *relatively* little data. My main hesitation here is that
> in the brief look I’ve given partitioning so far, it looks to be a
> royal pain to get set up. Any tips for making that easier?
>
>
If no queries address multiple stations, you could do a table per
station. It doesn't smell good, but you have a lot of data and, well,
speed kills.
I think date_station_channel_idx could "take over" for
station_date_idx. Naturally the latter is chosen if you give just those
two fields, but I would be curious to see how well the former performs
given just its first two columns (if station_date_idx didn't exist).
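On the setup-pain question: declarative partitioning (Postgres 10+) is mostly boilerplate DDL that scripts well. A sketch of LIST partitioning by station — the column names and station code are assumptions based on the discussion, not the real schema:

```sql
-- Hypothetical partitioned replacement for the "data" table:
CREATE TABLE data_part (
    station  text        NOT NULL,
    channel  text        NOT NULL,
    datetime timestamptz NOT NULL,
    value    double precision
    -- remaining columns from the existing "data" table would go here
) PARTITION BY LIST (station);

-- One partition per station; repeat (or generate) for each of the ~100 stations:
CREATE TABLE data_part_ak01 PARTITION OF data_part FOR VALUES IN ('AK01');

-- Each partition carries its own, much smaller index for the hot query shape:
CREATE INDEX ON data_part_ak01 (channel, datetime);
```

Roughly 100 partitions is comfortable for the planner, and since every hot query names a single station, partition pruning would confine each query to one partition.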