Re: Growth planning

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Growth planning
Date: 2021-10-04 17:22:08
Message-ID: 8583b994-130a-71a9-c4a6-73cf953ac7f6@gmail.com
Lists: pgsql-general

On 10/4/21 11:09 AM, Israel Brewster wrote:
>> On Oct 4, 2021, at 8:46 AM, Rob Sargent <robjsargent(at)gmail(dot)com
>> <mailto:robjsargent(at)gmail(dot)com>> wrote:
>>
>>> On Oct 4, 2021, at 10:22 AM, Israel Brewster <ijbrewster(at)alaska(dot)edu
>>> <mailto:ijbrewster(at)alaska(dot)edu>> wrote:
>> Guessing the “sd” is "standard deviation”?  Any chance those stddevs
>> are easily calculable from base data?  Could cut your table size in
>> half (and put those 20 cores to work on the reporting).
>
> Possible - I’d have to dig into that with the script author. I was
> just handed an R script (I don’t work with R…) and told here’s the
> data it needs, here’s the output we need stored in the DB. I then
> spent just enough time with the script to figure out how to hook up
> the I/O. The schema is pretty much just a raw dump of the output - I
> haven’t really spent any resources figuring out what, exactly, the
> data is. Maybe I should :-)
>
>>  And I wonder if the last three indices are strictly necessary? They
>> take disc space too.
>
> Not sure. Here’s the output from pg_stat_all_indexes:
>
> volcano_seismology=# select * from pg_stat_all_indexes where
> relname='data';
>  relid | indexrelid | schemaname | relname |       indexrelname        | idx_scan | idx_tup_read | idx_tup_fetch
> -------+------------+------------+---------+---------------------------+----------+--------------+---------------
>  19847 |      19869 | public     | data    | data_pkey                 |        0 |            0 |             0
>  19847 |      19873 | public     | data    | date_station_channel_idx  |   811884 |  12031143199 |    1192412952
>  19847 |      19875 | public     | data    | station_channel_epoch_idx |        8 |       318506 |        318044
>  19847 |      19876 | public     | data    | station_data_idx          |     9072 |         9734 |          1235
>  19847 |      19877 | public     | data    | station_date_idx          |   721616 |  10927533403 |   10908912092
>  19847 |      20479 | public     | data    | data_station_channel_idx  |    47293 | 194422257262 |    6338753379
> (6 rows)
>
> so they *have* been used (although not the station_data_idx so much),
> but this doesn’t tell me when each was last used, so some of those may
> reflect queries I was experimenting with to see what was fastest but
> that are no longer in use. Maybe I should keep an eye on this for a
> while and see which values are increasing.
>
>>
>> But my bet is you’re headed for partitioning on datetime or perhaps
>> station.
>
> While datetime partitioning seems to be the most common, I’m not clear
> on how that would help here, as the most intensive queries need *all*
> the datetimes for a given station, and even the smaller queries would
> be getting an arbitrary time range potentially spanning several, if
> not all, partitions. Now, partitioning on station seems to make sense -
> there are over 100 of those, and pretty much any query will only deal
> with a single station at a time. Perhaps, if more partitions would be
> better, partition by both station and channel? The queries that need to
> be fastest will only be looking at a single channel of a single station.
>
> I’ll look into this a bit more, maybe try some experimenting while I
> still have *relatively* little data. My main hesitation here is that
> in the brief look I’ve given partitioning so far, it looks to be a
> royal pain to get set up. Any tips for making that easier?
>
>
If no queries address multiple stations, you could do a table per
station.  Doesn't smell good, but you have a lot of data and, well,
speed kills.
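
As for tips on setup: declarative list partitioning may be less painful
than it looks.  A minimal sketch - the column names (station, channel,
datetime) here are guesses, not taken from your actual schema:

```sql
-- Sketch only: column names are assumptions; adapt to the real schema.
CREATE TABLE data_part (
    station  text        NOT NULL,
    channel  text        NOT NULL,
    datetime timestamptz NOT NULL,
    -- ... remaining measurement columns ...
    PRIMARY KEY (station, channel, datetime)
) PARTITION BY LIST (station);

-- One partition per station; with 100+ stations you would generate
-- these statements from SELECT DISTINCT station FROM data.
CREATE TABLE data_part_ak01 PARTITION OF data_part
    FOR VALUES IN ('AK01');
```

Loading is then just an INSERT ... SELECT from the old table, and
single-station queries touch only one partition.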

I think the date-station-channel index could "take over" for
station-date.  Naturally the latter is chosen if you give just those two
fields, but I would be curious to see how well the former performs given
just its first two fields (when station-date doesn't exist).
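
One way to try that without committing to anything - this assumes the
index really is on (date, station) as its leading columns, and the
column names are again guesses:

```sql
-- DDL is transactional in Postgres, so the DROP can be rolled back.
-- Caveat: DROP INDEX takes an ACCESS EXCLUSIVE lock on the table until
-- the transaction ends, so do this on a quiet system or a copy.
BEGIN;
DROP INDEX station_date_idx;
EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM data
  WHERE station = 'AK01'
    AND datetime >= '2021-09-01' AND datetime < '2021-10-01';
ROLLBACK;  -- the index comes back; nothing is committed
```

If the plan still shows an index scan on date_station_channel_idx with
comparable timing, station_date_idx is a candidate for dropping.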
