Re: Growth planning

From: Alban Hertroys <haramrae(at)gmail(dot)com>
To: Israel Brewster <ijbrewster(at)alaska(dot)edu>
Cc: PostgreSQL Mailing Lists <pgsql-general(at)postgresql(dot)org>
Subject: Re: Growth planning
Date: 2021-10-04 20:29:05
Message-ID: 5F0C1A8D-F199-497A-B7FA-8143E48BB020@gmail.com
Lists: pgsql-general


> On 4 Oct 2021, at 18:22, Israel Brewster <ijbrewster(at)alaska(dot)edu> wrote:

(…)

> the script owner is talking about wanting to process and pull in “all the historical data we have access to”, which would go back several years, not to mention the probable desire to keep things running into the foreseeable future.

(…)

> - The largest SELECT workflow currently is a script that pulls all available data for ONE channel of each station (currently; I suspect that will change to all channels in the near future) and runs some post-processing machine learning algorithms on it. This script (written in R, if that makes a difference) currently takes around half an hour to run, and is run once every four hours. I would estimate about 50% of the run time is data retrieval and the rest doing its own thing. I am only responsible for integrating this script with the database; what it does with the data (and therefore how long that takes, as well as what data is needed) is up to my colleague. I have this script running on the same machine as the DB to minimize data transfer times.

I suspect that a large portion of that time is spent on downloading this data to the R script. Would it help to rewrite it in PL/R and do (part of) the ML calculations on the DB side?
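
A minimal sketch of that idea, assuming the PL/R extension is installed on the server. The table measurements, its columns (station_id, channel, ts, reading) and the function name channel_summary are hypothetical placeholders, not taken from the actual schema, and median() merely stands in for whatever part of the R processing could run inside the database:

```sql
-- Assumption: PL/R is available on the server.
CREATE EXTENSION IF NOT EXISTS plr;

-- Hypothetical function: PL/R passes a one-dimensional float8[] argument
-- to the R body as a numeric vector, with SQL NULLs arriving as NA.
CREATE OR REPLACE FUNCTION channel_summary(vals float8[])
RETURNS float8 AS $$
    # R code runs inside the backend; only the scalar result leaves it.
    median(vals, na.rm = TRUE)
$$ LANGUAGE plr;

-- Hypothetical usage: one result row per station is returned to the
-- client instead of every raw sample for the channel.
SELECT station_id,
       channel_summary(array_agg(reading ORDER BY ts)) AS summary
FROM measurements
WHERE channel = 1
GROUP BY station_id;
```

Whether this helps depends on how much of the work reduces the data: if the R code needs every raw sample in the end anyway, moving it into the database saves little.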

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.
