From: | Stephen Froehlich <s(dot)froehlich(at)cablelabs(dot)com> |
---|---|
To: | "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org> |
Subject: | Normalizing Unnormalized Input |
Date: | 2017-06-20 22:50:48 |
Message-ID: | DM5PR06MB28919D0E5AB84F8D18CD207EE5C50@DM5PR06MB2891.namprd06.prod.outlook.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-novice |
I have been building a PostgreSQL database for about a year and (shame) when I first built it, I built it unnormalized. I am in the process of normalizing it at the moment (thankfully it isn't live for anyone but me, so I can freeze it until I get it all normalized).
The part of the problem that I haven't solved conceptually yet is how to normalize the incoming data. Historically, I have been pulling a csv into R and perform whatever cleaning is necessary and then writing that unnormalized table straight out to PostgreSQL using the cumbersome and slow "caroline::dbWriteTable2" function.
I have seen:
... but given the batch nature of my process, I was thinking of something a little different than this line-by-line function.
Instead, I was thinking about the following algorithm:
* Have an empty unnormalized table without indexes that is written into in batches by R.
* Have a function triggered by the update of that table that will:
* Rename the table to a temp table name
* Create a fresh blank table
* "UPSERT" INSERT INTO ... ON CONFLICT DO NOTHING each of the foreign key tables to capture any new values
* Then build indexes on each of the text fields of the unnormalized table and then do a INSERT INTO the main normalized table with a SELECT with all of the proper joins ..
* Delete the temp table
Does this sound like a good method? How will it behave if I am doing two simultaneous writes to the unnormalized input table?
One thing I'm not sure of yet is how in a function to retain a variable name of the new temp table name, but I haven't Googled that yet.
I am running PostgreSQL 9.6 and given that I both benefit from parallelization and also don't run anything "live", I'll likely be upgrading to PostgreSQL 10 in August when it is final. I do share the server with a couple of colleagues, but that's all the coordination I need.
Stephen Froehlich
Sr. Strategist, CableLabs(r)
s(dot)froehlich(at)cablelabs(dot)com
Tel: +1 (303) 661-3708
From | Date | Subject | |
---|---|---|---|
Next Message | David G. Johnston | 2017-06-20 23:10:46 | Re: Normalizing Unnormalized Input |
Previous Message | Mick | 2017-06-20 19:23:50 | Re: Question on installation of postgresql |