Normalizing Unnormalized Input

From: Stephen Froehlich <s(dot)froehlich(at)cablelabs(dot)com>
To: "pgsql-novice(at)postgresql(dot)org" <pgsql-novice(at)postgresql(dot)org>
Subject: Normalizing Unnormalized Input
Date: 2017-06-20 22:50:48
Message-ID: DM5PR06MB28919D0E5AB84F8D18CD207EE5C50@DM5PR06MB2891.namprd06.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

I have been building a PostgreSQL database for about a year and (shame) when I first built it, I built it unnormalized. I am in the process of normalizing it at the moment (thankfully it isn't live for anyone but me, so I can freeze it until I get it all normalized).

The part of the problem that I haven't solved conceptually yet is how to normalize the incoming data. Historically, I have been pulling a csv into R and perform whatever cleaning is necessary and then writing that unnormalized table straight out to PostgreSQL using the cumbersome and slow "caroline::dbWriteTable2" function.

I have seen:

https://stackoverflow.com/questions/23708278/how-to-normalize-data-efficently-while-inserting-into-sql-table-postgres

... but given the batch nature of my process, I was thinking of something a little different than this line-by-line function.

Instead, I was thinking about the following algorithm:

* Have an empty unnormalized table without indexes that is written into in batches by R.
* Have a function triggered by the update of that table that will:
* Rename the table to a temp table name
* Create a fresh blank table
* "UPSERT" INSERT INTO ... ON CONFLICT DO NOTHING each of the foreign key tables to capture any new values
* Then build indexes on each of the text fields of the unnormalized table and then do a INSERT INTO the main normalized table with a SELECT with all of the proper joins ..
* Delete the temp table

Does this sound like a good method? How will it behave if I am doing two simultaneous writes to the unnormalized input table?

One thing I'm not sure of yet is how in a function to retain a variable name of the new temp table name, but I haven't Googled that yet.

I am running PostgreSQL 9.6 and given that I both benefit from parallelization and also don't run anything "live", I'll likely be upgrading to PostgreSQL 10 in August when it is final. I do share the server with a couple of colleagues, but that's all the coordination I need.

Stephen Froehlich
Sr. Strategist, CableLabs(r)

s(dot)froehlich(at)cablelabs(dot)com
Tel: +1 (303) 661-3708

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message David G. Johnston 2017-06-20 23:10:46 Re: Normalizing Unnormalized Input
Previous Message Mick 2017-06-20 19:23:50 Re: Question on installation of postgresql