From: | Richard Huxton <dev(at)archonet(dot)com> |
---|---|
To: | Dane Springmeyer <blake(at)hailmail(dot)net>, PG-General Mailing List <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Normalization tools for postgres? |
Date: | 2007-11-21 20:15:13 |
Message-ID: | 474491D1.9090403@archonet.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general pgsql-php |
Dane Springmeyer wrote:
> On Nov 21, 2007, at 1:37 AM, Richard Huxton wrote:
>> Dane Springmeyer wrote:
>>> These mostly reside in MS access in flat tables and and I am
>>> importing them into postgres.
>>
>> If you are familiar/comfortable with Access and VB, I'd probably do
>> the work there.
>
> Unfortunately I am not. I am only familiar with very basic SQL in Access
> and not with VB. And I am not interested in investing any time learning
> microsoft products. I'd rather put time in learning postgres and php or
> other languages to manipulate data in postgres.
Fair enough. It's easy enough to get started with PHP. Oh, you don't
need to run it in a webserver, you can do so from the command-line too.
Any of the scripting languages will do nicely for this sort of thing -
Perl, Python, Ruby etc.
>> 1. Import as-is into a table called e.g. raw_data
>>
>> 2. Identify/add the primary-key (presumably ID in this case) in raw_data
>> If no ID, you can add a column of type SERIAL to raw_data and let it
>> be populated automatically.
>>
>> 3. CREATE TABLE lookups.region (id SERIAL, description text NOT NULL
>> UNIQUE, PRIMARY KEY id);
>> INSERT INTO lookups.region (description) SELECT DISTINCT region from
>> raw_data;
>>
>> 4. Repeat #3 for other lookups
>>
>> 5. CREATE TABLE processed_data (...);
>> INSERT INTO processed_data (id, region_id, ...)
>> SELECT raw.id, lkp_reg.id, ...
>> FROM raw_data raw
>> JOIN lookups.region lkp_reg ON raw.region = lkp_reg.description
>> JOIN lookups.whatever...
> Wow. That was EXTREMELY helpful. With those concepts I've not been able
> to do EXACTLY what I was shooting for AND now understand the SQL well
> enough to start thinking of better ways to do it as well. Thank you.
Good.
> Here is the SQL which inserts the sample data and processes it into 4
> different tables. Perhaps I after you take a look I could post this back
> to the group?
I've cc:ed the list on this one for you - plenty of smart people on that
list.
> CREATE TABLE raw_data (
Later on, you might want to consider CREATE TEMPORARY TABLE... but don't
worry for the moment.
> wid int4,
> region character(35),
> drain character(65),
> eco character(29)
These should almost certainly be "varchar" (or "character varying") -
unless you actually want each field padded with spaces. In fact, since
this is just a temporary thing I'd make them all type "text".
> );
>
> INSERT INTO raw_data VALUES (11210, 'SW Oregon / N Cali Coast', 'Rogue
> River', 'California Coast');
> INSERT INTO raw_data VALUES (11100, 'SW Oregon / N Cali Coast', 'Coastal
> grouping of Chetco River and Pistol River', 'California Coast');
You might want to read the manuals regarding the COPY <table> FROM STDIN
command. Good for bulk-loading data.
Also, you could download the pg-odbc driver and link to PG from Access
to copy the data over.
>
> drop table region;
There's an "IF EXISTS" clause you can add to DROP TABLE - prevents errors.
> CREATE TABLE region (id SERIAL, name text NOT NULL UNIQUE, PRIMARY KEY
> (id));
> INSERT INTO region (name) SELECT DISTINCT region from raw_data;
> Select * from region;
[snip repeats for other tables]
> drop table processed_data;
> CREATE TABLE processed_data (wid int4 NOT NULL UNIQUE, region_id int4
> NOT NULL, eco_id int4 NOT NULL, drain_id int4 NOT NULL, PRIMARY KEY (wid));
> INSERT INTO processed_data (wid, region_id, eco_id, drain_id)
> SELECT w.wid, r.id, e.id, d.id
> FROM raw_data as w, region as r, eco as e, drain as d
> WHERE w.region = r.name AND w.drain = d.name AND w.eco = e.name;
Good. You'll want to read up on foreign-keys too. You can define them
when you build the table, or add them after. Something like (not checked):
ALTER TABLE processed_data ADD CONSTRAINT valid_region
FOREIGN KEY (region_id) REFERENCES region (id);
>> 6. Wrap the entire lot above in BEGIN...END so it either all works or
>> it doesn't then VACUUM FULL the database at key points.
>
> I'm not familiar with these ideas, but I'll look into them in the docs...
Read up on VACUUM, ANALYSE and the autovacuum daemon (agent).
>> If you're comfortable with a bit of VB/Perl/Python/PHP/plpgsql then
>> you can automate that fairly simply. If not, a bit of cut+paste will
>> see you there.
>
> Yes, I think I'll experiment with trying to produce this SQL text with
> php. Seems like in a very short time I could have a custom script to
> parse any table given column names and types which I'd like to break
> out. Very nice.
The fun comes when you have to clean up the data - correct spelling
mistakes, remove duplicates etc.
--
Richard Huxton
Archonet Ltd
From | Date | Subject | |
---|---|---|---|
Next Message | Garber, Mikhail | 2007-11-21 20:20:51 | Read-only availability of a standby server? |
Previous Message | Marc Munro | 2007-11-21 19:44:50 | Coordinating database user accounts with active directory |
From | Date | Subject | |
---|---|---|---|
Next Message | Emmanuel Nnko | 2007-12-28 15:31:45 | Installation Error for installation of Apache-1.3.39+php-5.2.5+OpenSSL-0.9.8g+PostgreSQL-8.1.10 |
Previous Message | gunartha | 2007-11-21 19:30:56 | string function |