From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Rob W *EXTERN*" <digital_illuminati(at)yahoo(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Efficiently move data from one table to another, with FK constraints?
Date: 2009-07-07 09:12:36
Message-ID: D960CB61B694CF459DCFB4B0128514C202FF66AC@exadv11.host.magwien.gv.at
Lists: pgsql-general
Rob W wrote:
> I am using COPY to bulk load large volumes (i.e. multi GB
> range) of data to a staging table in a PostgreSQL 8.3 database.
> For
> performance, the staging table has no constraints, no primary
> key, etc. I want to move that data into the "real" tables,
> but need some advice on how to do that efficiently.
>
> Here's a simple, abbreviated example of tables and relations
> I'm working with (in reality there are a lot more columns and
> foreign keys).
>
> /* The raw bulk-loaded data. No indexes or constraints. */
> CREATE TABLE log_entry (
>     req_time TIMESTAMP NOT NULL,
>     url TEXT NOT NULL,
>     bytes INTEGER NOT NULL
> );
>
> /* Lookup table for the URLs. Created before request,
>    which references it. */
> CREATE TABLE url (
>     id SERIAL PRIMARY KEY,
>     path TEXT UNIQUE NOT NULL
> );
>
> /* Where the data will be moved to. Will have indexes, etc */
> CREATE TABLE request (
>     id BIGSERIAL PRIMARY KEY,
>     req_time TIMESTAMP WITH TIME ZONE NOT NULL,
>     bytes INTEGER NOT NULL,
>     fk_url INTEGER REFERENCES url NOT NULL
> );
>
> Is there a way to move this data in bulk efficiently?
> Specifically, I'm wondering how to handle the foreign keys.
> The naive approach is:
>
> 1) For each column that is a foreign key in the target table,
> do INSERT ... SELECT DISTINCT ... to copy all the values
> into the appropriate child tables.
> 2) For each row in log_entry, do a similar insert to insert
> the data with the appropriate foreign keys.
> 3) Delete the contents of table log_entry using TRUNCATE.
>
> Obviously, this would be very slow when handling tens of
> millions of records. Are there faster approaches to solving
> this problem?
How about something like this:
INSERT INTO url (path) (SELECT DISTINCT url FROM log_entry);
Then
INSERT INTO request (req_time, bytes, fk_url)
(SELECT l.req_time, l.bytes, u.id
 FROM log_entry AS l
 JOIN url AS u ON (l.url = u.path));
I didn't test it, so there may be syntax errors and stuff.
But I doubt that it can be done much more efficiently.
Creating an index on log_entry(url) *might* improve performance. Check with EXPLAIN.
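For example (again untested; the index name is just illustrative):

CREATE INDEX log_entry_url_idx ON log_entry (url);
ANALYZE log_entry;

EXPLAIN
SELECT l.req_time, l.bytes, u.id
FROM log_entry AS l
JOIN url AS u ON (l.url = u.path);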
The TRUNCATE should not be a very expensive operation.
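Putting it all together, an untested sketch of the whole move as a single
transaction (TRUNCATE rolls back like everything else in PostgreSQL, so a
failure leaves the staging data intact; this assumes url starts out empty,
otherwise you would have to filter out already existing paths first):

BEGIN;

-- fill the lookup table with the distinct URLs
INSERT INTO url (path)
(SELECT DISTINCT url FROM log_entry);

-- move the log entries, resolving each URL to its generated id
INSERT INTO request (req_time, bytes, fk_url)
(SELECT l.req_time, l.bytes, u.id
 FROM log_entry AS l
 JOIN url AS u ON (l.url = u.path));

-- empty the staging table
TRUNCATE log_entry;

COMMIT;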
Yours,
Laurenz Albe