Re: optimizing import of large CSV file into partitioned table?

From: Nagy Zoltan <kirk(at)bteam(dot)hu>
To: Thom Brown <thombrown(at)gmail(dot)com>, postgresql Forums <pgsql-general(at)postgresql(dot)org>
Subject: Re: optimizing import of large CSV file into partitioned table?
Date: 2010-03-29 00:23:15
Message-ID: 4BAFF2F3.8010506@bteam.hu
Lists: pgsql-general

hi,

I would recommend converting the input with some scripts into pg_dump format and importing it into small
temporary tables without indexes - and once all the data is in the db, you can partition it...
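
the staging-table half of that would look roughly like this (only a rough,
untested sketch - "yourdb", the staging table name and the file path are just
placeholders for your own):

#!/bin/sh
# load everything into one index-free staging table (no trigger on it),
# then fan the rows out into the child tables and build the indexes afterwards
psql -d yourdb <<'SQL'
CREATE TABLE allcalls_staging (LIKE allcalls);
COPY allcalls_staging FROM '/data/allcalls-csv/allcalls.csv' WITH CSV;

INSERT INTO allcalls_0 SELECT * FROM allcalls_staging
    WHERE phonenum < 1000000000;
INSERT INTO allcalls_1 SELECT * FROM allcalls_staging
    WHERE phonenum >= 1000000000 AND phonenum < 2000000000;
-- ...repeat for the remaining child tables...

CREATE INDEX allcalls_0_phonenum ON allcalls_0 (phonenum);
-- ...repeat for the remaining indexes...

DROP TABLE allcalls_staging;
SQL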

alternatively, you can pre-partition your data using simple grep; this way you can import the data directly into the partitioned scheme
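
e.g. something along these lines (again only a rough, untested sketch; it
assumes at most 10-digit phone numbers and the same placeholder names as above):

#!/bin/sh
# split the CSV by the leading digit of the phone number, one chunk per child
# table, then COPY each chunk straight into its child - no trigger involved
CSV=/data/allcalls-csv/allcalls.csv
grep -E '^[0-9]{1,9},' $CSV > /tmp/allcalls_0.csv      # phonenum < 1000000000
for d in 1 2 3 4 5 6 7 8 9; do
    grep -E "^$d[0-9]{9}," $CSV > /tmp/allcalls_$d.csv # 10-digit numbers starting with $d
done
for d in 0 1 2 3 4 5 6 7 8 9; do
    psql -d yourdb -c "COPY allcalls_$d FROM '/tmp/allcalls_$d.csv' WITH CSV"
done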

kirk

Thom Brown wrote:
> On 28 March 2010 18:33, Rick Casey <caseyrick(at)gmail(dot)com> wrote:
>
> After careful research, I would like to post the following problem I'm
> having with the import of a large (16Gb) CSV file. Here is a brief
> synopsis:
> - this is running on Postgres (PG) version: PostgreSQL 8.3.9 on
> i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Ubuntu
> 4.3.2-1ubuntu11) 4.3.2
> - it is running on an Ubuntu (small) server instance at Amazon Web
> Services (AWS), with a 320Gb volume mounted for the PG data directory
> - the database was created using the partitioning example in the
> documentation, with an insert trigger and a function to direct which
> table each record gets inserted into.
> (see below for code on my table and trigger creation)
>
> After some days of attempting to import the full 16Gb CSV file, I
> decided to split the thing up, using the split utility in Linux.
> This seemed to improve things; once I had split the CSV files into
> about 10Mb size files, I finally got my first successful import of
> about 257,000 recs. However, this is going to be a rather
> labor-intensive process to import the full 16Gb file if I have to
> manually split it up and import each smaller file separately.
>
> So, I am wondering if there is any way to optimize this process? I have
> been using Postgres for several years, but have never had to
> partition or optimize it for files of this size until now.
> Any comments or suggestions would be most welcomed from this
> excellent forum.
>
> (I might add that I spent several weeks prior to this trying to get
> this to work in MySQL, which I finally had to abandon.)
>
> Sincerely,
> Rick
>
> Details of the code follow:
>
> Here is the basic COPY command, which I run as the postgres user, to
> import the CSV files:
> <begin>
> COPY allcalls FROM '/data/allcalls-csv/sub3ab' WITH CSV;
>
> Here is what some sample data look like in the files:
> 3153371867,2008-02-04 16:11:00,1009,1,40
> 2125673062,2008-02-04 16:11:00,1009,1,41
> 5183562377,2008-02-04 16:11:00,1009,1,50
> ...
>
> Here are the basic scripts that created the partition table and
> insert trigger:
> CREATE TABLE allcalls (
> phonenum bigint,
> callstarted timestamp without time zone,
> status int,
> attempts int,
> duration int
> );
> CREATE TABLE allcalls_0 (
> CHECK ( phonenum < 1000000000 )
> ) INHERITS (allcalls);
> ...(repeat this 9 more times, for 10 subpartition tables)
>
> CREATE INDEX allcalls_0_phonenum ON allcalls_0 (phonenum);
> ...(repeat this 9 more times, for indexes on the 10 subpartition tables)
> CREATE OR REPLACE FUNCTION allcalls_insert_trigger()
> RETURNS TRIGGER AS $$
> BEGIN
> IF ( NEW.phonenum < 1000000000 ) THEN
> INSERT INTO allcalls_0 VALUES (NEW.*);
> ELSIF ( NEW.phonenum >= 1000000000 AND NEW.phonenum < 2000000000
> ) THEN
> INSERT INTO allcalls_1 VALUES (NEW.*);
> ...(again, repeat for the rest of the partition tables)
>
> CREATE TRIGGER insert_phonenum_trigger
> BEFORE INSERT ON allcalls
> FOR EACH ROW EXECUTE PROCEDURE allcalls_insert_trigger();
>
> <end>
>
>
> The problem here is that you appear to require an index update, trigger
> firing and constraint check for every single row. The first thing I'd
> suggest is to remove the indexes and recreate them after your import;
> otherwise it has to update the index for every single entry. The trigger
> won't help either: import into a single table and split it out into
> further tables afterwards if required. Finally, the constraints should
> probably be applied afterwards too, so cull any violating rows after
> importing.
>
> Thom

--
Nagy Zoltan (kirk) <kirk(at)bteam(dot)hu>
