Re: Inserting 26 million rows takes 8 hours, how to improve those times?

From: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To: Jose Vicente Nunez Z <josevnz(at)newbreak(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Inserting 26 million rows takes 8 hours, how to improve those times?
Date: 2003-09-16 16:04:27
Message-ID: Pine.LNX.4.33.0309160952250.4245-100000@css120.ihs.com
Lists: pgsql-admin

On 16 Sep 2003, Jose Vicente Nunez Z wrote:

> Tom,
>
>
> On Tue, 2003-09-16 at 10:44, Tom Lane wrote:
> > Jose Vicente Nunez Z <josevnz(at)newbreak(dot)com> writes:
> > > For that I'm using the PostgreSQL COPY tool, but the problem is that it
> > > is taking almost 9 hours to load the data:
> >
> > Hmm, 889 rows/second doesn't sound that bad, considering that you've
> > given us no data about the hardware you're running on. In particular,
> > what's your disk setup?
>
> My fault, I should have provided more information about my setup.
>
> I got two ATA disks (one for PostgreSQL, the other for the data files).
> The box is a dual-CPU Pentium III 1.3 GHz with 2 GB of RAM. One disk is
> 30 GB, the other is 40 GB.
>
> As you can see it is not state-of-the-art hardware, but for development
> purposes I should get decent performance (only one or two applications
> will be accessing the server to do some data mining).
>
> >
> > But at any rate, the only bottlenecks I could think of for a COPY
> > command are:
> >
> > 1. Data conversion. This would depend on the column datatypes of the
> > table, which you didn't tell us.
> >
> create table mytable
> (
> a varchar(20) not null,
> b varchar(20) not null,
> c varchar(20) not null,
> d char(6),
> f int null,
> g float not null,
> h float not null
> )
>
> Data conversions shouldn't be that complicated, should they?
>
> > 2. Index updating. This would depend on the indexes on the table,
> > which you didn't tell us.
>
> I had one index and dropped it for the load. I also turned off the 'fsync'
> flag (fsync=false), and now the load time is only 15 minutes! But then
> recreating the original index takes forever:
>
> create index myindex on mytable ( a, b, c );

Are you sure this is the type of index you want? Keep in mind PostgreSQL
won't use this index for queries like:

select * from mytable where b='abc';

or

select * from mytable where a like '%abc%';

but only for queries that reference the leading columns in order (a, then
a and b, then all three), and, for text comparisons, only with
left-anchored patterns.

select * from mytable where a='abc' and b='def';

will work. You might want to look at using the tsearch engine just updated
for 7.4.x for this kind of thing, or just running seq scans with a whole
lot of shared_buffers to hold your data.
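
If you want to check whether the planner will actually use the multicolumn
index for a given query, EXPLAIN will show you (a quick sketch using the
table and index names from above):

explain select * from mytable where a='abc' and b='def';

An "Index Scan using myindex" line in the plan means the index is being
used; "Seq Scan on mytable" means it is not.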

Cranking shared_buffers up to somewhere between 1000 and 20000 should
help. Note that there's a point of diminishing returns for certain queries
with large shared_buffers, where buffer management costs more than having
the data in the buffer. For what you're doing, however, larger may well be
better.
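
For example, in postgresql.conf (the value is illustrative, not tuned for
your box; each buffer is 8 KB, so 20000 buffers is roughly 160 MB, and you
may need to raise the kernel's shared memory limit and restart the
postmaster before it will take):

shared_buffers = 20000          # ~160 MB worth of 8 KB buffers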

Oh, and don't forget to vacuum and analyze...
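
A single command after the load and index rebuild covers both (mytable is
the example table from above):

vacuum analyze mytable;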
