Importing Large Amounts of Data

From: Curt Sampson <cjs(at)cynic(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Importing Large Amounts of Data
Date: 2002-04-11 08:28:13
Message-ID: Pine.NEB.4.43.0204111721120.11530-100000@angelic.cynic.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I've been asked by a client to do some testing of Postgres for what
appears to be OLAP on a fairly large data set (about half a billion
tuples). I'm probably going to want to try partitioning this in various
ways, but the application, not Postgres, will deal with that.

I'm using PostgreSQL 7.2.1, and the schema I'm testing with is as follows:

CREATE TABLE bigone (
rec_no int PRIMARY KEY,
day date,
user_id char(5),
value varchar(20)
) WITHOUT OIDS;
CREATE INDEX bigone_day ON bigone (day);
CREATE INDEX bigone_user_id ON bigone (user_id);

Unfortunately, the first problem I've run into is that importing is
rather slow. With all indexes (including the bigone_pkey) dropped,
importing five million tuples into the above table, starting from empty,
takes about 921 seconds. The second 5M tuples takes about 1009 seconds.
If I switch to using the -F option, the first 5M takes 714 seconds and the
second 5M takes 742 seconds. At the end, I have about 742 MB of data under
the data/base directory. (This is using a fresh database cluster.)

For comparison, the MySQL does each import in about 221 and 304 seconds,
and the data in the end take up about 427 MB.

Part of the problem here may be that Postgres appears to be logging the
COPY operation; I get from 27-33 "recycled transaction log file" messages
for every 5M tuple COPY that I do. If there were a way to do an unlogged
copy, that might save close to half a gig of writes to the disk.

The other part of the problem may just be the size of the data;
why does Postgres take up 75% more space (about 78 bytes per tuple,
vs. 45 bytes per tuple) for this table?

As well, index builds seem to take about 20% longer (using -F), and they
seem to be about 10% larger as well.

Does anybody have any suggestions as to how I can improve performance
here, and reduce disk space requirements? If not, I'll probably have
to suggest to the client that he move to MySQL for this particular
application, unless he needs any of the features that Postgres provides
and MySQL doesn't.

--
Curt Sampson <cjs(at)cynic(dot)net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC

Browse pgsql-general by date

  From Date Subject
Next Message Philip Reimer 2002-04-11 09:28:28 DataBlades
Previous Message Pierre-Andre Michel 2002-04-11 08:17:09 Why does postgres not take into account my index on a bigint column ?