Quick Links

Re: COPY FROM performance improvements

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Mark Wong <markw(at)osdl(dot)org>
Cc:	Alon Goldshuv <agoldshuv(at)greenplum(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: COPY FROM performance improvements
Date:	2005-07-19 22:17:52
Message-ID:	42DD7C10.4090008@dunslane.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-patches pgsql-performance

Mark,

You should definitely not be doing this sort of thing, I believe:

CREATE TABLE orders (
o_orderkey INTEGER,
o_custkey INTEGER,
o_orderstatus CHAR(1),
o_totalprice REAL,
o_orderDATE DATE,
o_orderpriority CHAR(15),
o_clerk CHAR(15),
o_shippriority INTEGER,
o_comment VARCHAR(79),
PRIMARY KEY (o_orderkey))

Create the table with no constraints, load the data, then set up primary keys and whatever other constraints you want using ALTER TABLE. Last time I did a load like this (albeit 2 orders of magnitude smaller) I saw a 50% speedup from deferring constarint creation.

cheers

andrew

Mark Wong wrote:

>Hi Alon,
>
>Yeah, that helps. I just need to break up my scripts a little to just
>load the data and not build indexes.
>
>Is the following information good enough to give a guess about the data
>I'm loading, if you don't mind? ;) Here's a link to my script to create
>tables:
>http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
>
>File sizes:
>-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
>-rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
>-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
>-rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
>-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
>-rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
>-rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
>-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
>
>Number of rows:
># wc -l *.tbl
> 15000000 customer.tbl
> 600037902 lineitem.tbl
> 25 nation.tbl
> 150000000 orders.tbl
> 20000000 part.tbl
> 80000000 partsupp.tbl
> 5 region.tbl
> 1000000 supplier.tbl
>
>Thanks,
>Mark
>
>On Tue, 19 Jul 2005 14:05:56 -0700
>"Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:
>
>
>
>>Hi Mark,
>>
>>I improved the data *parsing* capabilities of COPY, and didn't touch the
>>data conversion or data insertion parts of the code. The parsing improvement
>>will vary largely depending on the ratio of parsing -to- converting and
>>inserting.
>>
>>Therefore, the speed increase really depends on the nature of your data:
>>
>>100GB file with
>>long data rows (lots of parsing)
>>Small number of columns (small number of attr conversions per row)
>>less rows (less tuple insertions)
>>
>>Will show the best performance improvements.
>>
>>However, same file size 100GB with
>>Short data rows (minimal parsing)
>>large number of columns (large number of attr conversions per row)
>>AND/OR
>>more rows (more tuple insertions)
>>
>>Will show improvements but not as significant.
>>In general I'll estimate 40%-95% improvement in load speed for the 1st case
>>and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
>>etc... This is for TEXT format. As for CSV, it may be faster but not as much
>>as I specified here. BINARY will stay the same as before.
>>
>>HTH
>>Alon.
>>
>>
>>
>>
>>
>>
>>On 7/19/05 12:54 PM, "Mark Wong" <markw(at)osdl(dot)org> wrote:
>>
>>
>>
>>>On Thu, 14 Jul 2005 17:22:18 -0700
>>>"Alon Goldshuv" <agoldshuv(at)greenplum(dot)com> wrote:
>>>
>>>
>>>
>>>>I revisited my patch and removed the code duplications that were there, and
>>>>added support for CSV with buffered input, so CSV now runs faster too
>>>>(although it is not as optimized as the TEXT format parsing). So now
>>>>TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file.
>>>>
>>>>
>>>Hi Alon,
>>>
>>>I'm curious, what kind of system are you testing this on? I'm trying to
>>>load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
>>>interested in the results you would expect.
>>>
>>>Mark
>>>
>>>
>>>
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>
>
>

In response to

Re: COPY FROM performance improvements at 2005-07-19 21:37:58 from Mark Wong

Responses

Re: COPY FROM performance improvements at 2005-07-19 22:51:33 from Mark Wong

Browse pgsql-patches by date

	From	Date	Subject
Next Message	Mark Wong	2005-07-19 22:51:33	Re: COPY FROM performance improvements
Previous Message	Alon Goldshuv	2005-07-19 22:06:17	Re: COPY FROM performance improvements

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Mark Wong	2005-07-19 22:51:33	Re: COPY FROM performance improvements
Previous Message	Alon Goldshuv	2005-07-19 22:06:17	Re: COPY FROM performance improvements