Re: COPY v. java performance comparison

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: COPY v. java performance comparison
Date: 2014-04-03 16:04:13
Message-ID: 533D867D.6020603@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 04/03/2014 09:01 AM, Thomas Kellerer wrote:
> Rob Sargent, 02.04.2014 21:37:
>> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+
>> hours (800+ records/sec). Then I tried COPY and killed that after
>> 11.25 hours when I realised that I had added on non-unque index on
>> the name fields after the first load. By that point is was on line
>> 28301887, so ~0.75 done which implies it would have take ~15hours to
>> complete.
>>
>> Would the overhead of the index likely explain this decrease in
>> throughput?
>>
>> Impatience got the better of me and I killed the second COPY. This
>> time it had done 54% of the file in 6.75 hours, extrapolating to
>> roughly 12 hours to do the whole thing.
>>
>> That matches up with the java speed. Not sure if I should be elated
>> with jOOQ or disappointed with COPY.
>>
> This is not what I see with COPY FROM STDIN
>
> When I load 2million rows using a batch size of 1000 with plain JDBC that takes about 4 minutes
>
> Loading the same file through Java and COPY FROM STDIN takes about 4 seconds
>
> The table looks like this:
>
> Table "public.products"
> Column | Type | Modifiers
> -------------------+------------------------+-----------
> product_id | integer | not null
> ean_code | bigint | not null
> product_name | character varying(100) | not null
> manufacturer_name | character varying | not null
> price | numeric(10,2) | not null
> publish_date | date | not null
> Indexes:
> "products_pkey" PRIMARY KEY, btree (product_id)
> "idx_publish_date" btree (publish_date, product_id)
>
>
> During the load both indexes are present.
>
> Regards
> Thomas
>
>
Thomas thanks for these numbers.

I have to straighten out my environment, which I admit I was hoping to
avoid. I reset checkpoint_segments to 12 and restarted my server.
I kicked of the COPY at 19:00. That generated a couple of the "too
frequent" statements but 52 "WARNING: pgstat wait timeout" lines during
the next 8 hours starting at 00:37 (5 hours in) 'til finally keeling
over at 03:04 on line 37363768. That's the last line of the input so
obviously I didn't flush my last println properly. I'm beyond getting
embarrassed at this point.

Is turning auto-vacuum off a reasonable way through this?

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andrew Sullivan 2014-04-03 16:07:58 Re: Pagination count strategies
Previous Message Andy Colson 2014-04-03 15:58:42 Re: window function help