Quick Links

Re: COPY v. java performance comparison

From:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To:	Rob Sargent <robjsargent(at)gmail(dot)com>
Cc:	PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject:	Re: COPY v. java performance comparison
Date:	2014-04-02 22:36:46
Message-ID:	CAMkU=1xRuwr92amUVjFZA02M_iegY9PZApMqTiu+Qi3tNfBTmA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsargent(at)gmail(dot)com> wrote:

> I'm playing with various data models to compare performance and
> practicalities and not sure if I should be surprised by the numbers I'm
> getting. I hope this report isn't too wishy-washy for reasoned comment.
>
> One model says a genotype is defined as follows:
>
> Table "public.oldstyle"
> +-------------+--------------+-----------+
> | Column | Type | Modifiers |
> +-------------+--------------+-----------+
> | id | uuid | not null |
> | sample_name | text | not null |
> | marker_name | text | not null |
> | allele1 | character(1) | |
> | allele2 | character(1) | |
> +-------------+--------------+-----------+
> (0. id is a Primary Key)
> (1. Take what you will from the table name.)
> (2. I hadn't thought of "char" type at this point)
> (3. Ultimately the names would become ids, RI included)
> (4. We're loading 39 samples and ~950K markers)
>
> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours
> (800+ records/sec). Then I tried COPY and killed that after 11.25 hours
> when I realised that I had added on non-unque index on the name fields
> after the first load. By that point is was on line 28301887, so ~0.75 done
> which implies it would have take ~15hours to complete.
>
> Would the overhead of the index likely explain this decrease in throughput?
>

Absolutely.

>
> Impatience got the better of me and I killed the second COPY. This time
> it had done 54% of the file in 6.75 hours, extrapolating to roughly 12
> hours to do the whole thing.
>

Are you sure you actually dropped the indices? (And the primary key?)

I get about 375,000 lines per second with no indexes, triggers, constraints.

perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t",
"a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate
oldstyle; copy oldstyle from stdin;'

(More if I truncate it in the same transaction as the copy)

If you can't drop the pk constraint, can you at least generate the values
in sort-order?

Cheers,

Jeff

In response to

COPY v. java performance comparison at 2014-04-02 19:37:24 from Rob Sargent

Responses

Re: COPY v. java performance comparison at 2014-04-02 22:46:06 from Rob Sargent
Re: COPY v. java performance comparison at 2014-04-03 00:11:19 from Rob Sargent

Browse pgsql-general by date

	From	Date	Subject
Next Message	Rob Sargent	2014-04-02 22:46:06	Re: COPY v. java performance comparison
Previous Message	Shaun Thomas	2014-04-02 22:09:41	Re: SSD Drives