Re: [SPAM] COPY command & binary format

From: Nicolas Paris <niparisco(at)gmail(dot)com>
To: Cat <cat(at)zip(dot)com(dot)au>
Cc: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>, Forums postgresql <pgsql-general(at)postgresql(dot)org>
Subject: Re: [SPAM] COPY command & binary format
Date: 2016-05-14 13:47:07
Message-ID: CA+ssMOS8uiXyS2UkDB9P3hfOv+_pTt5=yLg0xZskO=-8nhF9Yw@mail.gmail.com
Lists: pgsql-general

Well, the job is done. The Talend component is working (
https://github.com/parisni/talend/tree/master/tPostgresqlOutputBulkAPHP).
It creates a file (binary or CSV) locally and then loads it with COPY ...
FROM STDIN, so the file never needs to be pushed onto the remote database
server.
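For reference, here is a minimal sketch of the "COPY FROM STDIN" approach the component uses, written against the PostgreSQL JDBC driver's CopyManager API. It assumes the org.postgresql driver is on the classpath and a database is reachable; the connection URL, credentials, table name, and file name are placeholders, not values from this thread:

```java
// Sketch only: requires the PostgreSQL JDBC driver (org.postgresql) on the
// classpath and a reachable database; all names below are placeholders.
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CopyFromStdin {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/mydb", "user", "secret")) {
            CopyManager copy = new CopyManager((BaseConnection) conn);
            // The file is streamed from the client process; it never has to
            // exist on the database server's filesystem.
            long rows = copy.copyIn(
                "COPY mytable FROM STDIN WITH (FORMAT binary)",
                new FileInputStream("local.bin"));
            System.out.println(rows + " rows loaded");
        }
    }
}
```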

I have made a little comparison test:

column1: character varying
column2: integer
column3: boolean
10 000 000 tuples

Type   | Create file time | Bulk load time | Total time | File size
Binary | 11137 ms         | 21661 ms       | 32798 ms   | 250 MB
CSV    | 23226 ms         | 22192 ms       | 45418 ms   | 179 MB

Binary format is definitely faster and safer:
- faster, because writing binary is faster than writing a text file. I guess
the bulk-load bottleneck is the network, so that part is equivalent for both
formats. When the file already sits on the database server, loading the
binary is twice as fast.
- safer, because the format (each value is preceded by its length) is more
robust than CSV, whose separators can also appear inside the text.
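To make the length-prefixed layout concrete, here is a small sketch that writes one tuple of the tested schema (varchar, integer, boolean) in PostgreSQL's binary COPY format: the 11-byte "PGCOPY\n\377\r\n\0" signature, two 32-bit header fields, a 16-bit field count per tuple, a 32-bit byte length before each value, and a 16-bit -1 trailer. The class and method names are illustrative, not from the linked component:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class PgBinaryWriter {
    // Write a complete binary COPY stream containing a single
    // (varchar, integer, boolean) tuple. All integers are big-endian,
    // which is what DataOutputStream emits natively.
    public static byte[] writeOneRow(String col1, int col2, boolean col3)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);

        // 11-byte signature: "PGCOPY\n" 0xFF "\r\n" 0x00
        out.writeBytes("PGCOPY\n");
        out.write(new byte[]{(byte) 0xFF, '\r', '\n', 0});
        out.writeInt(0);            // flags field
        out.writeInt(0);            // header extension area length

        // One tuple: 16-bit field count, then length-prefixed values.
        out.writeShort(3);
        byte[] text = col1.getBytes(StandardCharsets.UTF_8);
        out.writeInt(text.length);  // varchar: byte length, then the bytes
        out.write(text);
        out.writeInt(4);            // integer: always 4 bytes
        out.writeInt(col2);
        out.writeInt(1);            // boolean: 1 byte
        out.writeByte(col3 ? 1 : 0);

        out.writeShort(-1);         // file trailer
        out.flush();
        return buf.toByteArray();
    }
}
```

A NULL would be encoded as a field length of -1 with no value bytes, which is why separators in the data can never be confused with structure.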

The code is based on:

- https://github.com/uwescience/myria/blob/master/src/edu/washington/escience/myria/PostgresBinaryTupleWriter.java
- https://github.com/bytefish/PgBulkInsert/tree/master/PgBulkInsert/src/main/de/bytefish/pgbulkinsert/pgsql/handlers

Thanks,

2016-05-10 15:08 GMT+02:00 Cat <cat(at)zip(dot)com(dot)au>:

> On Tue, May 10, 2016 at 03:00:55PM +0200, Nicolas Paris wrote:
> > > The way I want is :
> > > csv -> binary -> postgresql
> > >
> > > Is this just to be quicker or are you going to add some business logic
> > > while converting CSV data?
> > > As you mentioned ETL, I assume the second, as I don't think that
> > > converting CSV to binary and then loading it to PostgreSQL will be more
> > > convenient than loading directly from CSV... as quicker as it can be,
> you
> > > have anyway to load data from CSV.
> > >
> > ​
> > Right, ETL process means huge business logic.
> > get the data (csv or other) -> transform it -> produce a binary -> copy
> > from binary from stdin ​
> >
> > Producing 100 GB CSVs is a waste of time.
>
> Ah. You need to fiddle with the data. Then you need to weigh the pros of
> something agnostic to Postgres's internals to something that needs to be
> aware of them.
>
> You will need to delve into the source code for data types more complex
> than INTEGER, TEXT and BYTEA (which was the majority of my data when I
> was just looking into it).
>
> --
> "A search of his car uncovered pornography, a homemade sex aid, women's
> stockings and a Jack Russell terrier."
> -
> http://www.dailytelegraph.com.au/news/wacky/indeed/story-e6frev20-1111118083480
>
