Re: [SPAM] COPY command & binary format

From: Nicolas Paris <niparisco(at)gmail(dot)com>
To: Cat <cat(at)zip(dot)com(dot)au>
Cc: Moreno Andreo <moreno(dot)andreo(at)evolu-s(dot)it>, Forums postgresql <pgsql-general(at)postgresql(dot)org>
Subject: Re: [SPAM] COPY command & binary format
Date: 2016-05-14 13:47:07
Message-ID: CA+ssMOS8uiXyS2UkDB9P3hfOv+_pTt5=yLg0xZskO=-8nhF9Yw@mail.gmail.com
Lists: pgsql-general

Well, the job is done. The Talend component is working (
https://github.com/parisni/talend/tree/master/tPostgresqlOutputBulkAPHP).
It creates a file (binary or CSV) locally and then loads it with COPY ...
FROM STDIN, so the file never needs to be pushed onto the remote database
server.
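For reference, here is a minimal sketch of the "COPY FROM STDIN" approach the component uses, written against the PostgreSQL JDBC driver's CopyManager API. It assumes the org.postgresql driver is on the classpath and a database is reachable; the connection URL, credentials, table name, and file name are placeholders, not values from this thread:

```java
// Sketch only: requires the PostgreSQL JDBC driver (org.postgresql) on the
// classpath and a reachable database; all names below are placeholders.
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CopyFromStdin {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/mydb", "user", "secret")) {
            CopyManager copy = new CopyManager((BaseConnection) conn);
            // The file is streamed from the client process; it never has to
            // exist on the database server's filesystem.
            long rows = copy.copyIn(
                "COPY mytable FROM STDIN WITH (FORMAT binary)",
                new FileInputStream("local.bin"));
            System.out.println(rows + " rows loaded");
        }
    }
}
```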

I have made a little comparison test:

column1: character varying
column2: integer
column3: boolean
10 000 000 tuples

Type   | Create file time | Bulk load time | Total time | File size
Binary | 11137 ms         | 21661 ms       | 32798 ms   | 250 MB
CSV    | 23226 ms         | 22192 ms       | 45418 ms   | 179 MB

Binary format is definitely faster and safer:
- faster, because writing binary is faster than writing a text file. I guess
the bulk-load bottleneck is the network, so that part is equivalent for both
formats. When the file already sits on the database server, loading the
binary is twice as fast.
- safer, because the format (each value is preceded by its length) is more
robust than CSV, whose separators can also appear inside the text.
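To make the length-prefixed layout concrete, here is a small sketch that writes one tuple of the tested schema (varchar, integer, boolean) in PostgreSQL's binary COPY format: the 11-byte "PGCOPY\n\377\r\n\0" signature, two 32-bit header fields, a 16-bit field count per tuple, a 32-bit byte length before each value, and a 16-bit -1 trailer. The class and method names are illustrative, not from the linked component:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class PgBinaryWriter {
    // Write a complete binary COPY stream containing a single
    // (varchar, integer, boolean) tuple. All integers are big-endian,
    // which is what DataOutputStream emits natively.
    public static byte[] writeOneRow(String col1, int col2, boolean col3)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);

        // 11-byte signature: "PGCOPY\n" 0xFF "\r\n" 0x00
        out.writeBytes("PGCOPY\n");
        out.write(new byte[]{(byte) 0xFF, '\r', '\n', 0});
        out.writeInt(0);            // flags field
        out.writeInt(0);            // header extension area length

        // One tuple: 16-bit field count, then length-prefixed values.
        out.writeShort(3);
        byte[] text = col1.getBytes(StandardCharsets.UTF_8);
        out.writeInt(text.length);  // varchar: byte length, then the bytes
        out.write(text);
        out.writeInt(4);            // integer: always 4 bytes
        out.writeInt(col2);
        out.writeInt(1);            // boolean: 1 byte
        out.writeByte(col3 ? 1 : 0);

        out.writeShort(-1);         // file trailer
        out.flush();
        return buf.toByteArray();
    }
}
```

A NULL would be encoded as a field length of -1 with no value bytes, which is why separators in the data can never be confused with structure.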

The code is based on:

- https://github.com/uwescience/myria/blob/master/src/edu/washington/escience/myria/PostgresBinaryTupleWriter.java
- https://github.com/bytefish/PgBulkInsert/tree/master/PgBulkInsert/src/main/de/bytefish/pgbulkinsert/pgsql/handlers

Thanks,

2016-05-10 15:08 GMT+02:00 Cat <cat(at)zip(dot)com(dot)au>:

> On Tue, May 10, 2016 at 03:00:55PM +0200, Nicolas Paris wrote:
> > > The way I want is :
> > > csv -> binary -> postgresql
> > >
> > > Is this just to be quicker or are you going to add some business logic
> > > while converting CSV data?
> > > As you mentioned ETL, I assume the second, as I don't think that
> > > converting CSV to binary and then loading it to PostgreSQL will be more
> > > convenient than loading directly from CSV... as quicker as it can be,
> you
> > > have anyway to load data from CSV.
> > >
> > ​
> > Right, ETL process means huge business logic.
> > get the data (csv or other) -> transform it -> produce a binary -> copy
> > from binary from stdin ​
> >
> > Producing 100 GB CSVs is a waste of time.
>
> Ah. You need to fiddle with the data. Then you need to weigh the pros of
> something agnostic to Postgres's internals to something that needs to be
> aware of them.
>
> You will need to delve into the source code for data types more complex
> than INTEGER, TEXT and BYTEA (which was the majority of my data when I
> was just looking into it).
>
> --
> "A search of his car uncovered pornography, a homemade sex aid, women's
> stockings and a Jack Russell terrier."
> -
> http://www.dailytelegraph.com.au/news/wacky/indeed/story-e6frev20-1111118083480
>
