Re: New Copy Formats - avro/orc/parquet

From:	Nicolas Paris <niparisco(at)gmail(dot)com>
To:	Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject:	Re: New Copy Formats - avro/orc/parquet
Date:	2018-02-11 22:02:36
Message-ID:	20180211220236.nwskn6nnrpe3zvyf@gmail.com
Lists:	pgsql-general

On 11 Feb 2018 at 22:19, Adrian Klaver wrote:
> On 02/11/2018 12:57 PM, Nicolas Paris wrote:
> > On 11 Feb 2018 at 21:53, Andres Freund wrote:
> > > On 2018-02-11 21:41:26 +0100, Nicolas Paris wrote:
> > > > I also have the storage and network transfer overhead in mind:
> > > > All those new formats are compressed; this is not true for the current
> > > > postgres BINARY format and obviously not for text-based formats. In my
> > > > experience, the binary format is 10 to 30% larger than the text one. By
> > > > contrast, an ORC file can be up to 10 times smaller than a text-based
> > > > format.
> > >
> > > That seems largely irrelevant when arguing about using PROGRAM though,
> > > right?
> > >
> >
> > Indeed, those storage and network transfers are only considered versus
> > the CSV/BINARY formats. No link with the PROGRAM aspect.
> >
>
> Just wondering what your time frame is on this? Asking because this would be
> considered a new feature and so would need to be added to a major release of
> Postgres. Currently work is going on for Postgres version 11, to be
> released (just a guess) late Fall 2018/early Winter 2019. The
> CommitFest (https://commitfest.postgresql.org/) for this release is currently
> approximately 3/4 of the way through. Not sure that new code could make it
> in at this point. This means it would be bumped to version 12 for 2019/2020.
>

Right now, exporting data (billions of rows by hundreds of columns) from
postgres to distributed tools such as spark is feasible, but it comes with
parsing, transfer, tooling and workaround overhead.

Waiting until 2020 to get the opportunity to write COPY extensions would
mean using this feature around 2022. I mean, writing the ORC COPY
extension, extending the postgres JDBC driver, and extending the spark jdbc
connector, all from different communities: this will be a long process.

But again, postgres would be the most advanced RDBMS, because AFAIK no
DB deals with those distributed formats at the moment. Knowing that such
a feature will be released one day makes it possible to plan the place of
postgres in a data warehouse architecture accordingly.

In response to

Re: New Copy Formats - avro/orc/parquet at 2018-02-11 21:19:31 from Adrian Klaver
