From: | Nicolas Paris <nicolas(dot)paris(at)riseup(dot)net> |
---|---|
To: | pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | announce: spark-postgres 3 released |
Date: | 2019-11-11 00:05:36 |
Message-ID: | 20191111000536.4vuo3wlmqkv3wojd@riseup.net |
Lists: | pgsql-general |
Hello postgres users,
Spark-postgres is designed for reliable and performant ETL in big-data
workloads and offers read/write/SCD capabilities to better bridge Spark and
Postgres. Version 3 introduces a datasource API. It outperforms
Sqoop by a factor of 8 and the Apache Spark core JDBC by infinity.
Features:
- use of pg COPY statements
- parallel reads/writes
- use of hdfs to store intermediary csv
- reindex after bulk-loading
- SCD1 computations done on the spark side
- use unlogged tables when needed
- handle arrays and multiline string columns
- useful jdbc functions (ddl, updates...)
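
To make the COPY-related items above more concrete, here is a minimal
sketch in Python of how rows containing arrays and multiline strings can be
turned into CSV that COPY accepts. It is not taken from spark-postgres; the
helper names (pg_array_literal, csv_field, rows_to_copy_csv, copy_statement)
are hypothetical, nested arrays are not handled, and table and column names
are assumed to be trusted identifiers.

```python
def pg_array_literal(values):
    """Render a flat list as a PostgreSQL array literal, e.g. {"a","b"}."""
    parts = []
    for v in values:
        if v is None:
            parts.append("NULL")  # unquoted NULL is a null array element
        else:
            # Inside a double-quoted element, backslash escapes \ and "
            s = str(v).replace("\\", "\\\\").replace('"', '\\"')
            parts.append('"' + s + '"')
    return "{" + ",".join(parts) + "}"


def csv_field(value):
    """Encode one value as a CSV field for COPY ... WITH (FORMAT csv)."""
    if value is None:
        return ""  # unquoted empty field: read as NULL by default
    if isinstance(value, (list, tuple)):
        value = pg_array_literal(value)
    text = str(value)
    # Quote empty strings (to keep them distinct from NULL) and any field
    # containing a delimiter, a quote, or a line break.
    if text == "" or any(c in text for c in ',"\n\r'):
        return '"' + text.replace('"', '""') + '"'
    return text


def rows_to_copy_csv(rows):
    """Serialize an iterable of row tuples into one CSV payload."""
    return "".join(",".join(csv_field(v) for v in row) + "\n" for row in rows)


def copy_statement(table, columns):
    """Build the COPY statement that would consume the CSV payload."""
    return f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"


if __name__ == "__main__":
    rows = [(1, "line one\nline two", ["a", "b"]), (2, None, [])]
    print(copy_statement("notes", ["id", "body", "tags"]))
    print(rows_to_copy_csv(rows), end="")
```

The multiline string survives because the whole field is quoted, so the
embedded line break is part of the value rather than a row separator. For
parallel writes, each partition could produce its own payload like this and
stream it through a separate connection.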
The official repository:
https://framagit.org/parisni/spark-etl/tree/master/spark-postgres
And its mirror on Microsoft GitHub:
https://github.com/EDS-APHP/spark-etl/tree/master/spark-postgres
--
nicolas