From: Kohei KaiGai <kaigai(at)heterodb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: [ANN] pg2arrow
Date: 2019-01-28 01:43:10
Message-ID: CAOP8fzaG+yy7fo7V7RtFCV0g3MEa2jtEf6ouNoG3_mWw7hupUg@mail.gmail.com
Lists: pgsql-general pgsql-hackers
Hello,
I made a utility program to dump a PostgreSQL database in Apache Arrow format.
Apache Arrow is a columnar format for structured data, actively developed by
the Spark community and others. It is a suitable representation for static,
read-only data with a large number of rows, and many data-analytics tools
support Apache Arrow as a common data-exchange format.
See https://arrow.apache.org/
* pg2arrow
https://github.com/heterodb/pg2arrow
usage:
$ ./pg2arrow -h localhost postgres -c 'SELECT * FROM hogehoge LIMIT 10000' -o /tmp/hogehoge.arrow
--> fetches the results of the query, then writes them out to "/tmp/hogehoge.arrow"
$ ./pg2arrow --dump /tmp/hogehoge.arrow
--> shows the schema definition of "/tmp/hogehoge.arrow"
$ python
>>> import pyarrow as pa
>>> X = pa.RecordBatchFileReader("/tmp/hogehoge.arrow").read_all()
>>> X.schema
id: int32
a: int64
b: double
c: struct<x: int32, y: double, z: decimal(30, 11), memo: string>
child 0, x: int32
child 1, y: double
child 2, z: decimal(30, 11)
child 3, memo: string
d: string
e: double
ymd: date32[day]
--> reads the Apache Arrow file using PyArrow, then shows its schema definition.
It is also groundwork for my current development, arrow_fdw, which allows
scanning the configured Apache Arrow file(s) like a regular PostgreSQL table.
I expect that integrating arrow_fdw with the SSD-to-GPU Direct SQL feature of
PG-Strom can pull out the maximum capability of the latest hardware (NVMe and
GPU). It is likely an ideal configuration for processing log data generated
by many sensors.
Please check it out.
Comments, ideas, bug reports, and other feedback are welcome.
As an aside, NVIDIA announced their RAPIDS framework for exchanging data
frames on GPUs among multiple ML/analytics solutions. It also uses Apache
Arrow as a common data-exchange format, so this is groundwork for RAPIDS
integration as well.
https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/
Thanks,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai(at)heterodb(dot)com>