Re: Turbo ODBC

From: Wes McKinney <wesmckinn(at)gmail(dot)com>
To: Matthew Rocklin <mrocklin(at)continuum(dot)io>
Cc: Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>, "psycopg(at)postgresql(dot)org" <psycopg(at)postgresql(dot)org>, Uwe Korn <uwelk(at)xhochy(dot)com>
Subject: Re: Turbo ODBC
Date: 2017-01-17 01:20:48
Message-ID: CAJPUwMA2gnH+ebzGpWfKJL1xa5r2VFcdHUibK_aiLLccDQPc6g@mail.gmail.com
Lists: psycopg

Adding Uwe Korn -- he's been working on adding C++-level Arrow support
to TurbODBC:

https://github.com/blue-yonder/turbodbc/pull/26

We're using Arrow data structures as an efficient (column-oriented)
intermediary en route to pandas -- you could also implement normal
Python tuple iteration on top of Arrow record batches.

The main Python-compatible Arrow implementation is in C++11, so if you
wanted to use it from C you'd need to do a little work to create C
bindings.
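
For illustration, here's a minimal sketch of that path -- assuming a
recent pyarrow API; the column names and values are made up:

    import pyarrow as pa

    # Build a column-oriented record batch (the kind of structure the
    # turbodbc Arrow support hands back).
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(['a', 'b', 'c'])],
        names=['id', 'name'])

    # Hand the columns to pandas...
    df = batch.to_pandas()

    # ...or do plain Python tuple iteration on top of the same batch.
    for row in zip(*(col.to_pylist() for col in batch.columns)):
        print(row)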

On Mon, Jan 16, 2017 at 6:24 PM, Matthew Rocklin <mrocklin(at)continuum(dot)io> wrote:
> CC'ing Wes McKinney in case he's interested in this problem (how do we
> efficiently move data from Postgres into Python).
>
> On Mon, Jan 16, 2017 at 6:14 PM, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com> wrote:
>>
>> On 1/15/17 3:25 PM, Adrian Klaver wrote:
>>>
>>> Have you looked at asyncpg:
>>
>>
>> I'm pulling Matt Rocklin in, who I've been working with on some plpython
>> improvements. Here's his offlist replies to me:
>>
>>> On 1/15/17 5:15 PM, Matthew Rocklin wrote:
>>>>
>>>> You might also contact Wes McKinney and check out
>>>> Arrow: http://pyarrow.readthedocs.io/en/latest/
>>>>
>>>> This project was designed for inter-system data interchange. It would
>>>> help with conversion to Spark and other popular tools as well.
>>>>
>>>> On Sun, Jan 15, 2017 at 6:01 PM, Matthew Rocklin <mrocklin(at)continuum(dot)io
>>>> <mailto:mrocklin(at)continuum(dot)io>> wrote:
>>>>
>>>> I haven't taken a look at asyncpg.
>>>>
>>>> Quick note: you might want to be careful about mixing up HDFS and
>>>> HDF5. Postgres's competition for data storage is HDF5, not HDFS
>>>> (which stands for the Hadoop Distributed File System).
>>>>
>>>> I still think that the best thing to do here would be to dump out
>>>> Python arrays for each of the columns in a result set. I suspect
>>>> that you could handily beat /any/ system that doesn't do this. This
>>>> would avoid the performance pain of building up Python objects, and
>>>> could continue to use just the standard library.
>>>>
>>>> You can stick to Pure Python and still send arrays encoded with
>>>> bytes. We can convert these to NumPy or Pandas trivially with zero
>>>> overhead.
>>>>
>>>> In [1]: from array import array # this is in the standard library
>>>> In [2]: b = b'\x00' * 16 # 16 zeroes as a bytestring
>>>> In [3]: a = array('L', b) # eight byte unsigned integer
>>>> In [4]: a
>>>> Out[4]: array('L', [0, 0])
>>>>
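>>>> (A rough sketch of that zero-copy step, assuming NumPy and pandas are
>>>> available; note that 'Q' is the array typecode that is always 8 bytes,
>>>> while 'L' is platform-dependent:)
>>>>
>>>> import numpy as np
>>>> import pandas as pd
>>>>
>>>> col = np.frombuffer(a, dtype=np.uint64)  # zero-copy view of `a` above
>>>> s = pd.Series(col)                       # pandas wraps the ndarray, no copy
>>>>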
>>>> We would only be bound by
>>>>
>>>> 1. The cost in postgres to transpose the data from row-major to
>>>> column-major
>>>> 2. The cost to move bytes across a socket
>>>>
>>>> This should run at gigabytes per second. Everything else is likely
>>>> to be competing in the 50-100 MB/s range at top speed, is my guess.
>>
>>
>> My guess is that the big gains for external communication would come from:
>>
>> 1) reducing latency effects
>> 2) reducing data marshaling
>>
>> 1 could be accomplished in various ways, and some changes to the FE-BE
>> protocol might be helpful. If we can demonstrate a clear win we might be
>> able to get protocol changes. I do suspect this needs to be tunable,
>> though. My 4x improvement to plpy.execute comes from changing from the
>> equivalent of one large batch to single-row operation. Moving data to
>> Python over a local filesystem socket would be hurt by a large batch
>> size, while something on a high-latency connection would presumably
>> benefit from moderately large batches, since that reduces overall TCP
>> overhead.
>>
>> I think 2 would necessitate FE-BE changes, although some of that could be
>> tested without the changes. Right now, everything is going to get marshaled
>> into a simple text format (i.e. ints become a string of digits), but for test
>> purposes there are ways you could run that through the binary output functions.
>> In the case of int4, you'd get 4 bytes in network order, which should be
>> faster to handle and could be passed directly to something like an ndarray.
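>>
>> (As an illustration of that binary path -- a sketch assuming NumPy; the
>> byte string is a made-up column of three int4 values:)
>>
>> import numpy as np
>>
>> buf = b'\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03'
>> col = np.frombuffer(buf, dtype='>i4')  # big-endian (network order), no parsing
>> col = col.astype(np.int32)             # one pass to native byte order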
>>
>> The two problems you run into with the array type are handling NULLs and
>> building columns from rows. You could allow the option of specifying that
>> NULLs won't be allowed in specific fields, or you could use something like a
>> Pandas Series that provides other ways of handling the equivalent of NULL.
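>>
>> (One possible sketch of the Series approach -- assuming NumPy/pandas, with
>> a made-up values array and NULL mask coming back from the server:)
>>
>> import numpy as np
>> import pandas as pd
>>
>> values = np.array([1, 2, 3, 4], dtype=np.int32)
>> nulls = np.array([False, True, False, False])     # per-row NULL flags
>> s = pd.Series(np.where(nulls, np.nan, values))    # floats; NaN marks NULL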
>>
>> For the row-to-column conversion, there's simply no way to get around the
>> pattern of appending items one at a time to an array-like container (list,
>> ndarray, Series, etc.), and dealing with the memory allocation problem. I've
>> looked at Python's list code and it's certainly no worse at that than what
>> Postgres would do. The big reason I would do that work in Python, though, is
>> that you then have the ability to use any number of types for it; they just
>> have to support the equivalent of append().
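>>
>> (A small sketch of that append pattern -- list and array.array shown, but
>> anything with an append()-equivalent works:)
>>
>> from array import array
>>
>> rows = [(1, 'a'), (2, 'b'), (3, 'c')]   # stand-in for the row source
>> ids, names = array('i'), []             # one container per column
>> for id_, name in rows:
>>     ids.append(id_)                     # array.array supports append()
>>     names.append(name)                  # so does a plain list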
>>
>> --
>> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
>> Experts in Analytics, Data Architecture and PostgreSQL
>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>> 855-TREBLE2 (855-873-2532)
>
>
