From: | Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, <tomas(dot)vondra(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com> |
Subject: | Re: On columnar storage (2) |
Date: | 2015-12-30 02:07:19 |
Message-ID: | 56833C57.1090400@BlueTreble.com |
Lists: | pgsql-hackers |
On 12/28/15 1:15 PM, Alvaro Herrera wrote:
> Currently within the executor
> a tuple is a TupleTableSlot which contains one Datum array, which has
> all the values coming out of the HeapTuple; but for split storage
> tuples, we will need to have a TupleTableSlot that has multiple "Datum
> arrays" (in a way --- because, actually, once we get to vectorise as in
> the preceding paragraph, we no longer have a Datum array, but some more
> complex representation).
>
> I think that trying to make the FDW API address all these concerns,
> while at the same time *also* serving the needs of external data
> sources, insanity will ensue.
Are you familiar with DataFrames in Pandas[1]? They're a collection of
Series[2], which are essentially vectors. (Technically, they're more
complex than that because you can assign arbitrary indexes). So instead
of the normal collection of rows, a DataFrame is a collection of
columns. Series are also sparse (like our tuples), but the sparse value
can be anything, not just NULL (or NaN in pandas-speak). There are also
DataFrames in R; I'm not sure how equivalent they are.
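To make the "collection of columns" idea concrete, here is a minimal sketch in plain Python. It does not use pandas; SparseColumn, fill_value and row() are hypothetical names made up for illustration, and the layout is only an assumption about what such a structure could look like.

```python
from typing import Any, Dict


class SparseColumn:
    """One column that stores only the entries differing from fill_value.

    Hypothetical illustration, not the pandas implementation.
    """

    def __init__(self, length: int, fill_value: Any = None) -> None:
        self.length = length
        self.fill_value = fill_value
        self._present: Dict[int, Any] = {}  # row position -> stored value

    def __setitem__(self, row: int, value: Any) -> None:
        if not 0 <= row < self.length:
            raise IndexError(row)
        if value == self.fill_value:
            self._present.pop(row, None)  # fill values are never stored
        else:
            self._present[row] = value

    def __getitem__(self, row: int) -> Any:
        if not 0 <= row < self.length:
            raise IndexError(row)
        return self._present.get(row, self.fill_value)

    def stored(self) -> int:
        """Number of values physically kept for this column."""
        return len(self._present)


# A "frame" is a mapping from column name to column, not a list of rows.
frame: Dict[str, SparseColumn] = {
    "qty": SparseColumn(5, fill_value=0),      # sparse value is 0, not NULL
    "note": SparseColumn(5, fill_value=None),  # sparse value is NULL-like
}
frame["qty"][1] = 7
frame["qty"][3] = 2
frame["note"][3] = "rush"


def row(columns: Dict[str, SparseColumn], i: int) -> Dict[str, Any]:
    """Reassemble a row on demand by probing each column at position i."""
    return {name: col[i] for name, col in columns.items()}
```

Rows only exist when something asks for them via row(); the storage itself is per column, and each column picks its own sparse value.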
I mention this because there's a lot being done with dataframes and they
might be a good basis for a columnstore API, killing two birds with one stone.
BTW, the underlying Python type for a Series is the ndarray[3], which is
specifically designed to interface with things like C arrays. So a column
store could potentially be accessed directly.
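As a rough sketch of what "accessed directly" could mean, the example below uses the standard library's array.array as a stand-in for an ndarray (an assumption made to keep it dependency-free; both expose their memory through Python's buffer protocol). The column name prices is hypothetical.

```python
import ctypes
from array import array

# One column of doubles, stored as a contiguous C array.
prices = array("d", [9.5, 11.25, 12.0])

# A memoryview exposes the same memory without copying it.
view = memoryview(prices)

# ctypes can map a C double[] onto that same buffer, which is the kind of
# object that could be handed to C code expecting a double pointer.
c_prices = (ctypes.c_double * len(prices)).from_buffer(prices)

# Writing through the C-side array is visible from the Python side,
# which shows that no copy was made.
c_prices[1] = 20.0
```

After the last line, prices[1] and view[1] both read 20.0, since all three names refer to the same block of memory.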
Aside from potential API inspiration, it might be useful to prototype a
columnstore using Series (or maybe ndarrays).
[1] http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
[2] http://pandas.pydata.org/pandas-docs/stable/api.html#series
[3] http://docs.scipy.org/doc/numpy-1.10.0/reference/internals.html
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com