Re: Use Postgres as a column store by creating one table per column

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: George Neuner <gneuner2(at)comcast(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Use Postgres as a column store by creating one table per column
Date: 2019-05-24 18:06:29
Message-ID: 20190524180629.42d3u4m7a3jgh5hq@development
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Thu, May 23, 2019 at 01:08:42AM -0400, George Neuner wrote:
>On Tue, 21 May 2019 21:28:07 -0700, Lev Kokotov
><lev(dot)kokotov(at)gmail(dot)com> wrote:
>
>>Is it efficient to use Postgres as a column store by creating one
>table per
>>column?
>>
>>I would query it with something like `[...] UNION SELECT value AS <table>
>>FROM <table> WHERE value = <value> UNION [...]` to build a row.
>
>I think you mean JOIN.
>
>You'd need more than that: Postgresql uses MVCC for concurrency, so
>whenever you update any row in a table, the ordering of the rows
>within the table changes.  And the JOIN operation inherently is
>unordered - you need to sort the result deliberately to control
>ordering.
>
>To emulate a column-store, at the very least you need a way to
>associate values from different "columns" that belong to the same
>"row" of the virtual table.  IOW, every value in every "column" needs
>an explicit "row" identifier.  E.g.,
>
>   col1 = { rowid, value1 }, col2 = { rowid, value2 }, ...
>
>For performance you would need to have indexes on at least the rowid
>in each of the "column" tables.
>
>This is a bare minimum and can only work if the columns of your
>virtual table and the queries against it are application controlled or
>statically known.  If you want to do something more flexible that will
>support ad hoc table modifications, elastically sized values (strings,
>bytes, arrays, JSON, XML), etc. this example is not suffice and the
>implementation can get very complicated very quickly
>
>
>Justin Pryzby was not joking when he said the performance could be
>awful ... at least as compared to a more normal row-oriented
>structure.  Performance of a query that involves more than a handful
>of "columns", in general, will be horrible.  It is up to you to decide
>whether some (maybe little) increase in performance in processing
>*single* columns will offset likely MASSIVE loss of performance in
>processing multiple columns.
>

Maybe take a look at this paper:

http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf

which essentially compares this approach to a "real" column store.

It certainly won't give you performance comparable to column store, it
adds quite a bit of overhead (disk space because of row headers, CPU
because of extra joins, etc.).

And it can't give you the column-store benefits - compression and/or
more efficient execution.

>
>>I'm thinking since Postgres stores tables in continuous blocks of
>16MB each
>>(I think that's the default page size?)
>
>Default page size is 8 KB.  You'd have to recompile to change that,
>and it might break something - a whole lot of code depends on the
>knowing the size of storage pages.
>
>

Right. And the largest page size is 64kB. But 8kB is a pretty good
trade-off, in most cases.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Stephen Frost 2019-05-24 19:38:15 Re: Use Postgres as a column store by creating one table per column
Previous Message Pavel Stehule 2019-05-24 17:12:13 Re: [HACKERS] proposal: schema variables