Re: Zedstore - compressed in-core columnar storage

From:	Ajin Cherian <itsajin(at)gmail(dot)com>
To:	Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
Cc:	PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Zedstore - compressed in-core columnar storage
Date:	2019-05-24 02:30:19
Message-ID:	CAFPTHDa93qjCWMqJ6-pJj1RSU5uUg9EKFim9OX1nSmMp7e08aw@mail.gmail.com
Lists:	pgsql-hackers

Hi Ashwin,

> - how to pass the "column projection list" to table AM? (as stated in
> initial email, currently we have modified table am API to pass the
> projection to AM)

We have been working on a similar columnar storage using the pluggable
table AM APIs. One idea we considered is to modify the scan slot, based on
the targetlist, so that the scan descriptor contains only the relevant
columns. The table AM is then passed a slot with only the relevant columns
in its descriptor. Today we do something similar for the result slot using
ExecInitResultTypeTL(); the proposal is to do the same for the scan tuple
slot. So, somewhere after creating the scan slot with
ExecInitScanTupleSlot(), call a new table AM handler API that modifies the
scan tuple slot based on the targetlist. A probable name for the new
handler would be:
exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).
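
To make the idea concrete, here is a minimal standalone sketch of the
descriptor-trimming step. It is not PostgreSQL code: ToyTupleDesc, ToySlot,
toy_make_slot() and toy_init_scan_slot_tl() are hypothetical stand-ins for
the real TupleDesc, TupleTableSlot and the proposed handler, and the
"targetlist" is reduced to a plain array of attribute numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ATTRS 32

/* Toy stand-in for a tuple descriptor: only the table attribute numbers. */
typedef struct ToyTupleDesc {
    int natts;
    int attnums[TOY_MAX_ATTRS];   /* 1-based attribute numbers */
} ToyTupleDesc;

/* Toy stand-in for a scan tuple slot: carries only its descriptor. */
typedef struct ToySlot {
    ToyTupleDesc desc;
} ToySlot;

/* Build a slot whose descriptor covers attributes 1..natts (all columns). */
ToySlot toy_make_slot(int natts)
{
    ToySlot slot;

    assert(natts >= 0 && natts <= TOY_MAX_ATTRS);
    slot.desc.natts = natts;
    for (int i = 0; i < natts; i++)
        slot.desc.attnums[i] = i + 1;
    return slot;
}

/*
 * Hypothetical analogue of the proposed exec_init_scan_slot_tl():
 * shrink the slot's descriptor in place so that it keeps only the
 * attributes listed in needed[]. Duplicates in needed[] are harmless,
 * and the original column order is preserved. Returns the new natts.
 */
int toy_init_scan_slot_tl(ToySlot *slot, const int *needed, int nneeded)
{
    bool wanted[TOY_MAX_ATTRS + 1] = { false };
    int  kept = 0;

    assert(slot != NULL && slot->desc.natts <= TOY_MAX_ATTRS);

    for (int i = 0; i < nneeded; i++) {
        assert(needed[i] >= 1 && needed[i] <= TOY_MAX_ATTRS);
        wanted[needed[i]] = true;
    }

    /* Compact the descriptor, keeping only the wanted attributes. */
    for (int i = 0; i < slot->desc.natts; i++) {
        int attnum = slot->desc.attnums[i];

        if (wanted[attnum])
            slot->desc.attnums[kept++] = attnum;
    }
    slot->desc.natts = kept;
    return kept;
}
```

In this model, a five-column table scanned by a query that needs columns 2
and 4 ends up with a two-attribute descriptor; the AM's scan callbacks would
then only see those two columns.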

This way, scan AM handlers such as getnextslot are passed a slot that has
only the relevant columns in the scan descriptor. One issue, though, is
that beginscan is not passed the slot, so any memory allocation that
depends on the column list can't be done in beginscan.
Let me know what you think.

regards,
Ajin Cherian
Fujitsu Australia

On Thu, May 23, 2019 at 3:56 PM Ashwin Agrawal <aagrawal(at)pivotal(dot)io> wrote:

>
> We (Heikki, me and Melanie) are continuing to build Zedstore. Wish to
> share the recent additions and modifications. Attaching a patch
> with the latest code. Link to github branch [1] to follow
> along. The approach we have been leaning towards is to build the required
> functionality, get the tests passing, and then continue to iterate to
> optimize. It's still work-in-progress.
>
> Sharing the details now, as we have reached our next milestone for
> Zedstore. All table AM APIs are implemented for Zedstore (except
> compute_xid_horizon_for_tuples, which seems to need a test first).
>
> Current State:
>
> - A new type of item added to Zedstore "Array item", to boost
> compression and performance. Based on Konstantin's performance
> experiments [2] and inputs from Tomas Vondra [3], this is
> added. Array item holds multiple datums, with consecutive TIDs and
> the same visibility information. An array item saves space compared
> to multiple single items, by leaving out repetitive UNDO and TID
> fields. An array item cannot mix NULLs and non-NULLs. So, those
> experiments should result in improved performance now. Inserting
> data via COPY creates array items currently. Code for insert has not
> been modified from last time. Making singleton inserts, or insert
> into select, performant is still on the todo list.
>
> - Now we have a separate and dedicated meta-column btree alongside
> rest of the data column btrees. This special or first btree for
> meta-column is used to assign TIDs for tuples, track the UNDO
> location which provides visibility information. Also, this special
> btree, which always exists, helps to support zero-column tables
> (which can be a result of ADD COLUMN DROP COLUMN actions as
> well). Plus, having meta-data stored separately from data, helps to
> get better compression ratios. And also helps to further simplify
> the overall design/implementation as for deletes just need to edit
> the meta-column and avoid touching the actual data btrees. Index
> scans can just perform visibility checks based on this meta-column
> and fetch required datums only for visible tuples. For tuple locks
> also just need to access this meta-column only. Previously, every
> column btree used to carry the same undo pointer. Thus visibility
> check could be potentially performed, with the past layout, using
> any column. But considering overall simplification new layout
> provides it's fine to give up on that aspect. Having dedicated
> meta-column highly simplified handling for add columns with default
> and null values, as this column deterministically provides all the
> TIDs present in the table, which can't be said for any other data
> columns due to default or null values during add column.
>
> - Free Page Map implemented. The Free Page Map keeps track of unused
> pages in the relation. The FPM is also a b-tree, indexed by physical
> block number. To be more compact, it stores "extents", i.e. block
> ranges, rather than just blocks, when possible. An interesting paper [4]
> on how modern filesystems manage space acted as a good source for ideas.
>
> - Tuple locks implemented
>
> - Serializable isolation handled
>
> - With "default_table_access_method=zedstore"
>   - 31 out of 194 regress tests failing
>   - 10 out of 86 isolation tests failing
>   Many of the currently failing tests are due to plan differences, like
>   Index scans selected for zedstore over IndexOnly scans, as zedstore
>   doesn't yet have a visibility map. I have yet to give thought to
>   index-only scans. Others are plan diffs due to table size differences
>   between heap and zedstore.
>
> Next few milestones we wish to hit for Zedstore:
> - Make check regress green
> - Make check isolation green
> - Zedstore crash safe (means also replication safe). Implement WAL
> logs
> - Performance profiling and optimizations for Insert, Selects, Index
> Scans, etc...
> - Once UNDO framework lands in Upstream, Zedstore leverages it instead
> of its own version of UNDO
>
> Open questions / discussion items:
>
> - how best to get "column projection list" from planner? (currently,
> we walk plan and find the columns required for the query in
> the executor, refer GetNeededColumnsForNode())
>
> - how to pass the "column projection list" to table AM? (as stated in
> initial email, currently we have modified table am API to pass the
> projection to AM)
>
> - TID treated as (block, offset) in current indexing code
>
> - Physical tlist optimization? (currently, we disabled it for
> zedstore)
>
> Team:
> Melanie joined Heikki and me to write code for zedstore. Majority of
> the code continues to be contributed by Heikki. We are continuing to
> have fun building the column store implementation and iterating
> aggressively.
>
> References:
> [1] https://github.com/greenplum-db/postgres/tree/zedstore
> [2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
> [3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
> [4] https://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
>
>
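
The "array item" rule in the quoted message (consecutive TIDs, same
visibility information, no mixing of NULLs and non-NULLs) can be sketched
as a simple run-grouping pass. This is only an illustration under assumed
types, not Zedstore's actual item layout: ToyItem, ToyArrayItem,
toy_can_extend() and toy_pack_array_items() are hypothetical names, and the
UNDO pointer is modelled as a plain integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy single item: one datum with its own TID and UNDO pointer. */
typedef struct ToyItem {
    uint64_t tid;
    uint64_t undo_ptr;   /* stands in for the visibility information */
    bool     isnull;
    int32_t  datum;
} ToyItem;

/* Toy array item: TID and UNDO pointer are stored once for the whole run. */
typedef struct ToyArrayItem {
    uint64_t first_tid;
    uint64_t undo_ptr;
    bool     isnull;
    size_t   nelements;
    size_t   first_index;   /* index of the run's first datum in the input */
} ToyArrayItem;

/* An item may join a run only if all three conditions from the text hold. */
bool toy_can_extend(const ToyArrayItem *arr, const ToyItem *item)
{
    return item->tid == arr->first_tid + arr->nelements   /* consecutive TID */
        && item->undo_ptr == arr->undo_ptr                 /* same visibility */
        && item->isnull == arr->isnull;                    /* no NULL mixing */
}

/* Group items (assumed sorted by TID) into array items; returns the count. */
size_t toy_pack_array_items(const ToyItem *items, size_t nitems,
                            ToyArrayItem *out, size_t out_cap)
{
    size_t nout = 0;

    for (size_t i = 0; i < nitems; i++) {
        if (nout > 0 && toy_can_extend(&out[nout - 1], &items[i])) {
            out[nout - 1].nelements++;
            continue;
        }
        /* Start a new run. */
        assert(nout < out_cap);
        out[nout].first_tid = items[i].tid;
        out[nout].undo_ptr = items[i].undo_ptr;
        out[nout].isnull = items[i].isnull;
        out[nout].nelements = 1;
        out[nout].first_index = i;
        nout++;
    }
    return nout;
}
```

The space saving comes from the fact that a run of N datums stores the TID
and UNDO fields once instead of N times. A bulk load such as COPY, where
rows get consecutive TIDs under one transaction, tends to produce long runs.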

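The extent-based Free Page Map described in the quoted message can be
illustrated in the same spirit. The sketch below swaps the b-tree for a
small sorted array, which is enough to show how freeing a block merges it
into neighbouring extents. ToyExtent, ToyFreeMap, toy_fpm_free_block() and
toy_fpm_alloc_block() are hypothetical names, not Zedstore functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_EXTENTS 64

/* A run of free blocks: [start, start + len). */
typedef struct ToyExtent {
    uint32_t start;
    uint32_t len;
} ToyExtent;

/* Extents kept sorted by start block (a sorted array instead of a b-tree). */
typedef struct ToyFreeMap {
    ToyExtent ext[TOY_MAX_EXTENTS];
    size_t    n;
} ToyFreeMap;

/* Record blk as free, merging with adjacent extents where possible. */
void toy_fpm_free_block(ToyFreeMap *map, uint32_t blk)
{
    size_t pos = 0;

    /* Find the first extent that starts after blk. */
    while (pos < map->n && map->ext[pos].start <= blk)
        pos++;

    /* The block must not already be recorded as free. */
    assert(pos == 0 || map->ext[pos - 1].start + map->ext[pos - 1].len <= blk);

    bool merge_prev = pos > 0 &&
        map->ext[pos - 1].start + map->ext[pos - 1].len == blk;
    bool merge_next = pos < map->n && blk + 1 == map->ext[pos].start;

    if (merge_prev && merge_next) {
        /* blk bridges two extents: fold the next one into the previous. */
        map->ext[pos - 1].len += 1 + map->ext[pos].len;
        memmove(&map->ext[pos], &map->ext[pos + 1],
                (map->n - pos - 1) * sizeof(ToyExtent));
        map->n--;
    } else if (merge_prev) {
        map->ext[pos - 1].len++;
    } else if (merge_next) {
        map->ext[pos].start = blk;
        map->ext[pos].len++;
    } else {
        /* No neighbour: insert a new single-block extent at pos. */
        assert(map->n < TOY_MAX_EXTENTS);
        memmove(&map->ext[pos + 1], &map->ext[pos],
                (map->n - pos) * sizeof(ToyExtent));
        map->ext[pos].start = blk;
        map->ext[pos].len = 1;
        map->n++;
    }
}

/* Hand out the lowest free block; returns false if the map is empty. */
bool toy_fpm_alloc_block(ToyFreeMap *map, uint32_t *blk)
{
    if (map->n == 0)
        return false;

    *blk = map->ext[0].start;
    map->ext[0].start++;
    if (--map->ext[0].len == 0) {
        memmove(&map->ext[0], &map->ext[1],
                (map->n - 1) * sizeof(ToyExtent));
        map->n--;
    }
    return true;
}
```

Freeing blocks 5, 7 and then 6 leaves a single extent covering 5 through 7
rather than three separate entries, which is the compactness argument made
for storing ranges instead of individual blocks.
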
In response to

Re: Zedstore - compressed in-core columnar storage at 2019-05-23 00:07:45 from Ashwin Agrawal

Responses

Re: Zedstore - compressed in-core columnar storage at 2019-05-24 22:37:08 from Ashwin Agrawal
