Re: Zedstore - compressed in-core columnar storage

From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Alexandra Wang <lewang(at)pivotal(dot)io>, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, DEV_OPS <devops(at)ww-it(dot)cn>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-09-17 11:15:11
Message-ID: CAE9k0PmpREo_xtQb_CqTFCKetmHv1LfMHWKpHgOA7geabLCnzQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 29, 2019 at 5:39 PM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 29/08/2019 14:30, Ashutosh Sharma wrote:
> >
> > On Wed, Aug 28, 2019 at 5:30 AM Alexandra Wang <lewang(at)pivotal(dot)io
> > <mailto:lewang(at)pivotal(dot)io>> wrote:
> >
> > You are correct that we currently go through each item in the leaf
> > page that
> > contains the given tid, specifically, the logic to retrieve all the
> > attribute
> > items inside a ZSAttStream is now moved to decode_attstream() in the
> > latest
> > code, and then in zsbt_attr_fetch() we again loop through each item we
> > previously retrieved from decode_attstream() and look for the given
> > tid.
> >
> >
> > Okay. Any idea why this new way of storing attribute data as streams
> > (lowerstream and upperstream) has been chosen just for the attributes
> > but not for tids? Are only attribute blocks compressed, but not the tid
> > blocks?
>
> Right, only attribute blocks are currently compressed. Tid blocks need
> to be modified when there are UPDATEs or DELETEs, so I think having to
> decompress and recompress them would be more costly. Also, there is no
> user data on the TID tree, and the Simple-8b encoded codewords used to
> represent the TIDs are already pretty compact. I'm not sure how much
> gain you would get from passing it through a general purpose compressor.
>
> I could be wrong though. We could certainly try it out, and see how it
> performs.
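
That makes sense. Just to convince myself why a general purpose
compressor may not gain much here: dense TID ranges reduce to runs of
tiny deltas, which is exactly what codeword packing like Simple-8b
exploits. A minimal sketch of the delta step (an illustration only,
not the actual Simple-8b codeword format used by zedstore):

#include <stdint.h>
#include <stdio.h>

/*
 * Delta-encode a sorted TID array.  For densely allocated TIDs the
 * deltas are mostly 1, so a codeword scheme such as Simple-8b can
 * pack many of them into a single 64-bit word.
 */
static void
encode_deltas(const uint64_t *tids, int ntids, uint64_t *deltas)
{
	uint64_t	prev = 0;

	for (int i = 0; i < ntids; i++)
	{
		deltas[i] = tids[i] - prev;
		prev = tids[i];
	}
}

int
main(void)
{
	uint64_t	tids[] = {1, 2, 3, 4, 5, 100, 101, 102};
	uint64_t	deltas[8];

	encode_deltas(tids, 8, deltas);
	for (int i = 0; i < 8; i++)
		printf("%llu ", (unsigned long long) deltas[i]);
	printf("\n");		/* prints: 1 1 1 1 1 95 1 1 */
	return 0;
}
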
>
> > One
> > optimization we can do is to tell decode_attstream() to stop
> > decoding at the
> > tid we are interested in. We can also apply other tricks to speed up the
> > lookups in the page, for fixed length attribute, it is easy to do
> > binary search
> > instead of linear search, and for variable length attribute, we can
> > probably
> > try something that we didn't think of yet.
> >
> >
> > I think we can probably ask decode_attstream() to stop once it has found
> > the tid that we are searching for but then we only need to do that for
> > Index Scans.
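
For the fixed length case, the binary search suggested above seems
straightforward once the items are laid out contiguously; roughly
something like this (hypothetical item layout, not the actual
zedstore page format):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical fixed-length item: a TID plus a 4-byte attribute. */
typedef struct
{
	uint64_t	tid;
	int32_t		value;
} FixedItem;

/*
 * Binary search over items sorted by TID: O(log n) per lookup
 * instead of a linear scan.  Returns NULL if the TID is absent.
 */
static const FixedItem *
find_item(const FixedItem *items, int nitems, uint64_t target)
{
	int			lo = 0;
	int			hi = nitems - 1;

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (items[mid].tid == target)
			return &items[mid];
		else if (items[mid].tid < target)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return NULL;
}
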
>
> I've been thinking that we should add a few "bookmarks" on long streams,
> so that you could skip e.g. to the midpoint in a stream. It's a tradeoff
> though; when you add more information for random access, it makes the
> representation less compact.
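
The bookmark idea sounds useful for index scans. I picture it as
something like the following (made-up names and layout, just to make
sure I understand the tradeoff): each bookmark pins down the first
TID decodable at some byte offset into the stream, so a fetch can
begin decoding from the nearest bookmark rather than from the
beginning, at the cost of a few extra bytes per stream.

#include <stdint.h>

/*
 * Hypothetical bookmark for a long compressed attribute stream
 * (not actual zedstore structures).
 */
typedef struct
{
	uint64_t	first_tid;		/* first TID at this offset */
	uint32_t	stream_offset;	/* byte offset into the stream */
} StreamBookmark;

/*
 * Choose the byte offset to start decoding from: the last bookmark
 * whose first_tid does not exceed the target TID.  Bookmarks are
 * few per stream, so a linear scan over them is cheap.
 */
static uint32_t
start_offset(const StreamBookmark *bms, int nbms, uint64_t target)
{
	uint32_t	offset = 0;

	for (int i = 0; i < nbms; i++)
	{
		if (bms[i].first_tid > target)
			break;
		offset = bms[i].stream_offset;
	}
	return offset;
}
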
>
> > Zedstore currently implements update as delete+insert, hence the old
> > tid is not
> > reused. We don't store the tuple in our UNDO log, and we only store the
> > transaction information in the UNDO log. Reusing the tid of the old
> > tuple means
> > putting the old tuple in the UNDO log, which we have not implemented
> > yet.
> >
> > Okay, so that means performing an update on a non-key attribute would also
> > require changes in the index table. In short, HOT update is currently
> > not possible with zedstore table. Am I right?
>
> That's right. There's a lot of potential gain for doing HOT updates. For
> example, if you UPDATE one column on every row on a table, ideally you
> would only modify the attribute tree containing that column. But that
> hasn't been implemented.

Thanks Heikki for your reply. After quite some time, I got a chance
today to look back into the code. I can see that you have changed the
tuple insertion and update mechanism a bit. As per the latest changes,
all the tuples being inserted/updated in a transaction are spooled
into a hash table and then flushed at transaction commit, and probably
due to this change the server crashes when trying to perform an UPDATE
operation on a zedstore table having 1 million records. See the
example below:

postgres=# create table t1(a int, b int) using zedstore;
postgres=# insert into t1 select i, i+10 from generate_series(1, 1000000) i;
postgres=# update t1 set b = 200;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The above UPDATE statement crashed due to what appears to be an
extensive memory leak.

Further, the UPDATE operation on a zedstore table is very slow. I
think that's because with a zedstore table we have to update all the
per-attribute btree data structures even if only one column is
updated, and that really sucks. Please let me know if there is some
other reason for it.

I also found some typos while going through the writeup in
zedstore_internal.h and thought I'd correct those. Attached is the
patch with the changes.

Thanks,
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
fix_typos.patch text/x-patch 1.6 KB
