Re: Zedstore - compressed in-core columnar storage

From: Alexandra Wang <lewang(at)pivotal(dot)io>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, DEV_OPS <devops(at)ww-it(dot)cn>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Taylor Vesely <tvesely(at)pivotal(dot)io>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-09-27 09:39:09
Message-ID: CACiyaSoMg684uPYGG4YsmgNpEmiMgoEuMm342_Xe8-5q9ma5yg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Ashutosh,

Sorry I indeed missed your question, thanks for the reminder!

On Wed, Sep 25, 2019 at 4:10 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
wrote:

> > Further, the UPDATE operation on zedstore table is very slow. I think
> > that's because in case of zedstore table we have to update all the
> > btree data structures even if one column is updated and that really
> > sucks. Please let me know if there is some other reason for it.
> >
>
> There was no answer for this in your previous reply. It seems like you
> missed it. As I said earlier, I tried performing UPDATE operation with
> optimised build and found that to update around 10 lacs record in
> zedstore table it takes around 24k ms whereas for normal heap table it
> takes 2k ms. Is that because in case of zedstore table we have to
> update all the Btree data structures even if one column is updated or
> there is some other reason for it. If yes, could you please let us
> know. FYI, I'm trying to update the table with just two columns.
>

Zedstore UPDATE operation currently fetches the old rows, updates the
undo pointers stored in the tid btree, and insert new rows into all
the attribute btrees with the new tids. So performance of updating one
column makes no difference from updating all the columns. That said,
the wider the table is, the longer it takes to update, regardless
updating one column or all the columns.

However, since your test table only has two columns, and we also
tested the same on a one-column table and got similar results as
yours, there is definitely room for optimizations. Attached file
zedstore_update_flames_lz4_first_update.svg is the profiling results
for the update query on a one-column table with 1M records. It spent
most of the time in zedstoream_fetch_row() and zsbt_tid_update(). For
zedstoream_fetch_row(), Taylor and I had some interesting findings
which I'm going to talk about next, I haven't dived into
zsbt_tid_update() yet and need to think about it more.

To understand what slows down zedstore UDPATE, Taylor and I did the
following test and profiling on a zedstore table with only one column.

postgres=# create table onecol(a int) using zedstore;
postgres=# insert into onecol select i from generate_series(1, 1000000) i;

-- Create view to count zedstore pages group by page types
postgres=# CREATE VIEW pg_zs_page_counts AS
SELECT
c.relnamespace::regnamespace,
c.oid,
c.relname,
pg_zs_page_type(c.oid, generate_series(0, c.relpages - 1)),
count(*)
FROM pg_am am
JOIN pg_class c ON (c.relam = am.oid)
WHERE am.amname='zedstore'
GROUP BY 1,2,3,4;

postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 32768 | onecol | BTREE | 640
public | 32768 | onecol | FREE | 90
public | 32768 | onecol | META | 1
(3 rows)

-- Run update query the first time
postgres=# update onecol set a = 200; -- profiling attached in
zedstore_update_flames_lz4_first_update.svg
Time: 28760.199 ms (00:28.760)

postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 32768 | onecol | BTREE | 6254
public | 32768 | onecol | FREE | 26915
public | 32768 | onecol | META | 1
(6 rows)

postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 0;
count
-------
5740
(1 row)

postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 1;
count
-------
514
(1 row)

postgres=# select * from pg_zs_btree_pages('onecol') where attno = 1 and
totalsz > 0;
blkno | nextblk | attno | level | lokey | hikey | nitems |
ncompressed | totalsz | uncompressedsz | freespace
-------+------------+-------+-------+---------+-----------------+--------+-------------+---------+----------------+-----------
730 | 6580 | 1 | 0 | 999901 | 1182451 | 1 |
1 | 3156 | 778480 | 4980
6580 | 13030 | 1 | 0 | 1182451 | 1380771 | 2 |
1 | 8125 | 859104 | 11
13030 | 19478 | 1 | 0 | 1380771 | 1579091 | 2 |
1 | 8125 | 859104 | 11
19478 | 25931 | 1 | 0 | 1579091 | 1777411 | 2 |
1 | 8125 | 859104 | 11
25931 | 32380 | 1 | 0 | 1777411 | 1975731 | 2 |
1 | 8125 | 859104 | 11
32380 | 4294967295 | 1 | 0 | 1975731 | 281474976645120 | 2 |
1 | 2033 | 105016 | 6103
(6 rows)

-- Run update query the second time
postgres=# update onecol set a = 200; -- profiling attached in
zedstore_update_flames_lz4_second_update.svg
Time: 267135.703 ms (04:27.136)

As you can see, it took 28s to run the update query for the first
time, it was slow but expected. However, when we run the same update
query again it took 4 mins and 27s, almost 10x slower than the first
run. The profiling result of the second update is attached, it shows
that 57% of all the time it's doing decode_chunk_fixed(), which is
used for decoding a chunk in a attstream so that we can confirm
whether the tid of interest is in that chunk and fetch it if true.
Right now, each chunk contains at most 60 tids for fixed length
attributes and at most 30 tids for varlena attributes, and we decode
all the tids each chunk contains one by one.

Going back to our test, before and after the first UPDATE, the BTREE
page counts increased from 640 to 6254, however, only 6 out of the 514
attribute btree pages actually store data. It seems like a bug that we
left behind 508 empty btree pages, we should fix it, but let's put it
aside as a seperate problem. With 6 pages we stored 1M rows, each page
contains as many as 198,320 tids. This is the reason why the second
UPDATE spent so much time at decoding chunks. The btree structure only
helps us locate the page for a given tid, but once we get to the page,
the better compression we have, the more chunks we can pack in one
page, the more calls per page to decode_chunk(). Even worse, unlike
INSERT, UPDATE currently initialize a new fetcher every time it
fetches a new row, which means it doesn't remember the last position
the decoder was at in the attstream, so everytime it fetches a new
row, the decoder starts all over from the beginning of the attstream,
and we are talking about an attstream that could have 198,320 records.
We also haven't done any optimization inside of decode_chunk() itself,
like checking first and last tid, stop decoding once found the tid, or
doing binary search for fixed length attributes.

So, I think what slows down the second UPDATE are also part of the
reasons why the first UPDATE is slow. We still haven't done any
optimization for UPDATE so far, probably because we didn't expect it
to be better than heap, but we should try to make it not too much
worse.

> Further, In the latest code I'm getting this warning message when it
> is compiled using -O2 optimisation flag.
>
> zedstore_tidpage.c: In function ‘zsbt_collect_dead_tids’:
> zedstore_tidpage.c:978:10: warning: ‘page’ may be used uninitialized
> in this function [-Wmaybe-uninitialized]
> opaque = ZSBtreePageGetOpaque(page);
> ^
> Attached is the patch that fixes it.
>

Applied. Thanks!

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2019-09-27 09:40:46 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
Previous Message Amit Kapila 2019-09-27 09:03:32 Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions