From: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
To: Alexandra Wang <lewang(at)pivotal(dot)io>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, DEV_OPS <devops(at)ww-it(dot)cn>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Taylor Vesely <tvesely(at)pivotal(dot)io>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-09-30 10:38:00
Message-ID: CAE9k0P=E_8pTPOQLcv5kRXTKSLU0Q_YjBq8eWB0yWvwS_yNqEw@mail.gmail.com
Lists: pgsql-hackers
On Fri, Sep 27, 2019 at 3:09 PM Alexandra Wang <lewang(at)pivotal(dot)io> wrote:
>
> Hi Ashutosh,
>
> Sorry I indeed missed your question, thanks for the reminder!
>
> On Wed, Sep 25, 2019 at 4:10 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>>
>> > Further, the UPDATE operation on zedstore table is very slow. I think
>> > that's because in case of zedstore table we have to update all the
>> > btree data structures even if one column is updated and that really
>> > sucks. Please let me know if there is some other reason for it.
>> >
>>
>> There was no answer for this in your previous reply. It seems like you
>> missed it. As I said earlier, I tried performing the UPDATE operation
>> with an optimised build and found that updating around 1 million (10
>> lakh) records in a zedstore table takes around 24k ms, whereas for a
>> normal heap table it takes 2k ms. Is that because in a zedstore table
>> we have to update all the btree data structures even if only one
>> column is updated, or is there some other reason for it? If so, could
>> you please let us know. FYI, the table I'm updating has just two
>> columns.
>
>
> The zedstore UPDATE operation currently fetches the old rows, updates
> the undo pointers stored in the tid btree, and inserts new rows into
> all the attribute btrees with the new tids. So updating one column
> performs no differently from updating all the columns. That said, the
> wider the table is, the longer an update takes, regardless of whether
> one column or all the columns are updated.
>
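If I understand that flow correctly, it behaves roughly like the C
sketch below (the names are hypothetical, not the actual zedstore
functions); the point is that the per-row cost grows with the number
of attribute btrees no matter which column changed:

/*
 * Hypothetical model of the UPDATE path described above -- not the
 * real zedstore API, just an illustration of its cost profile.
 */
typedef unsigned long zstid;

static void fetch_old_row(zstid tid) { /* read the old row version */ }
static void update_undo_pointer(zstid tid) { /* tid btree bookkeeping */ }
static void insert_new_datum(int attno, zstid tid) { /* attr btree insert */ }

static void
update_one_row(zstid old_tid, zstid new_tid, int natts)
{
    fetch_old_row(old_tid);
    update_undo_pointer(old_tid);

    /*
     * The new row version is inserted into every attribute btree, even
     * when only one column actually changed, so the loop count depends
     * on the table's width, not on which columns were updated.
     */
    for (int attno = 1; attno <= natts; attno++)
        insert_new_datum(attno, new_tid);
}
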
> However, since your test table has only two columns, and we also
> tested the same on a one-column table and got results similar to
> yours, there is definitely room for optimization. The attached file
> zedstore_update_flames_lz4_first_update.svg contains the profiling
> results for the update query on a one-column table with 1M records.
> It spent most of the time in zedstoream_fetch_row() and
> zsbt_tid_update(). For zedstoream_fetch_row(), Taylor and I had some
> interesting findings which I'm going to talk about next. I haven't
> dived into zsbt_tid_update() yet and need to think about it more.
>
> To understand what slows down zedstore UPDATE, Taylor and I did the
> following test and profiling on a zedstore table with only one column.
>
> postgres=# create table onecol(a int) using zedstore;
> postgres=# insert into onecol select i from generate_series(1, 1000000) i;
>
> -- Create view to count zedstore pages group by page types
> postgres=# CREATE VIEW pg_zs_page_counts AS
>     SELECT
>         c.relnamespace::regnamespace,
>         c.oid,
>         c.relname,
>         pg_zs_page_type(c.oid, generate_series(0, c.relpages - 1)),
>         count(*)
>     FROM pg_am am
>     JOIN pg_class c ON (c.relam = am.oid)
>     WHERE am.amname = 'zedstore'
>     GROUP BY 1, 2, 3, 4;
>
> postgres=# select * from pg_zs_page_counts;
> relnamespace | oid | relname | pg_zs_page_type | count
> --------------+-------+---------+-----------------+-------
> public | 32768 | onecol | BTREE | 640
> public | 32768 | onecol | FREE | 90
> public | 32768 | onecol | META | 1
> (3 rows)
>
> -- Run update query the first time
> postgres=# update onecol set a = 200; -- profiling attached in zedstore_update_flames_lz4_first_update.svg
> Time: 28760.199 ms (00:28.760)
>
> postgres=# select * from pg_zs_page_counts;
> relnamespace | oid | relname | pg_zs_page_type | count
> --------------+-------+---------+-----------------+-------
> public | 32768 | onecol | BTREE | 6254
> public | 32768 | onecol | FREE | 26915
> public | 32768 | onecol | META | 1
> (3 rows)
>
Oops, the first UPDATE created a lot of free pages.
Just FYI, when the second update was run, it took around 5 mins (almost
10-12 times longer than the 1st UPDATE took), but this time no more
free pages were added; instead, the already available free pages were
used. Here are the stats observed before and after the second update,
before:
=====
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 16390 | t1 | FREE | 26915
public | 16390 | t1 | BTREE | 7277
public | 16390 | t1 | META | 1
(3 rows)
after:
====
postgres=# select * from pg_zs_page_counts;
relnamespace | oid | relname | pg_zs_page_type | count
--------------+-------+---------+-----------------+-------
public | 16390 | t1 | FREE | 26370
public | 16390 | t1 | BTREE | 7822
public | 16390 | t1 | META | 1
(3 rows)
You can see that around 545 BTREE pages were added this time, and they
were all taken from the free pages list.
> postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 0;
> count
> -------
> 5740
> (1 row)
>
This could be because tid blocks are currently not compressed, unlike
the other attribute blocks.
> postgres=# select count(*) from pg_zs_btree_pages('onecol') where attno = 1;
> count
> -------
> 514
> (1 row)
>
> postgres=# select * from pg_zs_btree_pages('onecol') where attno = 1 and totalsz > 0;
> blkno | nextblk | attno | level | lokey | hikey | nitems | ncompressed | totalsz | uncompressedsz | freespace
> -------+------------+-------+-------+---------+-----------------+--------+-------------+---------+----------------+-----------
> 730 | 6580 | 1 | 0 | 999901 | 1182451 | 1 | 1 | 3156 | 778480 | 4980
> 6580 | 13030 | 1 | 0 | 1182451 | 1380771 | 2 | 1 | 8125 | 859104 | 11
> 13030 | 19478 | 1 | 0 | 1380771 | 1579091 | 2 | 1 | 8125 | 859104 | 11
> 19478 | 25931 | 1 | 0 | 1579091 | 1777411 | 2 | 1 | 8125 | 859104 | 11
> 25931 | 32380 | 1 | 0 | 1777411 | 1975731 | 2 | 1 | 8125 | 859104 | 11
> 32380 | 4294967295 | 1 | 0 | 1975731 | 281474976645120 | 2 | 1 | 2033 | 105016 | 6103
> (6 rows)
>
> -- Run update query the second time
> postgres=# update onecol set a = 200; -- profiling attached in zedstore_update_flames_lz4_second_update.svg
> Time: 267135.703 ms (04:27.136)
>
> As you can see, it took 28s to run the update query the first time,
> which was slow but expected. However, when we ran the same update
> query again it took 4 mins and 27s, almost 10x slower than the first
> run. The profiling result of the second update is attached; it shows
> that 57% of all the time is spent in decode_chunk_fixed(), which is
> used to decode a chunk in an attstream so that we can confirm whether
> the tid of interest is in that chunk, and fetch it if so. Right now,
> each chunk contains at most 60 tids for fixed-length attributes and at
> most 30 tids for varlena attributes, and we decode all the tids each
> chunk contains one by one.
>
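To make that concrete, here is a minimal, self-contained C model of
the linear per-chunk decode described above (hypothetical code, not
the actual decode_chunk_fixed()):

#include <stdbool.h>

typedef unsigned long zstid;

#define FIXED_CHUNK_MAX_TIDS 60    /* per the description above */

/*
 * Model of the current behaviour: every tid the chunk contains is
 * decoded one by one before we can tell whether the target is there.
 */
static bool
chunk_contains_tid(const zstid *encoded_chunk, int ntids, zstid target)
{
    zstid decoded[FIXED_CHUNK_MAX_TIDS];

    for (int i = 0; i < ntids; i++)
        decoded[i] = encoded_chunk[i];    /* stands in for real decoding */

    for (int i = 0; i < ntids; i++)
    {
        if (decoded[i] == target)
            return true;
    }
    return false;
}

At 60 tids per chunk, a well-packed page means thousands of these
decode loops for every row fetched.
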
> Going back to our test: across the first UPDATE, the BTREE page count
> increased from 640 to 6254; however, only 6 of the 514 attribute btree
> pages actually store data. It seems like a bug that we left behind 508
> empty btree pages, and we should fix it, but let's put that aside as a
> separate problem. With 6 pages we stored 1M rows, so a single page can
> contain as many as 198,320 tids. This is the reason the second UPDATE
> spent so much time decoding chunks. The btree structure only helps us
> locate the page for a given tid, but once we get to the page, the
> better the compression, the more chunks we can pack into one page, and
> the more calls per page to decode_chunk(). Even worse, unlike INSERT,
> UPDATE currently initializes a new fetcher every time it fetches a new
> row, which means it doesn't remember the decoder's last position in
> the attstream; so every time it fetches a new row, the decoder starts
> all over from the beginning of the attstream, and we are talking about
> an attstream that could have 198,320 records. We also haven't done any
> optimization inside decode_chunk() itself, such as checking the first
> and last tid, stopping decoding once the tid is found, or doing a
> binary search for fixed-length attributes, as sketched below.
>
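For what it's worth, the fixed-length case sketched above could look
something like this (again hypothetical code, not a patch; it assumes
tids within a chunk are stored in ascending order):

#include <stdbool.h>

typedef unsigned long zstid;

/*
 * Sketch of the optimizations mentioned above: reject the chunk with
 * a cheap first/last-tid check, then binary-search instead of
 * decoding every tid.
 */
static bool
chunk_contains_tid_fast(const zstid *chunk_tids, int ntids, zstid target)
{
    int lo, hi;

    /* first/last tid check: skip the chunk without a full decode */
    if (ntids == 0 || target < chunk_tids[0] || target > chunk_tids[ntids - 1])
        return false;

    /* tids are ascending (assumed), so binary search for the target */
    lo = 0;
    hi = ntids - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (chunk_tids[mid] == target)
            return true;
        else if (chunk_tids[mid] < target)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}
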
> So, I think the things that slow down the second UPDATE are also part
> of the reason the first UPDATE is slow. We still haven't done any
> optimization for UPDATE so far, probably because we didn't expect it
> to be better than heap, but we should try to make it not too much
> worse.
>
That's right. If the regression is too severe, it would be difficult
to accept. So, some fix is certainly required here.
>>
>> Further, in the latest code I'm getting this warning message when it
>> is compiled using the -O2 optimisation flag.
>>
>> zedstore_tidpage.c: In function ‘zsbt_collect_dead_tids’:
>> zedstore_tidpage.c:978:10: warning: ‘page’ may be used uninitialized
>> in this function [-Wmaybe-uninitialized]
>> opaque = ZSBtreePageGetOpaque(page);
>> ^
>> Attached is the patch that fixes it.
>
>
> Applied. Thanks!
Thanks for that, and for sharing the detailed information on why the
update operation on a zedstore table is so slow.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com