Re: Zedstore - compressed in-core columnar storage

From: Alexandra Wang <lewang(at)pivotal(dot)io>
To: Taylor Vesely <tvesely(at)pivotal(dot)io>
Cc: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Ashwin Agrawal <aagrawal(at)pivotal(dot)io>, DEV_OPS <devops(at)ww-it(dot)cn>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2020-03-30 20:00:05
Message-ID: CACiyaSqAOXngeu+yx1WgVfeLxGzMdTiyGamh2VG1iQgROOt_Dw@mail.gmail.com
Lists: pgsql-hackers

Hello,

On Wed, Oct 30, 2019 at 3:34 PM Taylor Vesely <tvesely(at)pivotal(dot)io> wrote:
> Because zedstore resembled random I/O, the read speed was
> significantly hindered on our single disk. As a result, we saw ~150x
> slower read speeds.

Deep and I have committed a fix for this. The root cause of the problem
is that attribute tree (and tid tree) pages are not contiguous enough on
disk, especially if we are loading data concurrently into the same table.
The effect of the non-contiguity is especially felt on rotational disks,
and it is magnified by increasing the number of sessions that load the
table concurrently.

Since a base requirement for a column store is that blocks for a single
column be physically adjacent (or nearly adjacent), we sought to optimize
for this requirement.

What we have done is to introduce attribute-level free page maps (FPMs)
and to batch up relation extension requests. We have introduced a new
reloption, zedstore_rel_extension_factor: whenever we want to extend the
relation by a single page for the tid/attribute tree, we instead extend
it by zedstore_rel_extension_factor blocks. We return the one block
requested and prepend the extra blocks to that tree's attribute-level
FPM. This makes the blocks available to other concurrent backends and
thus, in spite of backend-interleaved flushes into the same attribute
tree, we see more contiguity of leaf blocks.

We reason about contiguity of blocks by making some major assumptions:
we consider two blocks to be near each other if their block numbers are
close to each other (refer: BufferGetBlockNumber()), and we assume that
two successive relation extension requests yield blocks whose block
numbers are close to each other.

Recycling of pages for the attribute and tid trees is also now done at
the attribute-tree level.

Experiment results and methodology:

Metric used to measure performance -> I/O read time, as reported by the
“I/O Timings” field in the output of explain (analyze, buffers, timing,
verbose) with the track_io_timing GUC on. Before every explain run, we
restart the database to flush the buffers and clear the OS page cache.
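
For reference, a single measurement run looked roughly like the following
(a sketch; the query varies per experiment, and track_io_timing is also
set globally in the GUC list below):

set track_io_timing = on;
explain (analyze, buffers, timing, verbose)
    select ss_sold_date_sk from store_sales;
-- The "I/O Timings" line in the explain output is the read time reported.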

Experiment parameters: TPC-DS scale = 270, table = store_sales,
opt_level = -O2, #parallel COPY sessions loading store_sales = 16.
N = zedstore_rel_extension_factor
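
Each of the 16 sessions loaded its own chunk of the generated data with a
plain COPY, roughly like this (the file path and COPY options here are
illustrative, not taken from our actual load scripts):

copy store_sales from '/data/tpcds/store_sales_01.dat'
    with (format text, delimiter '|');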

GUCs used:

shared_buffers: 10GB
max_wal_size: 1GB
checkpoint_flush_after: 1MB
max_parallel_workers: 8
max_parallel_maintenance_workers: 8
maintenance_work_mem: 4GB
log_statement: all
effective_cache_size: 32GB
track_io_timing: on

For rotational disks:

Query: select ss_sold_date_sk from store_sales;
  Heap:           Table size = 112G.  I/O time = 115s.   Total exec time = 212s
  Zed (w/o fix):  Table size = 59G.   I/O time = 634s.   Total exec time = 730s
  Zed (N=32):     Table size = 59G.   I/O time = 91s.    Total exec time = 175s
  Zed (N=512):    Table size = 59G.   I/O time = 7s.     Total exec time = 87s
  Zed (N=4096):   Table size = 59G.   I/O time = 2.5s.   Total exec time = 82s

Query: select * from store_sales;
  Heap:           Table size = 112G.  I/O time = 130s.   Total exec time = 214s
  Zed (w/o fix):  Table size = 59G.   I/O time = 2401s.  Total exec time = 2813s
  Zed (N=32):     Table size = 59G.   I/O time = 929s.   Total exec time = 1300s
  Zed (N=512):    Table size = 59G.   I/O time = 485s.   Total exec time = 847s
  Zed (N=4096):   Table size = 59G.   I/O time = 354s.   Total exec time = 716s

We also saw discernible differences in I/O time for scale = 50 (table
size = 10G for Zedstore and 21G for heap). Results not reported for
brevity.

Our fix doesn't impact COPY performance, so we saw no difference in the time
taken to load the data into store_sales.

For NVMe SSDs:

We see no discernible differences in I/O times with and without the fix
(performance for select * was slightly worse for N=4096). Here are some
of the results:

Query: select ss_sold_date_sk from store_sales;
  Heap:           Table size = 112G.  I/O time = 59s.    Total exec time = 123s
  Zed (w/o fix):  Table size = 59G.   I/O time = 20s.    Total exec time = 79s
  Zed (N=4096):   Table size = 59G.   I/O time = 21s.    Total exec time = 87s

Query: select * from store_sales;
  Heap:           Table size = 112G.  I/O time = 64s.    Total exec time = 127s
  Zed (w/o fix):  Table size = 61G.   I/O time = 449s.   Total exec time = 757s
  Zed (N=4096):   Table size = 61G.   I/O time = 487s.   Total exec time = 812s

Analysis of fix:

The following query inspects the block distance (the absolute difference
in block number) between two logically adjacent leaf blocks for the
ss_sold_date_sk attribute of store_sales. It shows us the distribution of
the block distances in the ss_sold_date_sk attribute tree (nextblk =
4294967295, i.e. InvalidBlockNumber, marks the rightmost leaf and is
excluded). Output is limited for brevity.

with blk_dist(dist) as (
    select abs(nextblk - blkno) as dist
    from pg_zs_btree_pages('store_sales'::regclass)
    where attno = 1 and level = 0 and nextblk != 4294967295
)
select dist, count(dist) as cnt
from blk_dist
group by dist
order by cnt desc
limit 5;

W/o fix: #parallel_copies=16
W/ fix:  #parallel_copies=16, extension_factor=16

  W/o fix            W/ fix

 dist | cnt        dist | cnt
------+-----      ------+------
   25 |  89           1 | 3228
   26 |  83           2 | 3192
   23 |  78           3 | 2664
    1 |  75           4 | 2218
   29 |  74           5 | 1866

We can see that with the fix, and increasingly so with a higher
zedstore_rel_extension_factor, many more logically adjacent leaf blocks
end up at small block distances.
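
For readers who want a single number, the same pg_zs_btree_pages() output
can also be summarized as the mean and median block distance over the
whole attribute tree (a follow-on query, not part of the patch):

with blk_dist(dist) as (
    select abs(nextblk - blkno)
    from pg_zs_btree_pages('store_sales'::regclass)
    where attno = 1 and level = 0 and nextblk != 4294967295
)
select avg(dist) as mean_dist,
       percentile_cont(0.5) within group (order by dist) as median_dist
from blk_dist;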

Implications of fix:

1. We have to keep track of the FPM heads for the attribute/tid trees in
the meta-page, and since we don't have an extensible meta-page yet, this
further limits the number of columns Zedstore can support. We will get
around to fixing that eventually.

2. Worst case extra space wasted on disk from free pages that could
linger after a bulk load = zedstore_rel_extension_factor * (#attributes + 1)
* 8192 bytes (one tree per attribute, plus the tid tree).

For zedstore_rel_extension_factor = 16, #attributes = 23:
    wastage = 16 * 24 * 8192 / 1024 / 1024 = 3M
For zedstore_rel_extension_factor = 4096, #attributes = 23:
    wastage = 4096 * 24 * 8192 / 1024 / 1024 = 768M

Note: The free pages left behind can, of course, be used by subsequent
operations on the table.

In conclusion, increasing zedstore_rel_extension_factor for a wide table
may lead to bloating of the relfile. The percentage of bloat would also
be magnified if the table doesn't have a lot of data.

3. The amount of extra WAL written (since we now place/remove the extra
blocks on the FPMs, something we never did without this fix) is
independent of zedstore_rel_extension_factor; we found that we wrote
approximately 14M of extra WAL for every 1G of relfile.

Guidance on setting zedstore_rel_extension_factor:

Users should set a high zedstore_rel_extension_factor when they are
loading a significant amount of data onto rotational disks, regardless of
the degree of concurrency.
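
Since zedstore_rel_extension_factor is a reloption, it can be set per
table. Assuming the standard reloption syntax applies to it (a sketch;
store_sales_zed is just an illustrative table name), that looks like:

-- At creation time:
create table store_sales_zed (like store_sales)
    using zedstore
    with (zedstore_rel_extension_factor = 512);

-- Or on an existing table:
alter table store_sales set (zedstore_rel_extension_factor = 512);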

Attached is a patch with our changes: [1]
Also attached is a rebased version of Zedstore on latest PG master. [2]
Github branch for Zedstore: [3]

[1] 0001-Attribute-level-FPMs-and-rel-extension-batching.patch
[2] v4-zedstore.patch
[3] https://github.com/greenplum-db/postgres/tree/zedstore

--
Alex & Deep

Attachment Content-Type Size
0001-Attribute-level-FPMs-and-rel-extension-batching.patch application/octet-stream 23.8 KB
v4-zedstore.patch application/octet-stream 2.0 MB
