From: | Hubert Zhang <hzhang(at)pivotal(dot)io> |
---|---|
To: | Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, Gang Xiong <gxiong(at)pivotal(dot)io>, Asim R P <apraveen(at)pivotal(dot)io>, Ning Yu <nyu(at)pivotal(dot)io> |
Subject: | Re: Yet another vectorized engine |
Date: | 2019-12-04 09:13:57 |
Message-ID: | CAB0yrenYmbYsioz167OrcO_8wVsvb=MA381-McLNcjEb1EJQYg@mail.gmail.com |
Lists: | pgsql-hackers |
Thanks Konstantin for your detailed review!
On Tue, Dec 3, 2019 at 5:58 PM Konstantin Knizhnik <
k(dot)knizhnik(at)postgrespro(dot)ru> wrote:
>
>
> On 02.12.2019 4:15, Hubert Zhang wrote:
>
>
> The prototype extension is at https://github.com/zhangh43/vectorize_engine
>
>
> I am very sorry that I have not followed this link.
> A few questions concerning your design decisions:
>
> 1. Will it be more efficient to use native arrays in vtype instead of
> an array of Datum? I think it will allow the compiler to generate more efficient
> code for operations with float4 and int32 types.
> It is possible to use a union to keep the size of vtype fixed.
Yes, I'm also considering that. When scanning a column store, the column batch
is loaded into a contiguous memory region: for int32 the region is 4*BATCHSIZE
bytes, while for int16 it is 2*BATCHSIZE bytes. So with a native array a single
memcpy would be enough to fill the vtype batch.
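To make that concrete, here is a minimal sketch of a union-based vtype along
the lines you suggested. This is only an illustration, not the current
definition in vectorize_engine, and the field names are assumptions:

/* Sketch only: a fixed-size vtype whose payload is a union of native
 * arrays, so int32/float4 loops can run over plain C arrays while the
 * struct size stays constant regardless of the element type. */
#include "postgres.h"

#define BATCHSIZE 1024

typedef struct vtype
{
    Oid     elemtype;               /* element type OID of this column batch */
    int     dim;                    /* number of valid rows in the batch */
    bool    isnull[BATCHSIZE];      /* per-row NULL flags */
    union
    {
        int16   i2[BATCHSIZE];
        int32   i4[BATCHSIZE];
        int64   i8[BATCHSIZE];
        float4  f4[BATCHSIZE];
        float8  f8[BATCHSIZE];
        Datum   d[BATCHSIZE];       /* fallback for by-reference types */
    } values;
} vtype;

With that layout, a column-store scan of an int32 column could fill values.i4
with a single memcpy of 4*BATCHSIZE bytes.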
> 2. Why does VectorTupleSlot contain an array (batch) of heap tuples rather than
> vectors (arrays of vtype)?
>
a. VectorTupleSlot stores the array of vtype in the tts_values field, which
minimizes code changes and lets us reuse functions like ExecProject. Of course,
we could use a separate field to store the vtypes.
b. VectorTupleSlot also contains an array of heap tuples, which is used for heap
tuple deforming. In fact, the tuples in a batch may span many pages, so we also
need to pin an array of related pages instead of just one page.
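For reference, a rough sketch of the slot layout described above; the field
names are approximate and this is an outline rather than the exact struct in
the extension (BATCHSIZE as in the previous sketch):

#include "postgres.h"
#include "access/htup.h"
#include "executor/tuptable.h"
#include "storage/buf.h"

/* Sketch of a vectorized slot: the embedded TupleTableSlot keeps existing
 * executor code (e.g. ExecProject) working, with each tts_values[i]
 * pointing to a vtype batch, while the extra arrays hold the per-row
 * tuples and buffer pins needed to deform a batch that spans pages. */
typedef struct VectorTupleSlot
{
    TupleTableSlot  tts;                     /* base slot reused by the executor */
    int             dim;                     /* rows currently in the batch */
    bool            skip[BATCHSIZE];         /* rows filtered out by quals */
    HeapTupleData   tts_tuples[BATCHSIZE];   /* heap tuples to deform */
    Buffer          tts_buffers[BATCHSIZE];  /* pins for the pages those tuples live on */
} VectorTupleSlot;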
> 3. Why do you have to implement your own plan_tree_mutator instead of using
> expression_tree_mutator?
>
I also want to replace plan nodes, e.g. Agg -> CustomScan (with a VectorAgg
implementation). expression_tree_mutator cannot be used to mutate plan nodes
such as Agg, am I right?
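Roughly, the mutator has to dispatch on Plan node tags rather than expression
nodes, which is why expression_tree_mutator does not fit. A simplified sketch
(make_vectorscan and make_vectoragg are hypothetical helpers standing in for
the real CustomScan constructors):

#include "postgres.h"
#include "nodes/plannodes.h"

/* Hypothetical constructors that wrap a node in a vectorized CustomScan. */
extern Plan *make_vectorscan(SeqScan *scan);
extern Plan *make_vectoragg(Agg *agg);

/* Walk the Plan tree and substitute vectorized CustomScan nodes where we
 * know how to handle the node type; raise an ERROR otherwise, so the
 * caller can fall back to the original plan. */
Node *
plan_tree_mutator(Node *node, void *context)
{
    if (node == NULL)
        return NULL;

    switch (nodeTag(node))
    {
        case T_SeqScan:
            return (Node *) make_vectorscan((SeqScan *) node);

        case T_Agg:
            {
                Agg *agg = (Agg *) node;

                /* vectorize the child scan first, then the Agg itself */
                agg->plan.lefttree = (Plan *)
                    plan_tree_mutator((Node *) agg->plan.lefttree, context);
                return (Node *) make_vectoragg(agg);
            }

        default:
            elog(ERROR, "vectorize: plan node %d is not supported",
                 (int) nodeTag(node));
    }

    return NULL;                /* not reached */
}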
> 4. As far as I understand, you now always try to replace SeqScan with your
> custom vectorized scan. But it makes sense only if there are quals for this
> scan or aggregation is performed.
> In other cases batch+unbatch just adds extra overhead, doesn't it?
>
There is probably extra overhead for the heap format and a query like 'select i
from t;' with no qual, projection, or aggregation.
But with a column store, VectorScan could read batches directly, with no
additional batching cost. A column store is the better choice for OLAP queries.
Can we conclude that it would be better to use the vector engine for OLAP
queries and the row engine for OLTP queries?
> 5. Throwing and catching an exception for queries which cannot be vectorized
> seems to be not the safest and most efficient way of handling such cases.
> Maybe it is better to return an error code in plan_tree_mutator and
> propagate this error upstairs?
Yes. As for efficiency, another way is to vectorize only some plan nodes, leave
the other nodes non-vectorized, and add a batch/unbatch layer between them (is
this what you meant by "propagate this error upstairs"?). As you mentioned, this
could introduce additional overhead. Are there any other good approaches?
What do you mean by "not safest"? PG_CATCH will receive the ERROR and fall back
to the original non-vectorized plan.
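For clarity, a hedged sketch of that fallback path (vectorize_post_planner is
a made-up name, and the planner_hook wiring is omitted), assuming the
plan_tree_mutator sketched above raises ERROR on unsupported nodes:

#include "postgres.h"
#include "nodes/nodes.h"
#include "nodes/plannodes.h"

extern Node *plan_tree_mutator(Node *node, void *context);

/* Try to produce a vectorized copy of the plan; if plan_tree_mutator
 * raises an ERROR for an unsupported node, swallow it and return the
 * original, non-vectorized plan unchanged. */
static PlannedStmt *
vectorize_post_planner(PlannedStmt *stmt)
{
    PlannedStmt  *vstmt = NULL;
    MemoryContext oldctx = CurrentMemoryContext;

    PG_TRY();
    {
        vstmt = copyObject(stmt);
        vstmt->planTree = (Plan *)
            plan_tree_mutator((Node *) vstmt->planTree, NULL);
    }
    PG_CATCH();
    {
        MemoryContextSwitchTo(oldctx);
        FlushErrorState();      /* discard the error; we fall back below */
        vstmt = NULL;
    }
    PG_END_TRY();

    return vstmt ? vstmt : stmt;
}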
> 6. Have you experimented with different batch sizes? I have done similar
> experiments in VOPS and found out that tile sizes larger than 128 do not
> provide a noticeable increase in performance.
> You are currently using batch size 1024, which is significantly larger than
> the typical number of tuples on one page.
>
Good point. We will do some experiments on it.
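If it helps those experiments, one purely hypothetical way to vary the batch
size without recompiling would be a custom GUC; since the current structures
use fixed-length BATCHSIZE arrays, adopting this would also mean allocating
the per-batch arrays dynamically:

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* Hypothetical knob for batch-size experiments; the prototype currently
 * uses a compile-time BATCHSIZE constant instead. */
static int vectorize_batch_size = 1024;

void _PG_init(void);

void
_PG_init(void)
{
    DefineCustomIntVariable("vectorize.batch_size",
                            "Number of tuples processed per vectorized batch.",
                            NULL,
                            &vectorize_batch_size,
                            1024,           /* default */
                            64,             /* min */
                            65536,          /* max */
                            PGC_POSTMASTER, /* fixed for the life of the server */
                            0,
                            NULL, NULL, NULL);
}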
> 7. How can a vectorized scan be combined with parallel execution (it is
> already supported in 9.6, isn't it?)
>
We haven't implemented it yet, but the idea is the same as in the non-parallel
case: copy the current parallel scan and implement a vectorized Gather, keeping
VectorTupleSlot as the interface between them.
Our basic idea is to reuse most of the current PG executor logic, make it
vectorized, and then tune performance gradually.
--
Thanks
Hubert Zhang