Re: Zedstore - compressed in-core columnar storage

From: Ashwin Agrawal <aagrawal(at)pivotal(dot)io>
To: Ajin Cherian <itsajin(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Zedstore - compressed in-core columnar storage
Date: 2019-05-24 22:37:08
Message-ID: CALfoeivu-n5o8Juz9wW+kTjnis6_+rfMf+zOTky1LiTVk-ZFjA@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 23, 2019 at 7:30 PM Ajin Cherian <itsajin(at)gmail(dot)com> wrote:

> Hi Ashwin,
>
> - how to pass the "column projection list" to table AM? (as stated in
> initial email, currently we have modified table am API to pass the
> projection to AM)
>
> We were working on a similar columnar storage using pluggable APIs; one
> idea that we thought of was to modify the scan slot based on the targetlist
> to have only the relevant columns in the scan descriptor. This way the
> table AMs are passed a slot with only relevant columns in the descriptor.
> Today we do something similar to the result slot using
> ExecInitResultTypeTL(), now do it to the scan tuple slot as well. So
> somewhere after creating the scan slot using ExecInitScanTupleSlot(), call
> a table am handler API to modify the scan tuple slot based on the
> targetlist, a probable name for the new table am handler would be:
> exec_init_scan_slot_tl(PlanState *planstate, TupleTableSlot *slot).
>

Interesting.

Though this reads as a hacky and not very clean approach to me. Reasons:

- The memory allocation and initialization for the slot descriptor was
already done in ExecInitScanTupleSlot(), so exec_init_scan_slot_tl()
would redo a lot of that work. ExecInitScanTupleSlot() ideally just
points to the tupleDesc from the Relation object, whereas
exec_init_scan_slot_tl() would free the existing tupleDesc and
allocate a fresh one. Plus, it can't point to the Relation's tuple
desc but essentially needs to craft one.

- As discussed in thread [1], several places want to use different
slots for the same scan, which means the descriptor would have to be
modified on every such occasion even though it remains the same
throughout the scan. Some extra code could be added to keep the old
tuple descriptor around and reuse it for the next slot, but that again
adds code complexity.

- The AM needs to know which attributes to scan in terms of the
relation's attribute numbers. How would the tupledesc convey that?
TupleDescData's attrs currently carries the info for attnum at
attrs[attnum - 1]. If the TupleDesc needs to convey an arbitrary subset
of attributes to scan, this relationship has to be broken:
attrs[offset] would describe some attribute of the relation for which
offset != (attrs[offset].attnum - 1). I am not sure how many places in
the code rely on that relationship to get information.

- The tupledesc provides a lot of information, not just the attribute
numbers to scan. For example, TupleConstr carries information about
the default value of a column. If the AM layer has to modify the
existing slot's tupledesc, it would have to copy over such information
as well. Today this information is fetched using attnum as the offset
into the constr->missing array. If this information is retained, how
will the constr array be constructed? Will the array contain values
only for the columns to scan, or will it be the constr array as is
from the Relation's tuple descriptor, as it is today? Constructing the
constr array afresh seems like overhead, and not constructing it
afresh leaves a mismatch between natts and the number of array
elements.
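
To make the attnum concern in the list above concrete, here is a
minimal, self-contained C sketch. It does not use PostgreSQL headers:
MiniAttr, MiniTupleDesc and both lookup helpers are hypothetical
stand-ins, not the real TupleDescData. It only illustrates that a
positional lookup via attrs[attnum - 1] works on a full descriptor but
returns the wrong attribute once the descriptor is narrowed to the
projected columns.

```c
#include <stddef.h>

/* Hypothetical stand-in for one attribute entry; not PostgreSQL's struct. */
typedef struct MiniAttr
{
    int         attnum;     /* 1-based attribute number in the relation */
    const char *attname;
} MiniAttr;

/* Hypothetical stand-in for a tuple descriptor. */
typedef struct MiniTupleDesc
{
    int             natts;
    const MiniAttr *attrs;
} MiniTupleDesc;

/* Positional lookup: assumes attrs[attnum - 1] describes attnum. */
static const MiniAttr *
attr_by_position(const MiniTupleDesc *desc, int attnum)
{
    if (attnum < 1 || attnum > desc->natts)
        return NULL;
    return &desc->attrs[attnum - 1];
}

/* Search lookup: needed once array position and attnum no longer agree. */
static const MiniAttr *
attr_by_search(const MiniTupleDesc *desc, int attnum)
{
    for (int i = 0; i < desc->natts; i++)
    {
        if (desc->attrs[i].attnum == attnum)
            return &desc->attrs[i];
    }
    return NULL;
}

/* Full descriptor for a four-column relation (a, b, c, d). */
static const MiniAttr full_attrs[] = {{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}};
static const MiniTupleDesc full_desc = {4, full_attrs};

/* Descriptor narrowed to the projected columns b and d. */
static const MiniAttr proj_attrs[] = {{2, "b"}, {4, "d"}};
static const MiniTupleDesc proj_desc = {2, proj_attrs};
```

With these definitions, attr_by_position(&proj_desc, 2) lands on the
entry for column d (attnum 4) rather than b, and asking for attnum 4
falls outside the array, so any caller relying on the positional
relationship would need to switch to something like attr_by_search().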

It seems that with the proposed exec_init_scan_slot_tl() API, it would
have to be called after beginscan and before getnextslot to provide
the column projection list to the AM. The special dedicated API we
have for Zedstore to pass down the column projection list needs the
same calling convention, which is the reason I don't like it and am
trying to find an alternative. But at least the API we added for
Zedstore seems much simpler, more generic and more flexible in
comparison, as it lets the AM decide what it wishes to do with the
projection. The AM can fiddle with the slot's TupleDescriptor if it
wishes, or can handle the column projection some other way.
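
As a rough illustration of "lets the AM decide", the following
self-contained C sketch passes the projection to the AM as plain data.
Everything here is hypothetical: MiniScan, mini_set_column_projection()
and mini_column_needed() are made-up names, and representing the
projection as one boolean flag per relation attribute is an assumption
for the example, not a description of the actual Zedstore patch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical AM-private scan state; not the real TableScanDesc. */
typedef struct MiniScan
{
    int   natts;    /* number of attributes in the relation */
    bool *proj;     /* proj[attnum - 1] is true if the column is needed */
    int   nproj;    /* number of projected columns */
} MiniScan;

/*
 * Hypothetical AM callback: receives one flag per relation attribute.
 * The AM keeps its own copy and is free to use it however it likes;
 * the slot's tuple descriptor is left untouched.
 * Returns false on invalid input or allocation failure.
 */
static bool
mini_set_column_projection(MiniScan *scan, int natts, const bool *project_cols)
{
    bool *copy;

    if (natts <= 0 || project_cols == NULL)
        return false;

    copy = malloc(sizeof(bool) * (size_t) natts);
    if (copy == NULL)
        return false;
    memcpy(copy, project_cols, sizeof(bool) * (size_t) natts);

    free(scan->proj);           /* drop any earlier projection */
    scan->proj = copy;
    scan->natts = natts;
    scan->nproj = 0;
    for (int i = 0; i < natts; i++)
    {
        if (copy[i])
            scan->nproj++;
    }
    return true;
}

/* Attribute numbers stay relation-relative, so attnum keeps its meaning. */
static bool
mini_column_needed(const MiniScan *scan, int attnum)
{
    if (scan->proj == NULL || attnum < 1 || attnum > scan->natts)
        return false;
    return scan->proj[attnum - 1];
}
```

A caller would start from a zeroed MiniScan, call
mini_set_column_projection() once with the flags for the relation's
columns, and release scan->proj with free() at the end of the scan.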

> So this way the scan am handlers like getnextslot is passed a slot only
> having the relevant columns in the scan descriptor. One issue though is
> that the beginscan is not passed the slot, so if some memory allocation
> needs to be done based on the column list, it can't be done in beginscan.
> Let me know what you think.
>

Yes, ideally I would like to have this information available at
beginscan, if possible. But if it can't be, then it seems fine to delay
such allocations until the first call to getnextslot and friends;
that's how we do it today for Zedstore.
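
The deferred allocation described above can be sketched in a few lines
of self-contained C. This is only an illustration under assumptions:
LazyScan and the lazy_* functions are hypothetical names, and the
per-column state is reduced to a plain int array; it is not the
Zedstore code.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical scan state; the column count is unknown at begin time. */
typedef struct LazyScan
{
    int   ncols;        /* projected column count, set after begin */
    int  *colstate;     /* per-column state, allocated on first fetch */
    bool  initialized;  /* has the deferred allocation happened yet? */
} LazyScan;

/* Begin: no projection is known yet, so allocate nothing. */
static void
lazy_beginscan(LazyScan *scan)
{
    scan->ncols = 0;
    scan->colstate = NULL;
    scan->initialized = false;
}

/* Called between begin and the first fetch. */
static void
lazy_set_projection(LazyScan *scan, int ncols)
{
    scan->ncols = ncols;
}

/*
 * Fetch: the first call performs the allocation that begin could not.
 * Returns false if no projection was set or allocation failed.
 */
static bool
lazy_getnext(LazyScan *scan)
{
    if (!scan->initialized)
    {
        if (scan->ncols <= 0)
            return false;
        scan->colstate = calloc((size_t) scan->ncols, sizeof(int));
        if (scan->colstate == NULL)
            return false;
        scan->initialized = true;
    }
    /* A real AM would fetch the next tuple into the slot here. */
    return true;
}

/* End: release whatever the first fetch allocated. */
static void
lazy_endscan(LazyScan *scan)
{
    free(scan->colstate);
    scan->colstate = NULL;
    scan->initialized = false;
}
```

The cost of this pattern is one extra branch per fetch call, in
exchange for not needing the projection at begin time.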

[1]
https://www.postgresql.org/message-id/20190508214627.hw7wuqwawunhynj6%40alap3.anarazel.de
