Re: Pluggable storage

From: Andres Freund <andres(at)anarazel(dot)de>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Pluggable storage
Date: 2017-08-15 06:53:48
Message-ID: 20170815065348.afkj45dgmg22ecfh@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2017-06-13 11:50:27 +1000, Haribabu Kommi wrote:
> Here I attached WIP patches to support pluggable storage. The patch series
> are may not work individually. Still so many things are under development.
> These patches are just to share the approach of the current development.

Making a pass through the patchset to get a feel where this at, and
where this is headed. I previously skimmed the thread to get a rough
sense on what's discused, but not in a very detailed manner.

General:

- I think one important discussion we need to have is what kind of
performance impact we're going to accept introducing this. It seems
very likely that this'll cause some slowdown. We can kind of
alleviate that by doing some optimizations at the same time, but
nevertheless, this abstraction is going to cost.

- I don't think we should introduce this without a user besides
heapam. The likelihood that API will be usable by anything else
without a testcase seems fairly remote. I think some src/test/modules
type implementation of a per-session, in-memory storage - relatively
easy to implement - or such is necessary.

- I think, and detailed some of that, we're going to need some cleanups
that go in before this, to decrease the size / increase the quality of
the new APIs. It's going to get more painful to change APIs
subsequently.

- We'll have to document clearly that these APIs are going to change for
a while, even after the release introducing them.

StorageAm - Scan stuff:

- I think API isn't quite right. There's a lot of granular callback
functionality like scan_begin_function / scan_begin_catalog /
scan_begin_bm - these largely are convenience wrappers around the same
function, and I don't think that would, or rather should, change in
any abstracted storage layer. So I think there needs to be some
unification first (pretty close w/ beginscan_internal already, but
perhaps we should get rid of a few of these wrappers).

- Some of the exposed functionality, e.g. scan_getpage,
scan_update_snapshot, scan_rescan_set_params looks like it should just
be excised, i.e. there's no reason for it to exist.

- Minor: don't think the _function suffix for Storis necessary, just
makes things long, and every member has it. Besides that, it's also
easy to misunderstand - for a second I understood
scan_getnext_function to be about getting the next function...

- Scans are still represented as HeapScanDesc - I don't think that's
going to fly. Either this needs to be an opaque type (i.e. a struct
that's not defined, just forward declared), or it needs to be a base
struct that individual AMs embed in their own structs. Individual AMs
definitely are going to need different pieces of data.

Storage AM - tuple stuff:

- tuple_get_{xmin, updated_xid, cmin, itempointer, ctid, heaponly} are
each individual functions, that seems pretty painful to maintain, and
v/ likely to just grow and grow. Not sure what the solution is, but
this seems like a hard sell.

- The three *speculative* functions don't quite seem right to me, nor do
I understand:
+ *
+ * Setting a tuple's speculative token is a slot-only operation, so no need
+ * for a storage AM method, but after inserting a tuple containing a
+ * speculative token, the insertion must be completed by these routines:
+ */
I don't see anything related to slots, here?

Storage AM - slot stuff:

- I think there's a number of wrapper functions (slot_getattr,
slot_getallattrs, getsomeattrs, attisnull) around the same
functionality - that bloats the API and causes slowdowns. Instead we
need something like slot_virtualize_tuple(int16 upto), and the rest
should just be wrappers.

- I think it's wrong to have the slot functionality defined on the
StorageAm level. That'll cause even more indirect function calls (=>
slowness), and besides that the TupleTableSlot struct will need
members for each and every Am.

I think the solution is to instead have an Am level "create slot"
function, and the returned slot is allocated by the Am, with a base
member of TupleTableSlot with basically just tts_nvalid, tts_values,
tts_isnull as members. Those are the only members that can be
accessed without functions.

Access to the individual functions (say store_tuple) would then be
direct members of the TupleTableSlot interface. While that costs a bit
of memory, it removes one indirection from an already performance
critical path.

- MinimalTuples should be one type of slot for the above, except it's
not created by an StorageAm but by a function returning a
TupleTableSlot.

This should remove the need for the slot_copy_min_tuple,
slot_is_physical_tuple functions.

- Right now TupleTableSlots are an executor datastructure, but
these patches (rightly!) make it much more widely used. So I think it
needs to be moved outside of executor/, and probably renamed to
something like TupleHolder or something.

- The oid integration seems wrong - without an accessor oids won't be
queryable with this unless you break through the API. But from a
higher level view I do wonder if it's not time to remove "special" oid
columns and replace them with a normal column. We should be hesitant
enshrining crusty old concepts in new APIs.

Executor integration:

- I'm quite fearful that this'll cause slowdowns in a few tight paths.
The most likely cases here seem to be a) bitmap indexscans b)
indexscans c) highly selective sequential scans. I do wonder if
that can be partially addressed by switching out the individual
executor routines in the relevant scan nodes by something using or
similar to the infrastructure in cc9f08b6b8

Regards,

Andres

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-08-15 06:58:57 Generate wait event list and docs from text file
Previous Message Chris Travers 2017-08-15 06:24:37 Re: Orphaned files in base/[oid]