From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Cc: | Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Status of the table access method work |
Date: | 2019-04-05 20:25:38 |
Message-ID: | 20190405202538.vu7sffsdqqvytmt2@alap3.anarazel.de |
Lists: | pgsql-hackers |
Hi,
In this email I want to give a brief status update of the table access
method work - I assume that most of you sensibly haven't followed it
into all nooks and crannies.
I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
others that collaborated on making tableam happen. It was/is a huge
project.
I think what's in v12 - I don't know of any non-cleanup / bugfix work
pending for 12 - is a pretty reasonable initial set of features. It
allows reimplementing a heap-like storage without any core modifications
(except WAL logging, see below); it is not sufficient to implement a
good index-oriented table AM. It does not allow storing the catalog in
a non-heap table.
The tableam interface itself doesn't care that much about how the AM
internally stores data. Most of the API (sequential scans, index
lookups, insert/update/delete) doesn't know about blocks, and only
indirectly & optionally about buffers (via BulkInsertState). There are a
few callbacks / functions that do care about blocks, because it's not
clear how to remove the dependency, or it would have been too much work.
These currently are:
- ANALYZE integration - currently the sampling logic is tied to blocks.
- index build range scans - the range is defined as blocks
- planner relation size estimate - but that could trivially just be
filled with size-in-bytes / BLCKSZ in the callback.
- the (optional) bitmap heap scan API - that's fairly intrinsically
block based. An AM could just internally subdivide TIDs in a different
way, but I don't think a bitmap scan like we have would e.g. make a
lot of sense for an index oriented table without any sort of stable
tid.
- the sample scan API - tsmapi.h is block based, so the tableam.h API is
as well.
I think none of these are limiting in a particularly bad way.
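
To illustrate the relation size estimate point, here is a minimal sketch of how an AM that isn't block based could derive a page count from its size in bytes. The helper name is hypothetical, BLCKSZ is hardcoded to the default 8192 as an assumption (real builds take it from configure), and rounding up rather than truncating is a choice made here, not something tableam prescribes.

```c
#include <stdint.h>

/* Assumption: the default PostgreSQL block size; real builds define BLCKSZ. */
#define BLCKSZ 8192

/*
 * Hypothetical helper: turn a storage size in bytes into a page count.
 * Rounds up, so that a small but non-empty relation never reports 0 pages.
 */
static uint32_t
demo_bytes_to_pages(uint64_t size_in_bytes)
{
    return (uint32_t) ((size_in_bytes + BLCKSZ - 1) / BLCKSZ);
}
```

An AM's size estimate callback could feed a value like this to the planner in place of a real block count.
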
The most constraining factor for storage, I think, is that currently the
API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
byte tuple identifier). One can implement scans, and inserts into
index-less tables without providing that, but no updates, deletes etc.
One reason for that is that it'd just have required more changes to
executor etc to allow for wider identifiers, but the primary reason is
that indexes currently simply don't support anything else.
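
For readers who haven't looked at the TID layout: the sketch below is modeled on the shape of ItemPointerData (a 32 bit block number stored as two 16 bit halves, plus a 16 bit offset). The Demo* names and helper functions are made up for illustration; this is not the actual header.

```c
#include <stdint.h>

/* Block number split into two 16 bit halves, so the struct needs only 2 byte alignment. */
typedef struct DemoBlockId
{
    uint16_t    bi_hi;
    uint16_t    bi_lo;
} DemoBlockId;

/* 4 bytes of block number + 2 bytes of offset = 6 bytes in total. */
typedef struct DemoItemPointer
{
    DemoBlockId ip_blkid;
    uint16_t    ip_posid;
} DemoItemPointer;

_Static_assert(sizeof(DemoItemPointer) == 6, "TID is expected to be 6 bytes");

/* Hypothetical helper: build a TID from a block number and an offset. */
static DemoItemPointer
demo_tid_make(uint32_t block, uint16_t offset)
{
    DemoItemPointer tid;

    tid.ip_blkid.bi_hi = (uint16_t) (block >> 16);
    tid.ip_blkid.bi_lo = (uint16_t) (block & 0xffff);
    tid.ip_posid = offset;
    return tid;
}

/* Hypothetical helper: reassemble the 32 bit block number. */
static uint32_t
demo_tid_block(DemoItemPointer tid)
{
    return ((uint32_t) tid.ip_blkid.bi_hi << 16) | tid.ip_blkid.bi_lo;
}
```

With only 48 bits available there is no room left to encode anything else, e.g. a wider key or a partition identifier.
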
I think this is, by far, the biggest limitation of the API. If one
e.g. wanted to implement a practical index-organized table, the 6 byte
TID size obviously would become a problem very quickly. I suspect
that we're going to want to get rid of that limitation in indexes before
long for other reasons too, e.g. to allow global indexes (which'd need to
encode the partition somewhere).
With regards to storing the rows themselves, the second biggest
limitation is not actually a part of tableam itself: WAL. Many table
AMs would want to use WAL, but we only have
extensible WAL as part of generic_xlog.h. While that's useful to allow
prototyping etc, it's imo not efficient enough to build a competitive
storage engine for OLTP (OLAP probably much less of a problem). I don't
know what the best approach here is - allowing "well known" extensions
to register rmgr entries would be the easiest solution, but it's
certainly a bit crummy.
Currently there's some, fairly minor, requirement that TIDs are actually
unique when not using a snapshot qualifier. That's currently only
relevant for GetTupleForTrigger(), AfterTriggerSaveEvent() and
EvalPlanQualFetchRowMarks(), which use SnapshotAny. That prevents AMs
from implementing in-place updates (thus a problem e.g. for zheap).
I've a patch that fixes that, but it's too hacky for v12 - there's not
always a convenient snapshot to fetch a row (e.g. in
GetTupleForTrigger() after EPQ the row isn't visible to
es_snapshot).
A second set of limitations is around making more of tableam
optional. Right now it e.g. is not possible to have an AM that doesn't
implement insert/update/delete. Obviously an AM can just throw an error
in the relevant callbacks, but I think it'd be better if we made those
callbacks optional, and threw errors at parse-analysis time (both to
make the errors consistent, and to ensure it's consistently thrown,
rather than only when e.g. an UPDATE actually finds a row to update).
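
A rough sketch of what optional callbacks could look like: a NULL function pointer means "not supported", and a single capability check can run before execution starts. Everything here (the Demo* types, the simplified callback signatures, demo_am_supports) is hypothetical and heavily simplified; it is not the tableam.h API.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum DemoCmd
{
    DEMO_CMD_SELECT,
    DEMO_CMD_INSERT,
    DEMO_CMD_UPDATE,
    DEMO_CMD_DELETE
} DemoCmd;

/* Hypothetical AM routine table; a NULL callback means "not implemented". */
typedef struct DemoTableAm
{
    const char *name;
    void        (*scan) (void);
    void        (*tuple_insert) (void);
    void        (*tuple_update) (void);
    void        (*tuple_delete) (void);
} DemoTableAm;

/* Placeholder standing in for a real callback implementation. */
static void
demo_noop(void)
{
}

/* Example AM that can be scanned, but offers no insert/update/delete. */
static const DemoTableAm demo_readonly_am = {
    "demo_readonly", demo_noop, NULL, NULL, NULL
};

/*
 * Check that could run once, at parse-analysis time, so the error does
 * not depend on e.g. an UPDATE actually finding a row to update.
 */
static bool
demo_am_supports(const DemoTableAm *am, DemoCmd cmd)
{
    switch (cmd)
    {
        case DEMO_CMD_SELECT:
            return am->scan != NULL;
        case DEMO_CMD_INSERT:
            return am->tuple_insert != NULL;
        case DEMO_CMD_UPDATE:
            return am->tuple_update != NULL;
        case DEMO_CMD_DELETE:
            return am->tuple_delete != NULL;
    }
    return false;
}
```
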
Currently foreign keys are allowed between tables of different types of
AM. I am wondering whether we ought to allow AMs to forbid being
referenced. If e.g. an AM has lower consistency guarantees than the AM
of the table referencing it, it might be preferable to forbid
that. OTOH, I guess such an AM could just require UNLOGGED to be used.
Another restriction is actually related to UNLOGGED - currently the
UNLOGGED processing after crashes works by recognizing init forks by
file name. But what if e.g. the storage isn't inside postgres files? Not
sure if we actually can do anything good about that.
The last issue I know about is that nodeBitmapHeapscan.c and
nodeIndexonlyscan.c currently directly access the visibilitymap. Which
means if an AM doesn't use the VM, they're never going to use the
optimized path. And conversely if the AM uses the VM, it needs to
internally map tids in a way compatible with heap. I strongly suspect
that we're going to have to fix this quite soon.
It'd be a pretty significant amount of work to allow storing catalogs in
a non-heap table. One difficulty is that there are just a lot of direct
accesses to catalogs via heapam.h APIs - while it's a significant amount
of work to "fix" that, it's probably not very hard for each individual
site. There are a few places that rely on heap internals (checking xmin
for invalidation and the like). I think the biggest issue however would
be the catalog bootstrapping - to be able to read pg_am, we obviously
need to go through relcache.c's bootstrapping, and that only works
because we hardcode what those tables look like. I personally don't
think it's a particularly important issue to work on, nor am I convinced
that there'd be buy-in to make the necessary extensive changes.
Greetings,
Andres Freund