Re: Pluggable storage

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Pluggable storage
Date: 2017-07-16 00:58:40
Message-ID: CAH2-Wz=wdsyR1cK0GNp2jHa8g2tU_vvb_9OnP=HTy7Vp_BRCmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Jul 15, 2017 at 3:36 PM, Alexander Korotkov
<a(dot)korotkov(at)postgrespro(dot)ru> wrote:
> I think that pruning and vacuum are artifacts of not only current heap
> formats, but they are also artifacts of current index AM API. And this is
> more significant circumstance given that we're going to preserve
> compatibility of new storage engines with current index AMs. Our current
> index AM API assumes that we can delete from index only in bulk manner. Our
> payload to index key is TID, not arbitrary piece of data. And that payload
> can't be updated.

I agree that this is a big set of problems. This is where I think we
can get the most benefit.

One nice thing about having a fully logical TID is that you don't have
to bloat indexes just because a HOT update was not possible. You bloat
something else instead, certainly, but that can be optimized for
garbage collection. Not all bloat is equal. MVCC based on UNDO can be
very fast because UNDO is very well optimized for garbage collection,
and so can be bloated with no long term consequences, and minor short
term consequences. Still, it isn't fair to blame the Postgres VACUUM
design for the fact that Postgres bloats indexes so easily. Nothing
stops us from adding a new layer of indirection, so that bloating an
index degrades into something that is not even as bad as bloating the
heap [1]. We may just have a data structure problem, which is not
nearly the same thing as a fundamental design problem in the storage
layer.

This approach could pay off soon if we start with unique indexes,
where there is "queue" of row versions that can be pruned with simple
logic, temporal locality helps a lot, only zero or one versions can be
visible to your MVCC snapshot, etc. This might require only minimal
revisions the index AM API, to help nbtree. We could later improve
this so that you bloat UNDO instead of bloating a heap-like structure,
both for indexes and for the actual heap. That seems less urgent.

To repeat myself, for emphasis: *Not all bloat is equal*. Index bloat
makes the way a B-Tree's keyspace is split up far too granular, making
pages sparely packed, a problem that is more or less *irreversible* by
VACUUM or any garbage collection process [2]. That's how B-Trees work
-- they're optimized for concurrency, not for garbage collection.

>> InnoDB isn't much like the PostgreSQL heap, and
>> neither is SQL Server, IIUC. If we're creating a heap format that can
>> only be different in trivial ways from what we have now, and anything
>> that really changes the paradigm is off-limits, well, that's not
>> really interesting enough to justify the work of creating a heap
>> storage API.
>
>
> My concern is that we probably can't do anything that really changes
> paradigm while preserving compatibility with index AM API. If you don't
> agree with that, it would be good to provide some examples. It seems
> unlikely for me that we're going to have something like InnoDB or SQL Server
> table with our current index AM API. InnoDB utilizes index-organized tables
> where primary and secondary indexes are versioned independently. SQL Server
> utilizes flat data structure similar to our heap, but MVCC implementation
> also seems very different.

I strongly agree. I simply don't understand how you can adopt UNDO for
MVCC, and yet expect to get a benefit commensurate with the effort
without also implementing "retail index tuple deletion" first.
Pursuing UNDO this way has the same problem that WARM likely has -- it
doesn't really help with the worst case, where users get big,
unpleasant surprises. Postgres is probably the only major database
system that doesn't support retail index tuple deletion. It's a basic
thing, that has nothing to do with MVCC. Really, what do we have to
lose?

The biggest weakness of the current design is IMV how it fails to
prevent index bloat in the first place, but avoiding bloating index
leaf pages in the first place doesn't seem incompatible with how
VACUUM works. Or at least, let's not assume that it is. We should
avoid throwing the baby out with the bathwater.

> I think in general there are two ways dealing with out index AM API
> limitation. One of them is to extend index AM API. At least, we would need
> a method for deletion of individual index tuple (for sure, we already have
> kill_prior_tuple but it's just a hint for now).

kill_prior_tuple can work well, but, like HOT, it works
inconsistently, in a way that is hard to predict.

> Also, it would be nice to
> have arbitrary payload to index tuples instead of TID, and method to update
> that payload. But that would be quite big amount of work. Alternatively,
> we could allow pluggable storages to have their own index AMs, and that will
> move this amount of work to the pluggable storage side.

I agree with Robert that being able to store an arbitrary payload as a
TID is probably not going to ever work very well. However, I don't
think that that's a reason to give up on the underlying idea: creating
a new layer of indirection for secondary indexes, that allows updates
that now have to create new index tuples to instead just update the
indirection layer metadata.

You can create a mapping of a TID-like 6 byte integer to a primary key
value. Techniques exist. That seems a lot more practical. Of course,
TID is what is sometimes called a "physiological" identifier -- it has
a "physical" component (block number) and "logical" component (item
offset). Nothing I can think of prevents us from creating an
alternative, entirely logical identifier that fits in the same 6
bytes. It can map to a versioning indirection layer, for unique
indexes, or to a primary key value, for secondary indirect indexes.

[1] postgr.es/m/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1NGQ@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Key_normalization#VACUUM_and_nbtree_page_deletion
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Janes 2017-07-16 01:47:22 Re: SCRAM auth and Pgpool-II
Previous Message Julien Rouhaud 2017-07-15 23:59:40 Re: segfault in HEAD when too many nested functions call