Re: Proposal for CSN based snapshots

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal for CSN based snapshots
Date: 2013-06-07 12:50:31
Message-ID: CA+CSw_uSzXidemci-nU5fJY-qXD0DE0ZWz-j1ASTT_ffBjXo9Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 7, 2013 at 2:59 PM, Markus Wanner <markus(at)bluegap(dot)ch> wrote:
>> To refresh your memory the basic idea is to change visibility
>> determination to be based on a commit sequence number (CSN for short)
>> - an 8 byte number, incremented on every commit, representing the total
>> ordering of commits. To take a snapshot in this scheme you only need
>> to know the value of last assigned CSN, all transactions with XID less
>
> You mean with a *CSN* less than or equal to that number? There's
> certainly no point in comparing a CSN with an XID.

That was what I meant. I guess my coffee hadn't kicked in yet there.

>> than or equal to that number were committed at the time of the
>> snapshot; everything above wasn't committed. Besides speeding up
>> snapshot taking, this scheme can also be a building block for
>> consistent snapshots in a multi-master environment with minimal
>> communication.
>
> Agreed. Postgres-R uses a CommitOrderId, which is very similar in
> concept, for example.

Do you think having this snapshot scheme would be helpful for Postgres-R?
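For concreteness, here is a minimal sketch of what commit and snapshot taking reduce to under this scheme. All names are invented for illustration, and the real patch would of course use atomic operations on shared memory rather than a plain counter:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;

typedef struct { CommitSeqNo snapshot_csn; } CSNSnapshot;

/* Illustrative shared counter; the real patch would access it atomically. */
static CommitSeqNo shared_last_assigned_csn = 0;

/* Commit: bump the counter; the new value stamps the transaction. */
static CommitSeqNo
AssignCommitCsn(void)
{
    return ++shared_last_assigned_csn;
}

/* Taking a snapshot is a single read: every transaction that committed
 * with a CSN at or below this value is visible, everything above is not. */
static CSNSnapshot
GetCsnSnapshot(void)
{
    return (CSNSnapshot){ shared_last_assigned_csn };
}
```

Compare with GetSnapshotData today, which has to scan the whole proc array under a lock.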

>> Because snapshots can remain active for an unbounded amount of time
>> and there can be an unbounded number of active snapshots, even the sparse
>> mapping can fill up.
>
> I don't think the number of active snapshots matters - after all, they
> could all refer the same CSN. So that number shouldn't have any
> influence on the size of the sparse mapping.
>
> What does matter is the number of transactions referenced by such a
> sparse map. You are of course correct in that this number is equally
> unbounded.

Yes, that is what I meant to say but for some reason didn't.

>> To handle that case, each backend advertises its
>> lowest snapshot number csn_min. When values need to be evicted from
>> the sparse mapping, they are evicted in CSN order and written into the
>> CSN log - a series of CSN-XID pairs. Backends that may still be
>> concerned about those values are then notified that values that they
>> might need to use have been evicted. Backends with invalidated
>> snapshots can then convert their snapshots to regular
>> concurrent-XID-list snapshots at their leisure.
>>
>> To convert a CSN based snapshot to XID based, a backend would first
>> scan the shared memory structures for xids up to snapshot.xmax for
>> CSNs that are concurrent to the snapshot and insert the XIDs into the
>> snapshot, then read in the CSN log starting from the snapshot's CSN,
>> looking for XIDs less than the snapshot's xmax. After this the
>> snapshot can be handled like current snapshots are handled.
>
> Hm, I dislike the requirement to maintain two different snapshot formats.
>
> Also mind that snapshot conversions - however unlikely you choose to
> make them - may well result in bursts as multiple processes may need to
> do such a conversion, all starting at the same point in time.

I agree that two snapshot formats is not great. On the other hand, the
additional logic is confined to XidInMVCCSnapshot and is reasonably
simple. If we didn't convert the snapshots we would have to keep
spooling out the CSN log and looking up XIDs for each visibility check.
We could add a cache for XIDs that were deemed concurrent, but that is
in effect just lazily constructing the same data structure. The work
needed to convert is reasonably well bounded, can be done without
holding global locks, and in most circumstances should only be
necessary for snapshots that are used for a long time, which will
amortize the cost. I'm not worried about the bursts because the
conversion is done lock free, and starting at the same point in time
leads to better cache utilization.
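To illustrate how the two formats can coexist in one check, here is a sketch with invented names; the real XidInMVCCSnapshot would consult the shared CSN map and a sorted xip array rather than the toy structures used here. The convention follows the current code: the function returns true when the xid should be treated as still in progress, i.e. invisible:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

/* Toy stand-in for the shared sparse/dense CSN map: csn_by_xid[xid],
 * 0 meaning "no CSN stamped" (in progress or unknown). Here xid 3
 * committed at CSN 5 and xid 5 at CSN 9. */
static const CommitSeqNo csn_by_xid[16] = { [3] = 5, [5] = 9 };

typedef struct
{
    CommitSeqNo    snapshot_csn; /* valid while still CSN based */
    bool           converted;    /* set once eviction forces conversion */
    TransactionId  xmax;
    TransactionId *xip;          /* xids concurrent with the snapshot */
    int            xcnt;
} Snapshot;

/* Returns true when xid is to be treated as in progress (invisible). */
static bool
XidInMVCCSnapshot(TransactionId xid, const Snapshot *snap)
{
    if (!snap->converted)
    {
        /* CSN based: committed before the snapshot iff stamped with a
         * CSN at or below the snapshot's CSN */
        CommitSeqNo csn = xid < 16 ? csn_by_xid[xid] : 0;
        return csn == 0 || csn > snap->snapshot_csn;
    }

    /* converted: the familiar xmax + concurrent-xid-list check */
    if (xid >= snap->xmax)
        return true;
    for (int i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}

/* Example snapshots used below */
static const Snapshot csn_snap = { .snapshot_csn = 7 };
static TransactionId concurrent_xids[] = { 4 };
static const Snapshot converted_snap = {
    .converted = true, .xmax = 10, .xip = concurrent_xids, .xcnt = 1
};
```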

>> Rolling back a transaction
>> --------------------------
>>
>> Rolling back a transaction is basically the same as committing, but
>> the CSN slots need to be stamped with an AbortedCSN.
>
> Is that really necessary? After all, an aborted transaction behaves
> pretty much the same as a transaction in progress WRT visibility: it's
> simply not visible.
>
> Or why do you need to tell apart aborted from in-progress transactions
> by CSN?

I need to detect aborted transactions so they can be discarded during
the eviction process; otherwise the sparse array will fill up. They
could also be filtered out by cross-referencing uncommitted slots with
the procarray, but having the abort case do some additional work to
make xid assignment cheaper looks like a good tradeoff.
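The role of the AbortedCSN stamp can be sketched like this. The sentinel values are illustrative, not what the patch will necessarily use; the point is only that eviction can tell commits apart from aborts without consulting the procarray:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CommitSeqNo;

/* Illustrative sentinel values; the actual patch may pick others. */
#define UNCOMMITTED_CSN  ((CommitSeqNo) 0)   /* slot never stamped */
#define ABORTED_CSN      ((CommitSeqNo) 1)   /* stamped on rollback */
#define FIRST_NORMAL_CSN ((CommitSeqNo) 2)   /* real commits start here */

/* During eviction in CSN order, only slots stamped with a real commit
 * CSN get written to the CSN log; aborted (and unstamped) slots are
 * simply discarded, so aborts can't fill the log. */
static bool
slot_needs_csn_log_entry(CommitSeqNo slot_csn)
{
    return slot_csn >= FIRST_NORMAL_CSN;
}
```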

>> Sparse buffer needs to be at least big enough to fit CSN slots for the
>> xids of all active transactions and non-overflowed subtransactions. At
>> the current value PGPROC_MAX_CACHED_SUBXIDS=64, the minimum comes out
>> at 16 bytes * (64 + 1) slots * 100 backends = 101.6KB per buffer,
>> or 203KB total in the default configuration.
>
> A CSN is 8 bytes, the XID 4, resulting in 12 bytes per slot. So I guess
> the given 16 bytes includes alignment to 8 byte boundaries. Sounds good.

8 byte alignment for CSNs is needed for atomic access, if nothing
else. I think the size could be cut in half by using a base value for
CSNs, if we assume that no xid is active for longer than 2 billion
transactions, as is currently the case. I didn't want to include that
complication in the first iteration, so I haven't verified whether it
would have any gotchas.
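The sizing arithmetic quoted above checks out; here is the back-of-envelope calculation spelled out, with illustrative constant names:

```c
#include <assert.h>
#include <stddef.h>

/* Figures from the sizing discussion above (names are illustrative). */
enum
{
    SLOT_BYTES = 16,                 /* 4-byte xid + 8-byte CSN, padded */
    PGPROC_MAX_CACHED_SUBXIDS = 64,  /* per-backend cached subxids */
    MAX_BACKENDS = 100               /* default max_connections */
};

/* One slot per top-level xid plus one per cached subtransaction. */
static size_t
sparse_buffer_min_bytes(void)
{
    return (size_t) SLOT_BYTES * (PGPROC_MAX_CACHED_SUBXIDS + 1) * MAX_BACKENDS;
}
```

That gives 104,000 bytes, i.e. the 101.6KB per buffer from the mail.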

>> Performance discussion
>> ----------------------
>>
>> Taking a snapshot is extremely cheap in this scheme, I think the cost
>> will be mostly for publishing the csn_min and rechecking validity after
>> it. Taking snapshots in the shadow of another snapshot (often the case
>> for the proposed MVCC catalog access) will be even cheaper as we don't
>> have to advertise the new snapshot. The delayed XID based snapshot
>> construction should be unlikely, but even if it isn't the costs are
>> similar to GetSnapshotData, but we don't need to hold a lock.
>>
>> Visibility checking will also be simpler as for the cases where the
>> snapshot is covered by the dense array it only requires two memory
>> lookups and comparisons.
>
> Keep in mind, though, that both of these lookups are into shared memory.
> Especially the dense ring buffer may well turn into a point of
> contention. Or at least the cache lines holding the most recent XIDs
> within that ring buffer.
>
> Where as currently, the snapshot's xip array resides in process-local
> memory. (Granted, often enough, the proc array also is a point of
> contention.)

Visibility checks are done lock free, so they don't cause any
contention, and the number of times each cache line can be invalidated
is bounded by 8. Overall, I think actual performance tests are needed
to see whether there is a problem, or perhaps whether having the data
shared actually helps with cache hit rates.
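The "two memory lookups and comparisons" claim for dense-array checks can be sketched as follows. Layout and names are invented for illustration; in particular the sketch treats anything older than the ring as committed, where real code would consult the clog:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

/* Illustrative dense ring: commit CSNs indexed by xid modulo the ring
 * size, 0 meaning "no commit recorded". Here xid 1 committed at CSN 10
 * and xid 3 at CSN 20. */
#define DENSE_SLOTS 8

static const CommitSeqNo dense_csn[DENSE_SLOTS] = { [1] = 10, [3] = 20 };
static const TransactionId dense_xmin = 0; /* oldest xid the ring covers */

/* When the snapshot is covered by the dense array, the check is one
 * bounds comparison plus one slot read. Returns true when xid committed
 * with a CSN at or below snapshot_csn. */
static bool
dense_xid_visible(TransactionId xid, CommitSeqNo snapshot_csn)
{
    if (xid < dense_xmin)
        return true;   /* sketch only: assume pre-ring xids committed */
    CommitSeqNo csn = dense_csn[xid % DENSE_SLOTS];
    return csn != 0 && csn <= snapshot_csn;
}
```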

>> At this point I don't see any major issues with this approach. If the
>> ensuing discussion doesn't find any major showstoppers then I will
>> start to work on a patch bit-by-bit.
>
> Bit-by-bit? Reminds me of punched cards. I don't think we accept patches
> in that format. :-)
>
>> we have a small one in the family.
>
> Congratulations on that one.

Thanks,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
