INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date: 2013-09-09 05:21:47
Message-ID: CAM3SWZThwrKtvurf1aWAiH8qThGNMZAfyDcNw8QJu7pqHk5AGQ@mail.gmail.com
Lists: pgsql-hackers

The attached patch implements INSERT...ON DUPLICATE KEY LOCK FOR
UPDATE. This is similar to INSERT...ON DUPLICATE KEY IGNORE (which is
also proposed as part of this new revision of the patch), but
additionally acquires an exclusive row lock on the existing row that
prevents insertion from proceeding for some tuple proposed for
insertion.
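For comparison, the IGNORE variant simply skips tuples whose insertion
a duplicate would prevent. A minimal sketch of the proposed syntax,
assuming the foo table created below, with a row where a = 5 already
present:

postgres=# insert into foo(a,b)
values (5, '!')
on duplicate key ignore;
INSERT 0 0

Nothing is inserted, and no error is raised.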

This feature offers something that I believe could be reasonably
described as upsert. Consider:

postgres=# create table foo(a int4 primary key, b text);
CREATE TABLE
postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 0

Here there are 0 rows affected by the update, because all work was
done in the insert. If I do it again, 2 rows are affected by the
update:

postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;
UPDATE 2

Obviously, rejects were now projected into the wCTE, and the
underlying rows were locked. The idea is that we can update the rows,
confident that each rejection-causing row will be updated in a
race-free fashion. I personally prefer this to something like MySQL's
INSERT...ON DUPLICATE KEY UPDATE, because it's more flexible. For
example, we could have deleted the locked rows instead, if that
happened to make sense. Making this kind of usage idiomatic feels to
me like the Postgres way to do upsert. Others may differ here. I will
however concede that it'll be unfortunate to not have some MySQL
compatibility, for the benefit of people porting widely used web
frameworks.
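To illustrate that flexibility, here is a sketch of the delete-instead
variant, reusing the same proposed syntax and the foo table from above
(with both rows already present):

postgres=# with r as (
insert into foo(a,b)
values (5, '!'), (6, '@')
on duplicate key lock for update
returning rejects *
)
delete from foo using r where foo.a = r.a;
DELETE 2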

I'm not really sure if I should have done something smarter here than
lock the first duplicate found, or if it's okay that that's all I do.
That's another discussion entirely. Though previously Andres and I did
cover the question of prioritizing unique indexes, so that the most
sensible duplicate for the particular situation was returned,
according to some criteria.
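To see why that matters, consider a hypothetical schema (not from the
patch's tests) with two unique indexes. A proposed row of (1, 'y')
conflicts with (1, 'x') via the primary key and with (2, 'y') via the
unique index on b, so there are two candidate rows to lock; as the
patch stands, whichever duplicate is found first is the one locked:

postgres=# create table bar(a int4 primary key, b text unique);
CREATE TABLE
postgres=# insert into bar values (1, 'x'), (2, 'y');
INSERT 0 2
postgres=# insert into bar values (1, 'y')
on duplicate key lock for update
returning rejects *;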

As previously covered, I felt that including a row locking component
was essential to reasoning about our requirements for what I've termed
"speculative insertion" -- the basic implementation of value locking
that is needed to make all this work. As I said in that earlier
thread, there are many opinions about this, and it isn't obvious which
one is right. Any approach needs to have its interactions with row
locking considered right up front. Those who consider this a new
patch with new functionality, or even a premature expansion on what
I've already posted, should carefully consider that. Do we really want
to assume that these two things are orthogonal? I think that they're
probably not, but even if that turns out not to be the case, it's an
unnecessary risk to take.

Row locking
==========

Row locking is implemented with calls to a new function above
ExecInsert. We don't bother with the usual EvalPlanQual looping
pattern for now, preferring to just re-check from scratch if there is
a concurrent update from another session (see comments in
ExecLockHeapTupleForUpdateSpec() for details). We might do better
here. I haven't considered the row locking functionality in too much
detail since the last revision, preferring to focus on value locking.
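Schematically, the hazard handled there looks like this (an
illustration only, not an isolation test from the patch):

-- Session 1: tries to upsert a = 5, finds a duplicate, and goes to
-- lock the existing row:
with r as (
insert into foo(a,b)
values (5, 'new')
on duplicate key lock for update
returning rejects *
)
update foo set b = r.b from r where foo.a = r.a;

-- Session 2: commits an update of the same row after session 1 found
-- the duplicate but before it acquired the row lock:
update foo set b = 'other' where a = 5;

-- Session 1 detects the concurrent update and, rather than entering
-- an EvalPlanQual loop, simply re-checks from scratch.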

Buffer locking/value locking
======================

Andres raised concerns about the previous patch's use of exclusive
buffer locks for extended periods (i.e. during a single heap tuple
insertion). These locks served as extended value locks. With this
revision, we don't hold exclusive buffer locks for the duration of
heap insertion - we hold shared buffer locks instead. I believe that
Andres' principal concern was the impact on concurrent index scans by
readers, so I think that all of this will go some way towards
alleviating his concerns.

This necessitated inventing entirely new LWLock semantics around
"weakening" (from exclusive to shared) and "strengthening" (from
shared to exclusive) of locks already held. Of course, as you'd
expect, there are some tricky race hazards surrounding these new
functions that clients need to be mindful of. These have been
documented within lwlock.c.

I looked for a precedent for these semantics, and found a few. Perhaps
the most prominent was Boost, a highly regarded, peer-reviewed C++
library. Boost implements exactly these semantics for some of its
thread synchronization/mutex primitives:

http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.mutex_concepts.upgrade_lockable

They have a concept of upgradable ownership, which is just like shared
ownership, except, I gather, that the owner reserves the exclusive
right to upgrade to an exclusive lock (for them it's not quite an
exclusive lock; it's an upgradable/downgradable exclusive lock). My
solution is to push that responsibility onto the client - the comments
admonish something along the lines of "don't let more than one shared
locker do this at a time per LWLock". I am of course mindful of this
caveat in my modifications to the btree code, where I "weaken" and
later "strengthen" an exclusive lock - the trick is that I acquire a
regular exclusive lock up front, and only actually weaken it when
going ahead with insertion.

I suspect that this may not be the only place where this trick is helpful.

This intended usage is described in the relevant comments added to lwlock.c.

Testing
======

This time around, in order to build confidence in the new LWLock
infrastructure for buffer locking, on debug builds we re-verify that
the value proposed for insertion on the locked page is in fact not on
that page as expected during the second phase, and that our previous
insertion point calculation is still considered correct. This is kind
of like the way we re-verify the wait-queue-is-in-LSN-order
invariant in syncrep.c on debug builds. It's really a fancier
assertion - it doesn't just test the state of scalar variables.

This was invaluable during development of the new LWLock infrastructure.

Just as before, but this time with only shared buffer locks held
during heap tuple insertion, the patch has resisted considerable
brute-force efforts to break it: using pgbench to get many sessions
speculatively inserting values into a table, running many different
INSERT...ON DUPLICATE KEY LOCK FOR UPDATE statements interspersed
with UPDATE, DELETE and SELECT statements, and watching for spurious
duplicate tuple insertions, deadlocks, and assertion failures.
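The workload scripts were along these lines (a hypothetical
reconstruction using pgbench's \setrandom and a custom script passed
with -f, not the actual scripts; pgbench of this era requires each SQL
command to fit on one line):

\setrandom k 1 100
with r as (insert into foo(a,b) values (:k, 'stress') on duplicate key lock for update returning rejects *) update foo set b = r.b from r where foo.a = r.a;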

As always, isolation tests are included.

Bugs
====

I fixed the bug that Andres reported in relation to multiple unique
indexes' interaction with waits for another transaction's end during
speculative insertion.

I did not get around to fixing the broken ecpg regression tests, as
reported by Peter Eisentraut. I was a little puzzled by the problem
there. I'll return to it in a while, or perhaps someone else can
propose a solution.

Thoughts?

--
Peter Geoghegan

Attachment Content-Type Size
insert_on_dup.v2.2013_09_08.patch.gz application/x-gzip 33.4 KB
