Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date: 2013-12-23 22:59:31
Message-ID: CAM3SWZSOdUmg4899tJc09R2uoRTYhb0VL9AasC1Fz7AW4GsR-g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 23, 2013 at 7:49 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I don't think this is a project to rush through. We've lived without
> MERGE/UPSERT for several years now, and we can live without it for
> another release cycle while we try to reach agreement on the way
> forward. I can tell that you're convinced you know the right way
> forward here, and you may be right, but I don't think you've convinced
> everyone else - maybe not even anyone else.

That may be. Attention from reviewers has been in relatively short
supply - though that is more or less always the case.

> I wouldn't suggest modeling anything you do on the way hash indexes
> using heavyweight locks. That is a performance disaster, not to
> mention being basically a workaround for the fact that whoever wrote
> the code originally didn't bother figuring out any way that splitting
> a bucket could be accomplished in a crash-safe manner, even in theory.
> If it weren't for that, we'd be using buffer locks there.

Having looked at the code for the first time recently, I'd agree that
hash indexes are a disaster. A major advantage of the Lehman and Yao
algorithm, as prominently noted in the paper, is that exclusive locks
are acquired only on leaf pages, which increases concurrency. Since I
only propose to extend this to a heavyweight page lock, and still only
for an instant, it seems reasonable to assume that the performance
will be acceptable for an initial version of this. It's not as if most places
will have to pay any heed to this heavyweight lock - index scans and
non-upserting inserts are generally unaffected. We can later optimize
performance as we measure a need to do so. Early indications are that
the performance is reasonable.

Holding value locks for more than an instant doesn't make sense. The
reason is simple: when upserting, we're tacitly only really looking
for violations on one particular unique index. We just lock them all
at once because the implementation doesn't know *which* unique index
will have the conflict. So in actuality, it's really no different from
the existing potential-violation handling for unique indexes, except
that we have to do some extra work in addition to the usual
restart-from-scratch stuff (iff we have multiple unique indexes).

> To be honest, I am still not altogether sold on any part of this
> feature. I don't like the fact that it violates MVCC - although I
> admit that some form of violation is inevitable in any feature in this
> area unless we're content to live with many serialization failures, I
> don't like the particular way it violates MVCC

Discussions around visibility issues have not been very useful. As
I've said, I don't like the term "MVCC violation", because it's
suggestive of some classical, codified definition of MVCC, a
definition that doesn't actually exist anywhere, even in research
papers, AFAICT. So while I understand your concerns around the
modifications to HeapTupleSatisfiesMVCC(), and while I respect that we
need to be mindful of the performance impact, my position is that if
that really is what we need to do, we might as well be honest about
it, and express intent succinctly and directly. This is a position
that is orthogonal to the proposed syntax, even if that is convenient
to my patch. It's already been demonstrated that yes, the MVCC
violation can be problematic when we call HeapTupleSatisfiesUpdate(),
which is a bug that was fixed by making another modest modification to
HeapTupleSatisfiesUpdate(). It is notable that that bug would have
still occurred had a would-be-HTSMVCC-invisible tuple been passed
through any other means. What problem, specifically, do you envisage
avoiding by doing it some other way? What other way do you have in
mind?

We invested huge effort into more granular FK locking when we had a
few complaints about it. I wouldn't be surprised if that effort
modestly regressed HeapTupleSatisfiesMVCC(). On the other hand, this
feature has been in very strong demand for over a decade, and has a
far smaller code footprint. I don't want to denigrate the FK locking
stuff in any way - it is a fantastic piece of work - but it's
important to have a sense of proportion about these things. In order
to make visibility work in the way we require, we're almost always
just doing additional checking of infomask bits, and the t_infomask
variable is probably already in a CPU register (this is a huge
simplification, but is essentially true). Like you, I have noted that
HeapTupleSatisfiesMVCC() is a fairly hot routine during profiling
before, but it's not *that* hot.

It's understandable that you raise these points, but from my
perspective it's hard to address your concerns without more concrete
objections.

> I don't like the
> syntax (returns rejects? blech)

I suppose it isn't ideal in some ways. On the other hand, it is
extremely flexible, with many of the same advantages of SQL MERGE.
Importantly, it will facilitate merging as part of conflict resolution
on multi-master replication systems, which I think is of considerable
strategic importance even beyond having a simple upsert.

I would like to see us get this into 9.4, and get MERGE into 9.5.

> and I don't like the fact that
> getting the locking right, or even getting the semantics right, seems
> to be so darned hard. I think we're in real danger of building
> something that will be too complex, or just too weird, for users to
> use, and too complex to maintain as well.

Please don't conflate confusion or uncertainty around alternative
approaches with confusion or uncertainty around mine - *far* more time
has been spent discussing the former. While I respect the instinct
that says we ought to be very conservative around changing anything
that the btree AM does, I really don't think my design is itself all
that complicated.

I've been very consistent, even in the face of strong criticism. What
I have now is essentially the same design as back in early September.
After the initial ON DUPLICATE KEY IGNORE patch in August, I soon
realized that value locking and row locking could not sensibly be
considered in isolation, and over the objections of others pushed
ahead with integrating the two. I believe now, as I believed then,
that value locks need to be cheap to release (or that it at least
needs to be possible), and that it is okay to drop all value locks
when we need to deal with a possible conflict by getting an xid shared
lock. If those unlocked pages have separate conflicts on our next
attempt, either the feature is being badly misused (for upserting), or
it doesn't matter, because we only need one conclusive "No" answer
(for IGNOREing, but also for upserting).

I have been trying to focus attention on these aspects throughout this
discussion. I'm not sure how successful I've been.

--
Peter Geoghegan
