Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Date: 2013-09-21 00:48:39
Message-ID: CAM3SWZSdjNTnUWhxxXKutSQ6beXKUc5j2rRno+32+jt2tHx2QQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 17, 2013 at 9:29 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Sep 14, 2013 at 6:27 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> Note that today there is no guarantee that the original waiter for a
>> duplicate-inserting xact to complete will be the first one to get a
>> second chance

> ProcLockWakeup() only wakes as many waiters from the head of the queue
> as can all be granted the lock without any conflicts. So I don't
> think there is a race condition in that path.

Right, but what about XactLockTableWait() itself? It only acquires a
ShareLock on the xid of the got-there-first inserter that potentially
hasn't yet committed/aborted. There will be no conflicts between
multiple second-chance-seeking blockers trying to acquire this lock
concurrently, and so in fact there is (what I guess you'd consider to
be) a race condition in the current btree insertion code. So my
earlier point -- that according an upsert implementation license to
optimize the ordering of retries across multiple unique indexes isn't
really inconsistent with how the current code handles a single unique
index insertion -- has not been invalidated.

EvalPlanQualFetch() and Do_MultiXactIdWait() also call
XactLockTableWait(), for similar reasons. In my patch, the later row
locking code used by INSERT...ON DUPLICATE KEY LOCK FOR UPDATE calls
XactLockTableWait() too.
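
To make the hazard concrete, here is roughly what the wait looks like
(a simplified sketch of lmgr.c's XactLockTableWait() as it stands,
from memory -- check the source for the authoritative version). Every
transaction takes ExclusiveLock on its own xid at startup, so waiters
queue up for ShareLock; when the inserter finishes, all of those
ShareLock requests are mutually compatible and get granted together,
with nothing ordering who gets to retry first:

    void
    XactLockTableWait(TransactionId xid)
    {
        LOCKTAG     tag;

        for (;;)
        {
            Assert(TransactionIdIsValid(xid));
            Assert(!TransactionIdEquals(xid, GetTopTransactionIdIfAny()));

            SET_LOCKTAG_TRANSACTION(tag, xid);

            /* Blocks until the other xact releases the ExclusiveLock
             * on its own xid by committing or aborting. */
            (void) LockAcquire(&tag, ShareLock, false, false);

            LockRelease(&tag, ShareLock, false);

            if (!TransactionIdIsInProgress(xid))
                break;
            /* The xid we waited on was a subtransaction; wait on its
             * parent next. */
            xid = SubTransGetParent(xid);
        }
    }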

>> So far it's
>> been a slippery slope type argument that can be equally well used to
>> argue against some facet of almost any substantial patch ever
>> proposed.
>
> I don't completely agree with that characterization, but you do have a
> point. Obviously, if the differences in the area of interruptibility,
> starvation, deadlock risk, etc. relative to the status quo can be made
> small enough, then those aren't reasons to reject the approach.

That all seems fair to me. That's the standard that I'd apply as a
reviewer myself.

> But I'm skeptical that you're going to be able to accomplish that,
> especially without adversely affecting maintainability. I think the
> way that you're proposing to use lwlocks here is sufficiently
> different from what the rest of the system does that it's going to be
> hard to avoid system-wide effects that can't easily be caught during
> code review;

Fair enough. In case it isn't already totally clear to someone, I
concede that it isn't going to be workable to hold even shared buffer
locks across all these operations. Let's get past that, though.

> and like Andres, I don't share your skepticism about
> alternative approaches.

Well, I expressed skepticism about one alternative approach in
particular, which is the promise tuples approach. Andres seems to
think that I'm overly concerned about bloat, but I'm not sure he
appreciates why I'm so sensitive to it in this instance. I'll be
particularly sensitive to it if value locks need to be held
indefinitely rather than there being a speculative
grab-the-value-locks attempt, because that considerably increases the
window in which another session can force us to retry at row locking
time (see below).

> I think the fundamental
> problem with UPSERT, MERGE, and this proposal is what happens when the
> conflicting tuple is present but not visible to your scan, either
> because it hasn't committed yet or because it has committed but is not
> visible to your snapshot.

Yeah, you're right. As I mentioned to Andres already, when row locking
happens and there is this kind of conflict, my approach is to retry
from scratch (go right back to before value lock acquisition) in the
sort of scenario that generally necessitates EvalPlanQual() looping,
or to throw a serialization failure where that's appropriate. After an
unsuccessful attempt at row locking there could well be an interim
wait for another xact to finish, before retrying (at read committed
isolation level). This is why I think that value locking/retrying
should be cheap, and should avoid bloat if at all possible.
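
To spell out the control flow I have in mind, here's a hand-wavy
sketch (the helper names are made up for illustration -- they are not
what the patch actually calls anything):

    /* Hypothetical sketch of the retry loop; helper names invented. */
    for (;;)
    {
        TransactionId   conflict_xid;
        ItemPointerData conflict_tid;

        /* Speculatively grab value locks on all unique indexes, and
         * insert if no conflict exists. */
        if (acquire_value_locks_and_insert(rel, slot, &conflict_xid,
                                           &conflict_tid))
            break;              /* no duplicate -- insertion succeeded */

        /* Duplicate found: give the value locks back before blocking,
         * which is why value locking needs to be cheap and bloat-free. */
        release_value_locks(rel);

        /* Possible interim wait for the conflicting xact to finish. */
        if (TransactionIdIsValid(conflict_xid))
            XactLockTableWait(conflict_xid);

        /* Try to lock the conflicting row. */
        if (lock_conflicting_row(rel, &conflict_tid))
            break;              /* row locked -- proceed to update */

        /* Row was concurrently updated, or isn't visible: at higher
         * isolation levels that's a serialization failure ... */
        if (IsolationUsesXactSnapshot())
            ereport(ERROR,
                    (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                     errmsg("could not serialize access due to concurrent update")));

        /* ... while at read committed we go right back to before
         * value lock acquisition and start over. */
    }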

Forgive me if I'm making a leap here, but it seems like what you're
saying is that the semantics of upsert that one might naturally expect
are *arguably* fundamentally impossible, because they entail
potentially locking a row that isn't current to your snapshot, and you
cannot throw a serialization failure at read committed. I respectfully
suggest that that exact definition of upsert isn't a useful one,
because other snapshot isolation/MVCC systems operating within the
same constraints must have the same issues, and yet they manage to
implement something that could be called upsert that people seem happy
with.

> I also discovered, after it was committed, that it didn't help in the
> way I expected:
>
> http://www.postgresql.org/message-id/CA+TgmoY8P3sD=oUViG+xZjmZk5-phuNV39rtfyzUQxU8hJtZxw@mail.gmail.com

Well, at the time you didn't also provide raw commit latency benchmark
results for your hardware using a tool like pg_test_fsync, which I'd
consider absolutely essential to such a discussion. That's mostly or
entirely what the group commit stuff does - amortize that cost among
concurrently flushing transactions. Around this time, Heikki said
that the patch just relieved lock contention around WALWriteLock -
the 9.2 release notes say much the same. I never understood it that
way myself, though Heikki disagreed with me on that [1].

Certainly, if relieving contention was all the patch did, then you
wouldn't expect the 9.3 commit_delay implementation to help anyone,
but it does: with a slow fsync, holding the lock 50% *longer* can
actually help tremendously. So I *always* agreed with you that there
was hardware where group commit would barely help with a moderately
sympathetic benchmark like the pgbench default. Not that it matters
much now.
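
(To put made-up numbers on that: if pg_test_fsync reports 10ms per
flush, ten backends each flushing separately pay something like 100ms
of WAL flush time between them. A leader that sleeps for commit_delay
= 5000 -- microseconds, i.e. half the raw sync time, which as I
recall is the starting point the 9.3 docs suggest -- holds
WALWriteLock 50% longer per acquisition, but flushes once on behalf
of all ten, for roughly 15ms in total.)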

> It's true that I didn't raise those concerns contemporaneously with
> the commit, but I didn't understand the situation well enough at that
> time to realize how narrow the benefit was.
>
> I've wished, on a number of occasions, to be able to add more lwlock
> primitives. The problem with that is that if everybody does it, we'll
> pretty soon end up with a mess.

I wouldn't go that far. The number of possible additional primitives
that are useful isn't that high, unless we decide that LWLocks are
going to be a fundamentally different thing, which I consider
unlikely.

> http://www.postgresql.org/message-id/CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com
>
> ...but nobody (including me) was very sure that was the right way
> forward, and it never went anywhere. However, I think the basic issue
> remains. I was sad to discover last week that Heikki handled this
> problem for the WAL scalability patch by basically copy-and-pasting
> much of the lwlock code and then hacking it up. I think we're well on
> our way to an unmaintainable mess already, and I don't want it to get
> worse. :-(

I hear what you're saying about LWLocks. I did follow the FlexLocks
stuff at the time myself. Obviously we aren't going to add new lwlock
operations if they have exactly no clients. However, I think that the
semantics implemented (weakening and strengthening of locks) may well
be handy somewhere else. So while I wouldn't go and commit that stuff
on the off chance that it will be useful, it's worth bearing in mind
going forward that it's quite possible to weaken/strengthen locks.
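
For the record, the semantics amount to something like the following
(a purely illustrative pseudo-implementation against the current
LWLock internals, not the actual code from my patch or FlexLocks):

    /* Illustrative only: atomically convert an exclusive LWLock hold
     * into a shared one, leaving no release/reacquire window that a
     * would-be exclusive locker could slip through. */
    void
    LWLockWeaken(LWLockId lockid)
    {
        volatile LWLock *lock = &(LWLockArray[lockid].lock);

        SpinLockAcquire(&lock->mutex);
        Assert(lock->exclusive == 1 && lock->shared == 0);
        lock->exclusive = 0;
        lock->shared = 1;
        /* Shared waiters at the head of the wait queue could also be
         * granted the lock here, as LWLockRelease() does. */
        SpinLockRelease(&lock->mutex);
    }

Strengthening is the reverse, except that the caller may have to wait
for other shared holders to drop out before it can be granted
exclusive access.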

[1] http://www.postgresql.org/message-id/4FB0A673.7040002@enterprisedb.com

--
Peter Geoghegan
