From: | Peter Geoghegan <pg(at)heroku(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Andres Freund <andres(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Making joins involving ctid work for the benefit of UPSERT |
Date: | 2014-07-23 23:35:46 |
Message-ID: | CAM3SWZR1TA3UQQD=YNqa5LH4_2-oFqy8bacOk_q4=q0AzymBiw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Jul 23, 2014 at 3:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> To the last question, yes. To the first point, I'm not bent on this
> particular syntax, but I am +1 for the idea that the command is
> something specifically upsert-like rather than something more generic
> like an ON CONFLICT SELECT that lets you do arbitrary things with the
> returned rows.
FWIW I wasn't proposing that you'd be able to do arbitrary things.
There'd be a few restrictions, such as forbidding unexpected DML
commands, and requiring that only a special update reference the
special rejected-projecting CTE. They just wouldn't be arbitrary
restrictions. But that's not that interesting to me anymore; you and
Andres want a dedicated DML command with the update part built in,
that isn't as flexible. Okay, fine. I don't think it makes the
implementation any easier, but that's what I'll do.
> It's certain arguable whether you should INSERT and then turn failures
> into an update or try to UPDATE and then turn failures into an INSERT;
> we might even want to have both options available, though that smells
> a little like airing too much of our dirty laundry. But I think I
> generally favor optimizing for the UPDATE case for more or less the
> same reasons Kevin articulated.
I don't see the connection between this and Kevin's remarks. And FWIW,
I don't see a reason to favor inserts or updates. Fortunately, what I
have balances both cases very well, and doesn't cause bloat. The work
of descending the index to lock it isn't wasted if an update is
required. My implementation decides to either insert or update at
literally the latest possible moment.
>> For one thing we cannot use any kind of scan unless there is a new
>> mechanism for seeing not-yet-visible tuples, like some kind of
>> non-MVCC snapshot that those scan nodes can selectively use. Now, I
>> guess that you intend that in this scenario you go straight to 5, and
>> then your insert finds the not-yet-visible conflict. But it's not
>> clear that when 5 fails, and we pick up from 1 that we then get a new
>> MVCC Snapshot or something else that gives us a hope of succeeding
>> this time around. And it seems quite wasteful to not use knowledge of
>> conflicts from the insert at all - that's one thing I'm trying to
>> address here; removing unnecessary index scans when we actually know
>> what our conflict TIDs are.
>
> Here you seem to be suggested that I intended to propose your existing
> design rather than something else, which I didn't. In this design,
> you find the conflict (at most one) but scanning for the tuple to be
> updated.
Yes, but what if you don't see a conflict because it isn't visible to
your snapshot, and then you insert, and only then (step 5), presumably
with a dirty snapshot, you find a conflict? How does the loop
terminate if that brings you back to step 1 with the same MVCC
snapshot feeding the update?
>>> 2. If you find more than one tuple that is visible to your scan, error.
>>
>> This point seems to concern making the UPSERT UPDATE only affect zero
>> or one rows. Is that right? If so, why do you think that's a useful
>> constraint to impose?
>
> Because nobody wants an operation to either insert 1 tuple or update
> n>=1 tuples. The intention is that the predicate should probably be
> something like WHERE unique_key = 'some_value', but you can use
> something else. So it's kinda like saying which index you care about
> for a particular operation, except without having to explicitly name
> an index. But in any event you should use a predicate that uniquely
> identifies the tuple you want to update.
I agree that you want to uniquely identify each tuple. What I meant
was, why should we not be able to upsert multiple rows in a single
command? What's wrong with that?
>> Point 4b ("if there is more than one such tuple...") seems like it
>> needs some practical or theoretical justification - do you just not
>> want to follow an update chain? If so, why? What would the error
>> message actually be? And, to repeat myself: Why should an MVCC
>> snapshot see more than one such tuple in the ordinary course of
>> scanning a relation, since there is by definition a unique constraint
>> that prevents this? Or maybe I'm wrong; maybe you don't even require
>> that. That isn't clear.
>
> In case (4b), like in case (2), you've done something like UPSERT tab
> SET ... WHERE non_unique_index = 'value_appearing_more_than_once'.
> You shouldn't do that, because you have only one replacement tuple
> (embodied in the SET clause).
Oh, I totally agree that we should throw an error if any one row is
affected more than once by the same command. Indeed, SQL MERGE
requires this, since to do otherwise would leave the final value of
the row unspecified. It seemed to me from your original explanation of
point 4b that you were saying something much stronger. But if that's
all you meant, then I agree.
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2014-07-23 23:36:43 | Re: PDF builds broken again |
Previous Message | Alvaro Herrera | 2014-07-23 23:32:13 | Re: Making joins involving ctid work for the benefit of UPSERT |