Re: [EXTERNAL]: Re: UPSERT in Postgres

From: Louis Tian <louis(dot)tian(at)aquamonix(dot)com(dot)au>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: [EXTERNAL]: Re: UPSERT in Postgres
Date: 2023-04-07 00:49:35
Message-ID: SY4P282MB10529BE09B553228FAA71163A6919@SY4P282MB1052.AUSP282.PROD.OUTLOOK.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Peter,

Thanks for your reply. Appreciate the help and discussion.

> In general UPSERT (or any definition of it that I can think of) does
> not imply idempotency.

"Idempotence is the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application." from Wikipedia.
the concept of Idempotence when applies to HTTP is consistent with the above. https://developer.mozilla.org/en-US/docs/Glossary/Idempotent. Or are you going by a different defintion that I am not aware of?
If you execute the same upsert multiple times, the state of the database remains the same as if only execute once.
If a row already exists, the first statement will update the row so does any subsequent statements. executing the same update multiple time is the same as executing it only once.
If the row doesn't exist, the first statement will insert that row and any subsequent will try to update, but the update has no real effect since it the value is exactly the same as the insert.
So by defintion, upsert is idempotent.

> It could just be a unique index or a unique constraint. So you can
> upsert on any individual unique constraint/index, or the primary key.
> Of course there might be several on a given table, but you can only
> use one as the "conflict arbiter" per statement.

Understand that I can use any unique constraint with on conflict. 
But semantically the only correct one is the primary key, since that's what identifies a row logically.
In that sense, any unique column(s) is a potential candidate for primary key.
It's more of a pedantic point rather than pragmatic one.
It's less of a problem for PostgreSQL where the semantic importance of primary key is not manifested at implementation level, since all index points to the tuple directly
Whereas it is more import for Databaes like MySQL where the secondary index points to the primary key index.

> It sort of has to work that way, though. In general your example might
> *not* fail, due to a row-level before trigger in the insert path.

Do you mean having the trigger to check whether new."name" is set if not use the old."name" as fallback ?

> Why doesn't your proposed upsert syntax have the same problem? I mean,
> how could it not? I guess it doesn't if you assume that it'll never
> take the insert path with your not NULL constraint example?

Use some pesudo code might be helpful here to explain the difference. 

How on conflict works at the moment.

try {
 insert row
} catch (duplicated key error) {
  update row
}

How I think it upsert should work

if (new.id exists) {
 update row 
} else {
 insert row
}

I would argue that later is a correct form of upsert given it's definition. 
The two are not equivalent when there is not null constraint on any non primary key column.

> But if you know that for sure, why not just use a regular update statement? 
Yes, in general it is not know whether the insert or update path should be taken. 

> On the other hand, if you're not sure if the insert path can be taken,
> then why is it actually helpful to not just throw an error at the
> earliest opportunity?
I am not expecting an error here. The problem is with no conflict it always go down the insert path first and results in a not null constraint error.
While I am expecting the insert is never executed in the first place when that row already exist (as identified by it primary key). So the update execute without error.
I hope the pesudo code above is enough to clarify the difference?

> The MERGE command has various race conditions that are particularly
> relevant to UPSERT type use cases. See the wiki page you referenced
> for a huge amount of information on this.
Thanks for the pointer. Reading into it.

Cheers,
Louis Tian

From: Peter Geoghegan <pg(at)bowt(dot)ie>
Sent: Friday, April 7, 2023 6:58 AM
To: Louis Tian <louis(dot)tian(at)aquamonix(dot)com(dot)au>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: [EXTERNAL]: Re: UPSERT in Postgres
 
CAUTION: This email originated from outside of Envirada. Do not click links or open attachments unless you recognize the sender and know the content is safe.

On Thu, Apr 6, 2023 at 1:21 PM Louis Tian <louis(dot)tian(at)aquamonix(dot)com(dot)au> wrote:
> An implicit assumption behind this definition is that table must have a primary key for the upsert operation to make sense since it's the primary key that uniquely identifies a row.

It could just be a unique index or a unique constraint. So you can
upsert on any individual unique constraint/index, or the primary key.
Of course there might be several on a given table, but you can only
use one as the "conflict arbiter" per statement.

> But can we achieve the same effect with "INSERT ... ON CONFLICT DO UPDATE" like a lot of references on the internet seems to suggest.
>
> insert into person (id, name) values (0, 'foo') on conflict ("id") do update set id=excluded.id, name=excluded.name
> insert into person (id, is_active) values (0, true) on conflict ("id") do update set id=excluded.id, is_active=excluded.is_active
>
> Unfortunately. the second statement will fail due to violation of the not null constraint on the "name" column.
> PostgreSQL will always try to insert the row into the table first. and only fallback to update when the uniqueness constraint is violated.
> Is this behavior wrong? maybe not, I think it is doing what it reads quite literally.

It sort of has to work that way, though. In general your example might
*not* fail, due to a row-level before trigger in the insert path.

Why doesn't your proposed upsert syntax have the same problem? I mean,
how could it not? I guess it doesn't if you assume that it'll never
take the insert path with your not NULL constraint example? But if you
know that for sure, why not just use a regular update statement? On
the other hand, if you're not sure if the insert path can be taken,
then why is it actually helpful to not just throw an error at the
earliest opportunity?

Surely upsert means "update or insert", so why wouldn't the user expect
to see an error like this, independent of the specifics of the row in question?
Isn't the user tacitly saying "I don't specifically know if the update or insert
path will be taken in respect of any given row" by using ON CONFLICT
DO UPDATE in the first place?

> That being said, I have never had a need for the ON CONFLICT DO UPDATE statement other than where I need upsert.
> But using it as "upsert" is only valid when the table is absent of any NOT NULL constraint on it's non primary key columns.

I don't know what you mean by that. "Valid"?

> The MERGE command introduced in PG15 in theory can be used to do UPSERT properly that is void of the aforementioned limitation.
> The downside is it is rather verbose.

The MERGE command has various race conditions that are particularly
relevant to UPSERT type use cases. See the wiki page you referenced
for a huge amount of information on this.

> *Feature Request*
> Given that UPSERT is an *idempotent* operator it is extremely useful.

In general UPSERT (or any definition of it that I can think of) does
not imply idempotency.

--
Peter Geoghegan
--
Message  protected by MailGuard: e-mail anti-virus, anti-spam and content filtering.https://www.mailguard.com.au/mg
Click here to report this message as spam:
https://console.mailguard.com.au/ras/259w6P5THe/6Ld9hKnxHU77IFkjdp0Xsh/0.1

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Michael Loftis 2023-04-07 00:53:29 Re: Backup schema without data
Previous Message Atul Kumar 2023-04-07 00:40:30 Backup schema without data