Re: upsert doesn't seem to work..

From: Ben Morrow <ben(at)morrow(dot)me(dot)uk>
To: biertie(at)gmail(dot)com, pgsql-sql(at)postgresql(dot)org
Subject: Re: upsert doesn't seem to work..
Date: 2013-02-17 09:44:15
Message-ID: 20130217094411.GA28188@anubis.morrow.me.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Quoth biertie(at)gmail(dot)com (Bert):
>
> We continuously load data from flat files in our database.
> We first insert the data into unlogged tables (in the loadoltp schema), and
> then we use the 'upsert' statement to transfer the data from the load table
> into the tables we are going to use.
>
> The load tables are unlogged, and don't have indexes / pk's on them. All
> our 'real tables', which contains the data, always have a pk consisting out
> of 2 fields. In the example those are 'tick_server_id' and 'item_id'.
>
> At first everything seems to run ok, however it seems that new fields
> aren't always inserted as desired.
>
>
> This is an example query which causes troubles:

That query is basically equivalent to something like

create table "st_item" (
server_id integer,
item_id integer,
item_desc text,
primary key (server_id, item_id)
);
create table "st_item_insert" (
server_id integer,
item_id integer,
item_desc text
);

with "upsert" as (
update "st_item" et
set "item_desc" = e.item_desc
from "st_item_insert" e
where et.server_id = e.server_id
and et.item_id = e.item_id
returning et.server_id, et.item_id
)
insert into "st_item"
("server_id", "item_id", "item_desc")
select et.server_id, et.item_id, et.item_desc
from "st_item_insert" et
where et.server_id not in (
select et.server_id
from "upsert" b)
and et.item_id not in (
select et.item_id
from "upsert" b)

There are three problems here. The first is that the NOT IN subselect
selects from et instead of from b. In the context of this subselect "et"
is a table reference from outside the subselect, so it's treated as a
constant for each run of the subselect. That means that the subselect
will return the value you are testing against for every row in "upsert",
so if there were any updates at all you will make no insertions.

The second is that you are making two separate subselects. This means
that a row in st_item_insert will not be inserted if there is a row in
"upsert" with a matching server_id and a row in "upsert" with a matching
item_id, *even if they are different rows*. For instance, suppose
st_item_insert has

2 1 foo
1 2 bar
2 2 baz

and the 'foo' and 'bar' entries get updated. The 'baz' entry will then
not get inserted, because the first subselect will find the 'foo' row
and the second will find the 'bar' row. What you need is a single row
subselect, like this:

where (et.server_id, et.item_id) not in (
select "server_id", "item_id"
from "upsert")

The third is that upsert is not as simple as you think. It isn't
possible (at least, not in Postgres) to take a lock on a row which
doesn't exist, so it's possible that a concurrent transaction could
insert a row with a conflicting key between the time the UPDATE runs and
the time the INSERT runs. You need to either lock the whole table or use
the retry strategy documented in the 'Trapping Errors' section of the
PL/pgSQL documentation. Annoyingly, even 9's serializable transactions
don't seem to help here, at least not by my experiments.

Ben

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Ben Morrow 2013-02-17 10:50:59 Re: Perform Function When The Rows Of A View Change
Previous Message Ben Morrow 2013-02-17 07:58:59 Volatile functions in WITH