Re: How can I prevent duplicate inserts by multiple concurrent threads ?

From: Daniel Staal <DStaal(at)usa(dot)net>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: How can I prevent duplicate inserts by multiple concurrent threads ?
Date: 2015-01-07 20:31:20
Message-ID: EC059D342FB2C94591E50167@[192.168.1.50]
Lists: pgsql-novice

--As of January 7, 2015 3:01:42 PM -0500, Hursh Jain is alleged to have
said:

> Right, maybe that's what I should be doing. The reason I didn't do that
> initially was because creating new data can be *very* expensive. We have
> about 100 different chunks, and if, say, 10 threads hit the database and
> each thread re-creates 100 chunks, we will end up trying to save 100*10
> (1000) chunks, 900 of which will be rightly rejected because of the
> uniqueness constraint, so the time spent creating those chunks would
> have been wasted, to the annoyance of the clients.
>
> It would be nice to see if that data already exists before we try to
> re-insert it... but how? The "select for update" doesn't seem to be
> working as I hoped.

--As for the rest, it is mine.

It won't completely solve the issue, but take a look at savepoints and
transactions. My flow would probably be:

Start transaction
Select row to see if data exists
If so, skip
If not, create data and insert
End transaction
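
In rough SQL, that first flow might look like this - the table and column
names here are just placeholders for whatever your schema actually uses:

BEGIN;

-- See if the chunk is already there:
SELECT 1 FROM chunks WHERE chunk_key = 'abc123';

-- Only if the select returned nothing: generate the data, then
INSERT INTO chunks (chunk_key, payload)
VALUES ('abc123', '...the expensive data...');

COMMIT;

(Two threads can still both see "no row" and race to the insert; the
unique constraint rejects the loser, but at least you only pay the
generation cost when the select came up empty.)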

If you can generate the key without generating the data, I'd probably do:

Start transaction
Select row to see if data exists
If not, insert the key; else end transaction
Set savepoint
Create data
Update the key's row with the data, rolling back to the savepoint and
retrying if there is an issue
End transaction

(Though the 'select row' and 'insert key' of the second could probably be
combined to just 'insert key'.)
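
A sketch of that second flow, again with made-up names, and taking that
combined "insert key" shortcut:

BEGIN;

-- Claim the key up front; if another thread got there first, this
-- fails on the unique constraint and we simply roll back and stop.
INSERT INTO chunks (chunk_key) VALUES ('abc123');

SAVEPOINT before_data;

-- Generate the expensive data in the application, then attach it:
UPDATE chunks SET payload = '...generated data...'
 WHERE chunk_key = 'abc123';

-- If generation or the update fails:
--   ROLLBACK TO SAVEPOINT before_data;
-- then regenerate and retry the update.

COMMIT;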

Each transaction will get you an internally-consistent view of the
database, so the select will run against the same dataset you eventually
insert into. (Strictly, under PostgreSQL's default READ COMMITTED level
it's each *statement* that gets a consistent snapshot, and a concurrent
insert can still slip in between your select and your insert - the unique
constraint is the backstop for that.)
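
(If you want the whole transaction, rather than each statement, to see a
single snapshot, you can ask for that explicitly:

BEGIN ISOLATION LEVEL REPEATABLE READ;
-- select / insert / update as above
COMMIT;

Even then a concurrent commit can land between your snapshot and your
insert, so the unique constraint is still doing the real work.)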

My other question from your comments above is whether this should all
actually be in *one* table. If you are collecting a lot of different
information from different sources, it might work better to have separate
tables holding that data, and just reference them from this table - which
would then be quicker and cheaper to update. (I obviously don't know the
situation - I acknowledge that you might not be able to do that for any of
a hundred reasons, or that the gathering and correlating might be
generating something more unique to each row. It's just a thought from the
simplified description you've given.)
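
For instance - purely illustrative, guessing at the shape of your data:

CREATE TABLE source_data (
    data_id serial PRIMARY KEY,
    content text NOT NULL
);

CREATE TABLE chunks (
    chunk_key text PRIMARY KEY,               -- the uniqueness constraint
    data_id   integer REFERENCES source_data  -- just a pointer to the bulky part
);

That way the table you're racing over stays small and cheap to update.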

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
