From: Andrew Kroeger <andrew(at)sprocks(dot)gotdns(dot)com>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Marco Colombo <pgsql(at)esiway(dot)net>, Postgres general mailing list <pgsql-general(at)postgresql(dot)org>
Subject: Re: Delete/update with limit
Date: 2007-07-26 11:30:19
Message-ID: 46A885CB.2050000@sprocks.gotdns.com
Lists: pgsql-general
Csaba Nagy wrote:
> On Tue, 2007-07-24 at 19:06, Stephan Szabo wrote:
>>> Unfortunately I don't think this will work. Multiple backends will happily
>>> pick up the same ctid in their selects and then try to delete the same
>>> records.
>> I'm pretty sure he said that the batch processing (and the delete) would
>> only be happening from one backend at a time, no concurrency on that
>> portion, merely concurrency with the large volume of inserts.
>
> Yes it's exactly like that... only it also happened accidentally that 2
> batch processes started at the same time, and they should not double
> process the data, nor lose some of it. The above scheme is OK with that
> too... but the array version from Tom is even better :-)
>
> Regarding the proposed mark/process/delete version, we've done it that
> way, and we always managed to get some corner case which lost us data...
> so even if it's possible to do it well, it's definitely not easy. The
> delete/copy/process private data version is much safer, and it actually
> can be done in one transaction to assure crash safety.
After reading through this thread, I have an idea that should accomplish
what I believe are your three main goals: avoid any negative performance
impact on the user's inserts, do not lose any data from those inserts,
and do not double-process any data. It may also improve performance,
although I'm not sure what the overhead is for triggers, so there may not
be any benefit on that front.
One-time changes that must be done before running the batch process:
A. Create a persistent table (I'll refer to this table as "stage") to
hold records that are pulled from the table that the user data is
inserted into (I'll refer to this table as "user"). This table will
have one extra column (say "orig_ctid") of type tid; a rough sketch of
the DDL follows this list.
B. Remove the delete trigger from the user table.
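For example, one way to create the staging table (untested, and assuming
the table really is named "user", which has to be double-quoted because
user is a reserved word in PostgreSQL):

-- Same columns as "user", plus the original ctid of each copied row.
create table stage (
    like "user",
    orig_ctid tid
);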
The actual batch process (a combined SQL sketch follows the numbered steps):
1. Start an explicit transaction with serializable isolation level
2. Get an exclusive lock on the stage table. This will prevent any
other batch processes from running concurrently. The serializable
isolation level ensures that if a given batch process has to wait for
the lock, it will see all changes made by the prior batch process run.
3. Select records (including ctid) from user with a limit clause and
insert directly into stage (note that user is a reserved word, so the
table name has to be double-quoted):
insert into stage select *, ctid as orig_ctid from "user" limit 10;
4. Remove records from user that were just inserted into stage (not sure
of performance here, as it's not a self-join):
delete from "user" where ctid = any (array(select orig_ctid from stage));
5. Continue normal processing of records in the stage table.
6. Truncate the stage table.
7. Commit the transaction. This ensures all data is processed (and only
processed once).
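Putting it all together, a rough and untested sketch of the batch (again
assuming the table is literally named "user" and a batch size of 10):

begin isolation level serializable;

-- Step 2: serialize batch runs. Only the batch touches stage, so this
-- does not block the user's inserts (access exclusive also covers the
-- truncate below).
lock table stage in access exclusive mode;

-- Step 3: pull an arbitrary batch of rows into the staging table.
insert into stage
    select *, ctid as orig_ctid from "user" limit 10;

-- Step 4: delete exactly those rows from "user" by their original ctid.
delete from "user"
    where ctid = any (array(select orig_ctid from stage));

-- Step 5: process the rows in stage here (or from the application).

-- Step 6: empty the staging table for the next run.
truncate stage;

-- Step 7: commit, so the whole batch is atomic.
commit;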
With all that said, this isn't something I've actually done in PG. I've
done similar things in other databases, so the concept should be sound.
Hope this helps.
Andrew