From: Chris Angelico <rosuav(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Lock/deadlock issues with priority queue in Postgres - possible VACUUM conflicts
Date: 2012-01-09 03:58:02
Message-ID: CAPTjJmre8jFOj2Lx3O01qe+-qqdGsY4iR5eOqzxN20-HCXYFJw@mail.gmail.com
Lists: pgsql-general
This has got to be a solved problem, but I can't find a simple example
of what to do, nor a description of what I'm doing wrong.
In this project, we need to have a disk-based priority queue, with a
few snazzy features (e.g. limited retries and scheduling) but nothing
particularly exotic. Requests are added to the queue by any number of
different processes, and it's not a problem if a request "misses the
check" and is delayed by a few seconds. Requests are processed by a
pool of workers which are unaware of each other (and may be running on
different computers); the sole arbiter is the database itself.
Everything's currently on a single Debian 6.0.3 laptop, with Postgres
(the .deb package from openscg.org) set to max_connections = 5100, and
the kernel reconfigured to allow 1GB shared memory and "500 32000 32
1024" semaphores. 'select version()' gives:
PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
The current (PHP) implementation involves a 'claimed' column (NULL for
unclaimed, otherwise an integer worker id, which allows crashed
workers to be detected) and the following query:
"update interfaceout set
claimed=$workerid,lastretry=now(),retrycount=retrycount+1 where
claimed is null and id=(select id from interfaceout where
claimed is null
and not done
and scheduletime<=now()
order by priority,random() limit 1) returning *"
This query is performed (in autocommit mode), then some (currently
minimal) processing is done, and then a second query "closes off"
the call:
"update interfaceout set claimed=null,done=$1,response=$2,success=$3
where id=$4"
with parameters set according to whether the call was completed (if
not, it gets retried) and the results of the processing. Another
process is then signalled to handle the response, and that process
will delete the row.
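(For completeness: recovering the work of a crashed worker is just a
matter of clearing its claims - something like this, where
$deadworkerid comes from whatever liveness check we settle on:)

    update interfaceout set claimed=null
     where claimed=$deadworkerid and not done;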
With a single worker process, this works fine; but with 16K calls
and 50 workers (not ridiculously huge numbers by any means), lock
contention becomes a major issue, and there are even occasional
deadlocks. Performance plummets. I'm not entirely sure, but problems
seem to start when autovacuum kicks in.
Monitoring the pg_locks table (joined with pg_stat_activity to see
the query currently in progress) shows that all the time is being
spent on the first query, the one attempting to set claimed; the
other queries almost never show up in the list.
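(Roughly this, in case my methodology is at fault - on 9.1 the
columns are procpid and current_query:)

    select a.procpid, a.current_query, l.locktype, l.mode, l.granted
      from pg_locks l
      join pg_stat_activity a on a.procpid = l.pid
     where not l.granted;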
Researching the issue brought up a few good links:
http://stackoverflow.com/questions/389541/select-unlocked-row-in-postgresql
http://blog.endpoint.com/2008/12/parallel-inventory-access-using.html
http://wiki.postgresql.org/wiki/PGQ_Tutorial
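If I've read the End Point post correctly, its approach boils down to
a claim loop along these lines (a sketch only, adapted to my table
and untested):

    begin;
    select * from interfaceout
     where claimed is null and not done and scheduletime<=now()
     order by priority, random()
     limit 1
     for update nowait;
    -- on error 55P03 (lock_not_available): rollback and retry;
    -- otherwise, set claimed/lastretry/retrycount on the locked
    -- row and commit.

That avoids blocking, but presumably means each worker may loop
several times under contention.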
PGQ looks promising, but I can't afford the risk of losing calls in
the event that there are no workers to process them (the correct
action is for them simply to languish in the database until one is
started up).
The StackOverflow thread lists a number of excellent ideas, most of
which are already implemented in the code above or are unnecessary
(for instance, each worker needs to acquire only one row - no need to
grab the top N). There's some interesting stuff on advisory locks;
are they the only way to do this safely?
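As best I can tell, the advisory-lock version of my claim query would
look something like this (untested - and I gather the planner is free
to evaluate pg_try_advisory_lock on rows the other conditions would
have filtered out, which would leak locks):

    update interfaceout set
         claimed=$workerid, lastretry=now(), retrycount=retrycount+1
     where id=(select id from interfaceout
                where claimed is null
                  and not done
                  and scheduletime<=now()
                  and pg_try_advisory_lock(id)
                order by priority, random()
                limit 1)
     returning *;
    -- ... process and close off the call as before, then:
    select pg_advisory_unlock(id);  -- same id as was claimed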
This is surely something that has been done before. Can anyone point
me to a good tutorial on the subject?
Thanks!
Chris Angelico