Re: using a postgres table as a multi-writer multi-updater queue

From: Tim Uckun <timuckun(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: John R Pierce <pierce(at)hogranch(dot)com>, PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: using a postgres table as a multi-writer multi-updater queue
Date: 2015-11-25 20:50:33
Message-ID: CAGuHJrNiALNZA58OmCCMQoOxz5R=m0reQrT8xN2RksvogCzyyg@mail.gmail.com
Lists: pgsql-general

I'll add my two cents.....

I set up something similar a while ago. Here are my suggestions for what
they are worth.

You don't ever want to delete from such a table, so you need to set up
something that lets you truncate (or drop) the tables when you no longer
need them.

One way to accomplish this is with rules (not triggers; rules are blazingly
fast compared to triggers). Set up a table inheritance scheme based on
whatever search criteria you have (date, sequence, etc.). Set up a cron job
to create the child tables well ahead of time and to set up the rules
redirecting the inserts. For example, let's say you have a date-partitioned
table and you want to keep a table for every day. Your cron job would run
once a day and would create the next seven days' worth of tables (just in
case the cron job fails to run for some reason) and would recreate the
routing rules, one per child table, whose WHERE clauses act as the
if/then/else logic. This method is preferable to building the table name
dynamically with string concatenation because again it's significantly
faster.
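
As a rough illustration, here is a minimal sketch of what the cron job
might emit (the parent table q, its created_at column, and all names here
are hypothetical, not from the original post):

CREATE TABLE q_20151125 (
    CHECK (created_at >= DATE '2015-11-25'
       AND created_at <  DATE '2015-11-26')
) INHERITS (q);

-- One rule per child table; the WHERE clause routes the insert.
-- Rows matching no rule fall through into the parent table q.
CREATE OR REPLACE RULE q_insert_20151125 AS
    ON INSERT TO q
    WHERE new.created_at >= DATE '2015-11-25'
      AND new.created_at <  DATE '2015-11-26'
    DO INSTEAD
        INSERT INTO q_20151125 VALUES (new.*);

When a day's table is no longer needed, you simply TRUNCATE or DROP it,
which is far cheaper than a bulk DELETE.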

Another method I tried was to have one "primary" child table and
"archival" child tables, and to insert directly into the primary child
table. For example, say you have a table called "Q". You set up a table
called Q_in which inherits from Q. Your code inserts into the Q_in table
and you select from the Q table. On a periodic basis you do this:

BEGIN;
LOCK TABLE q_in IN EXCLUSIVE MODE;  -- block writers during the swap
ALTER TABLE q_in RENAME TO q_some_datetime;
-- Recreate the hot table; it must inherit from Q again so that
-- selects against Q keep seeing the new rows.
CREATE TABLE q_in
    (LIKE q_some_datetime INCLUDING DEFAULTS INCLUDING CONSTRAINTS
     INCLUDING INDEXES)
    INHERITS (q);
-- Keep the shared id sequence owned by the hot table.
ALTER SEQUENCE q_in_id_seq OWNED BY q_in.id;
-- Set some constraints so the query optimizer knows what to do
COMMIT;
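
The constraint hinted at in that last comment would be a CHECK covering
the range the renamed table holds, so constraint exclusion can skip it
when queries against Q filter on that column (a sketch; the created_at
column and the literal bounds are hypothetical):

ALTER TABLE q_some_datetime
    ADD CONSTRAINT q_some_datetime_range
    CHECK (created_at >= TIMESTAMPTZ '2015-11-25 00:00'
       AND created_at <  TIMESTAMPTZ '2015-11-26 00:00');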

There is one other method, which is the Kafka approach (you can use this
in addition to the above methods).

Create N tables for the incoming queue, each with its own sequence for its
ID numbers. N should be determined by how many clients you expect to run.
Create a rule which uses some hash function, round robin, or randomness to
insert data into one of the tables. Create a different table which keeps
track of client connections; the clients use this table to keep track of
the last ID fetched. For example, say I have three types of processes that
run on the incoming data: p1, p2, p3. The table logs the highest ID fetched
from each queue table for each process. When a client connects, it takes
the queue table with the lowest recorded ID for its process and locks that
slot for the client (not for the process, because you can have multiple
clients running each process); it processes the records, updates the ID,
unlocks the slot, and backs off for a few seconds. The next client to wake
up goes through the same steps, and so on. Both Apache Kafka and Amazon
Kinesis use this approach. One nice thing about it is that you can put each
queue table in its own tablespace on its own disk for higher performance.
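
A minimal sketch of the tracking table and the claim/release steps (all
names are hypothetical; this is one way to express the locking, not the
original poster's exact design):

CREATE TABLE queue_offsets (
    queue_table text   NOT NULL,           -- which of the N queue tables
    process     text   NOT NULL,           -- p1, p2, p3 ...
    last_id     bigint NOT NULL DEFAULT 0, -- highest id already processed
    locked_by   text,                      -- client holding the slot, NULL if free
    PRIMARY KEY (queue_table, process)
);

-- A client claims the least-advanced free slot for its process
-- (on PostgreSQL 9.5+ you could add SKIP LOCKED to the subquery):
UPDATE queue_offsets
   SET locked_by = 'client-42'
 WHERE (queue_table, process) = (SELECT queue_table, process
                                   FROM queue_offsets
                                  WHERE process = 'p1'
                                    AND locked_by IS NULL
                                  ORDER BY last_id
                                  LIMIT 1
                                    FOR UPDATE)
RETURNING queue_table, last_id;

-- ... read rows with id > last_id from the returned queue table ...

-- When done, record progress and release the slot:
UPDATE queue_offsets
   SET last_id = 12345, locked_by = NULL
 WHERE queue_table = 'q_3'
   AND process = 'p1'
   AND locked_by = 'client-42';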

One other thing. This is dangerous, but you can turn off WAL logging for
the tables (unlogged tables); this will make the inserts much faster, at
the price of the tables being emptied after a crash and not being
replicated to standbys.
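
For example (a sketch; the column layout is hypothetical):

CREATE UNLOGGED TABLE q_in (
    id         bigserial PRIMARY KEY,
    payload    jsonb,
    created_at timestamptz NOT NULL DEFAULT now()
);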

Finally:

There is no need to do any of this. Kinesis is cheap, Kafka is pretty
awesome, and RabbitMQ is crazy useful.

Hope this helps, ping me offline if you want more details.

Cheers.

On Tue, Nov 24, 2015 at 11:51 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> On Mon, Nov 23, 2015 at 12:54 PM, John R Pierce <pierce(at)hogranch(dot)com>
> wrote:
> > On 11/23/2015 2:41 AM, Chris Withers wrote:
> >>
> >>
> >> If it's totally wrong, how should I be looking to approach the problem?
> >
> > depending on where these queue entries are coming from, I'd consider
> > using a message queueing system like AMS, MQSeries, etc., rather than
> > trying to use a relational database table as a queue. your external
> > data source(s) would write messages to this queue, and you'd have
> > 'subscriber' processes that listen to the queue and process the
> > messages, inserting persistent data into the database as needed.
>
> I just don't agree with this generalization. Keeping the state of the
> queue in the database has a lot of advantages and is a lot easier to
> deal with from a programming perspective especially if SQL is your
> core competency. Being able to produce and consume in SQL based on
> other relational datasources is...elegant.
>
> Specialized queue systems are a very heavy dependency and adding a new
> server to your platform to manage queues is not something to take
> lightly. This advice also applies to scheduling systems like Quartz,
> specialized search like Solr and Elasticsearch, and distributed data
> platforms like Hadoop. I've used all of these things and have tended
> to wish I had just used the database instead in just about every case.
>
> Also, check out pgq (https://wiki.postgresql.org/wiki/PGQ_Tutorial).
> Personally, I tend to roll my own queues. It's not difficult.
>
> merlin
