Re: High-Concurrency GiST in postgreSQL

From: "C(dot) Mundi" <cmundi(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: High-Concurrency GiST in postgreSQL
Date: 2011-12-05 21:34:00
Message-ID: CAPvS8Wa5trL6a+ndHhtbzAqSv5H_Kik8ZMoGtCqerx6Q9GUNAw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Mon, Dec 5, 2011 at 12:26 PM, Andy Colson <andy(at)squeakycode(dot)net> wrote:

> On 12/5/2011 12:31 PM, C. Mundi wrote:
>
>>
>> Hello. This is my first post. As such, feedback on style and choice of
>> venue are especially welcome.
>>
>> I am a regular but not especially expert user of a variety of databases,
>> including postgreSQL.
>> I have only modest experience with spatial databases.
>>
>> I have a new project[1] in which GiST could be very useful, provided I
>> can achieve high concurrency.<SNIP>
>>
>
> concurrency here can mean different things. One application hitting PG
> which then uses multiple threads? (Not currently possible) Or one app with
> multiple threads each having a database connection? (Which is really the
> same as) Multiple app's each having a database connection?
>
> PG limits one database connection to one cpu. Multiple connections will
> use multiple cpu.
>
> OR, by concurrency, do you mean, non-blocking? And if you mean
> non-blocking, is that for read's, write's, or both?
>
> In PG you can do non-blocking, multiple connections (ie multiple cpu),
> reads as much as you want.
>
> Extending to indexes: many connections can read a gist index at the same
> time. Is that what you need?
>
> -Andy
>

Thanks, Andy. You're quite right of course. I'm thinking of concurrent
clients. Lots of them. I envision thousands of actors (which could be
threads within a process or separate processes) behaving as clients, each
with its own connection to a single-threaded database server. So
concurrency here simply means that the database has to be able to handle a
lot of "simultaneous" connections coming at it fast and asynchronously.
(I'm prepared to deploy a thin queuing layer if it turns out that I
saturate the server.) The compute nodes are doing heavy physics which
depends on spatially organized data, and it is very difficult to predict
which rows an actor will need next. (In fact, knowing that would
presuppose that the problem to be solved could be factored at great savings
in computation.)

So what I really need is minimal locking, as in [Karnacker and Banks 1995].
The whole database can be pre-loaded before the start of the calculation.
Now about 80% of the data in the database will never change during a run.
But about 20% will change via "feedback" from the compute nodes. And the
nature of the problem is that we do not know *a priori* which data is in
the 80% and which is in the 20%. If we did, we could split the database to
ensure no block-on-write impact on reads for the 80%. Alas, we have to
assume that reads and writes are mixed with statistics yet unknown.

The 20% which changes changes spatially, not just in content. This can
lead to the need to rebalance on inserts. And since splitting nodes in
"naive" R* trees is kind of expensive [1], I am wondering to what extent
the "sibling links" approach described by Karnacker and Banks has -- as
anticipated by Hellerstein *et al.* -- already been implemented in GiST in
postgreSQL. If it has, then I win just by using the 'cube' contrib
extension. Hellerstein notes that sibling pointers have long been in
common use even in B-trees; so I am optimistic for GiST.

So that's my concern. I'm doing 80% reads which are all non-blocking with
20% writes mixed in, and I need to avoid the effect of writes blocking
queries which do not need to traverse branches affected by the write.

Thanks!
Carlos

[1] Even if inserts were not potentially expensive for the database, the
prospect that an insert triggered by one compute node could occasionally
cause *all* the compute to stall when not logivally necessary is horrifying.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message John R Pierce 2011-12-05 21:41:47 Re: High-Concurrency GiST in postgreSQL
Previous Message Tom Lane 2011-12-05 20:00:57 Re: disallow SET WORK_MEM