Quick Links

Re: idea for a geographically distributed database: how best to implement?

From:	Andy Ballingall <andy(at)areyoulocal(dot)co(dot)uk>
To:	"'codeWarrior'" <gpatnude(at)hotmail(dot)com>, <pgsql-sql(at)postgresql(dot)org>
Subject:	Re: idea for a geographically distributed database: how best to implement?
Date:	2005-11-17 17:01:38
Message-ID:	ECOWS01M0A65lEzwZAc0003bb49@smtp-out1.blueyonder.co.uk
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-sql

Hi again,

I don't think I've explained my idea well enough:

>You might want to consider using latitude and longitude or zip codes or
taking more of a traditional "GIS" approach rather than duplicating data
across redundant databases.

I do use coordinates to position data. The databases aren't redundant. Each
database serves a particular region, the aim being to ensure that no one
database gets too big for the machines in my farm to cope with.

>Another issue is that you end up having to query every database to find
proximity... Suppose you have 500 "cells" ? You now have to invoke some sort

of RPC mechanism on 499 other RDBMS, manage 500 user names and passwords,
retrieve potentially 500 recordsets, merge the data, etc...

1. It doesn't matter how many cells there are. A user's details are only
found in the database covering the area he lives, and possibly in the
neighbouring databases (at most 3, if he lives in the corner of the cell).

2. The neighbouring cells of cell A never change, therefore you don't need
to do anything clever to identify which other databases to write to in these
edge conditions, since when the user logs in, the relationship information
is read just once (just like the user's name and picture)

3. No merging ever needs to take place during queries. Here's an example.
Let's say, you, a resident of cell A, say 'show me who lives nearby!'. The
database of cell A has *all* the information. Why? Well, when a resident in
nearby cell B registered, his details were also added to cell A at that
time. Database A doesn't know, or care, how the data it holds got there.

4. Which database do I use to serve all your requests? Well, yes, once, I
have to work out which one it is, based on where you are, but that's almost
no work at all (see next point...)

5. I didn't mention before that my scheme *does* also require a national
database, but the size of that is well defined, as the only jobs it needs to
do are:
a) Know the names of all the places,
b) Know the definition of the existing cells, and their positions - to make
it easy to manage the cell structure, and to know where to send you when you
login
c) A set of global sequences used to generate unique ids for all the other
databases. When your record goes into cell A and cell B, the id of this
record is the *same*...

>Your problems will continue to get more and more complex... You are better
off with a single RDBMS and a single schema...

I currently have a single database. It is clearly simpler. However, I also
have a headache about what machinery to buy so that:
a) it starts off cheap
b) It is easy to expand, if and when the website takes off.

Yes, if I knew how many people would use this service, and how quickly it
would expand, I'd be able to raise the capital to buy a huge nationwide
server.

But I don't. So the answer 'how big does your DB server need to be' is
simply 'I can only make a very poor guess'. This doesn't look good in a
business plan. Far better if I can demonstrate an application which can
start off on a single simple server, and can elegantly expand with the user
base, with the farm of servers growing over time.

Unlike many data sets, mine is almost totally partitioned geographically.
There is only *one* little detail - that of visibility of data in
neighbouring cells, and that is sorted out with my idea of duplicating
information between neighbours.

Hope that fills in some gaps...

Thanks for your comments,
Andy

"Andy Ballingall" <andy(at)areyoulocal(dot)co(dot)uk> wrote in message
news:ECOWS04Mp8nkfZyufzT00010fe4(at)smtp-out4(dot)blueyonder(dot)co(dot)uk(dot)(dot)(dot)
> Hello,
>
> I've got a database for a website which is a variant of the 'show stuff
> near
> to me' sort of thing.
>
> Rather than host this database on a single server, I have a scheme in mind
> to break the database up geographically so that each one can run
> comfortably
> on a small server, but I'm not sure about the best way of implementing it.
>
> Here's the scheme:
>
> --------------------------------
> Imagine that the country is split into an array of square cells.
> Each cell contains a database that stores information about people who
> live
> in the area covered by the cell.
>
> There's one problem with this scheme. What happens if you live near the
> edge
> of a cell?
>
> My solution is that any inserted data which lies near to the edge of cell
> A
> is *also* inserted in the database of the relevant neighbouring cell -
> let's
> say cell B.
>
> Thus, if someone lives in cell B, but close to the border with cell A,
> they'll see the data that is geographically close to
> them, even if it lies in cell A.
>
> --------------------------------
>
> Is this a common pattern?
>
> I could, of course, simply find every insert, update and delete in the
> application and alter the code to explicitly update all the relevant
> databases, but is there a more elegant way of simply saying: "Do this
> transaction on both Database A and Database B" monotonically?
>
> I've had a look at some replication solutions, but they all seem to
> involve
> replicating an entire database. The advantage of my scheme is that if I
> can
> distribute my application over large numbers of small servers, I'll end up
> with more bangs for the buck, and it'll be much easier to manage growth by
> managing the number of servers, and number of cells hosted on each server.
>
> Thanks for any suggestions!
> Andy Ballingall
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.362 / Virus Database: 267.13.3/173 - Release Date: 16/11/2005

In response to

Re: idea for a geographically distributed database: how best to implement? at 2005-11-17 16:00:43 from codeWarrior

Browse pgsql-sql by date

	From	Date	Subject
Next Message	Andy Ballingall	2005-11-17 17:05:53	Re: idea for a geographically distributed database: how best
Previous Message	codeWarrior	2005-11-17 16:00:43	Re: idea for a geographically distributed database: how best to implement?