Re: Physical sites handling large data

From: "Shridhar Daithankar" <shridhar_daithankar(at)persistent(dot)co(dot)in>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Physical sites handling large data
Date: 2002-09-15 10:21:05
Message-ID: 3D84AC69.1006.13BDF7AC@localhost
Lists: pgsql-general pgsql-hackers

On 14 Sep 2002 at 9:39, Justin Clift wrote:

> Hi Scott,
>
> Good move. :)
>
> Shridhar, any idea of the kind of demands they'll be placing on the
> database?

First of all, thank you all for your quick and helpful responses. Robert E. Bruccoleri sent me a description of his site, which gave me an idea of what PostgreSQL can do at that scale.

I spent more than three hours looking for URLs of such large installations and didn't find a single one via Google (maybe I am just bad at searching). From what Robert E. Bruccoleri tells me, there is a bigger PostgreSQL installation out there than any mentioned elsewhere.

I would request that people put up some information if they have such an installation; I would do the same, subject to permission from my client. Postgres deserves this sort of publicity. Such information is crucial when it comes to convincing government bodies, through LUGs etc., to consider open-source alternatives to commercial ones.

I understand that designing a database for such a site would require detailed knowledge of the application. I am pressing my friend/colleague to get as much information out of it as possible.

Now to answer your queries..

> For example, does your friend have an idea of:
>
> a) how many clients will be simultaneously connecting to the database(s)
> normally, and at peak times

AFAIK, this job is analysis of log data (telecom domain), so the number of clients would not be large unless some parallelism is extracted at the application level. Most importantly, the number of clients will not fluctuate much, so it gives a fairly deterministic level of prediction, so to say.

> b) how sensitive to performance lag and downtime is the application?

Hardly. This is a replica of the production system, so data loss is not going to be an issue. Of course, if the database pulls wrong data and hence produces wrong results, that is obviously unacceptable, but I guess things won't come down to that level.

There are no set performance goals as of now. Obviously, faster is better.

> c) what are the data integrity requirements? Large array's of data that
> are mission critical need to be treated differently than small arrays,
> especially when taking b) into consideration. Higher end non-intel
> servers generally have better features in their OS and hardware for
> dealing with large amounts of important data.

IMO, if I have to write an evaluation and proposal for this task, Intel hardware won't feature in it, simply because Intel does not have a proven 64-bit CPU. Personally, I would recommend HP hardware, knowing the client's profile.

> d) what kind of stuff is your friend familar with? For example, is he
> ok with unix in general, etc?

That's OK. These guys are heavily into HP-UX, so that's not a problem. Postgres is on their radar as a way to save the licensing costs of an Oracle installation.

> "scott.marlowe" wrote:
> > Often times the real limiter for database performance is IO bandwidth and
> > subsystem, not the CPUs. After that memory access speed and bandwidth are
> > very important too, so I can see a big HP UX box beating the pants off of
> > a Xeon.

Perfectly agreed. Slightly OT here: in the test I referred to in my original post, the Xeon machine had mirrored SCSI RAID and a 533 MHz FSB, while the HP-UX box had a single PA-8700/750 MHz CPU with a single SCSI disk.

IMO, even the bandwidth considerations were in the Xeon's favour. The only things in favour of the HP box were the 3 MB on-chip cache and the 64-bit RISC CPU. If that alone makes so much of a difference, Intel machines won't stand a chance for a long time, IMO.

> > My point being, spend more money on the drive subsystem than anything else
> > and you'll probably be fine, but postgresql may or may not be your best
> > answer. It may be better to use something like berkeley db to handle this
> > job than a SQL database.

Agreed, but with a few differences...

I am going to divide my pitch as follows.

1) Can PostgreSQL do the job?

Yes. Refer to the FAQ for PostgreSQL's limits, and to real-world installations such as the one Robert E. Bruccoleri described to me personally, as examples.

2) Why PostgreSQL?

a) It can do the job as illustrated by 1.

b) Architecture of postgres

Some strong points:

i) It's the most complete and useful open-source SQL implementation out there. If they want, they can customise it later. Using Berkeley DB might work at this level, but depending upon the complexity of the application (which I don't know much about), I would rather not propose it.

ii) It does not offer features that would limit implementations. For example, since it relies on the OS to take care of storage, it runs equally happily on anything from an IDE disk to a fibre array (or whatever the highest-end technology available is). One need not wait for a storage driver from the database vendor.

Fewer unneeded features == cleaner implementation.

iii) Since a table is split into segments of at most 1 GB each, there is no practical upper limit on table size, and the segments can be spread across storage devices etc. (hand tweaking, basically).
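To illustrate point iii): on the versions I have looked at, each table lives under base/<database OID>/ as files named <relfilenode>, <relfilenode>.1, <relfilenode>.2 and so on, each capped at 1 GB. A minimal sketch of checking that on disk (the data directory path and both OIDs below are hypothetical placeholders; the real values come from pg_database and pg_class):

import os

PGDATA = "/var/lib/pgsql/data"   # assumed data directory
DB_OID = "16384"                 # hypothetical database OID (pg_database.oid)
RELFILENODE = "16385"            # hypothetical table file node (pg_class.relfilenode)

table_dir = os.path.join(PGDATA, "base", DB_OID)
segments = sorted(
    f for f in os.listdir(table_dir)
    if f == RELFILENODE or f.startswith(RELFILENODE + ".")
)

total = 0
for seg in segments:
    size = os.path.getsize(os.path.join(table_dir, seg))
    total += size
    print("%s: %.2f GB" % (seg, size / (1024.0 ** 3)))   # each file stays <= 1 GB

print("table size: %.2f GB in %d segments" % (total / (1024.0 ** 3), len(segments)))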

Some weak points (these guys are considering distributed databases, but won't mind spending on hardware if the proposal is worth it):

i) A database cannot span multiple machines, so clustering is out. If the data is split into multiple databases on multiple machines, the application will have to do the merging etc. (see the sketch after this list).

Any pointers on this? Partitioning etc.?

ii) No *out-of-the-box* replication. (OK, I could drop this point, but when MySQL pops up I have to include it for a fair comparison.)

iii) Being a process-per-connection architecture, it cannot process a single query in parallel even when that would be possible. E.g. say a table is 100 GB in size and hence split across 100 segments on the file system; it cannot return data from all 100 segments simultaneously, because there is one process per connection. Nor can it spread the computational load of one query across multiple CPUs. (The workaround is to parallelise at the application level over several connections; again, see the sketch after this list.)
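
Since weak points i) and iii) point at the same workaround, here is a rough sketch of how the application-level split/merge could look: the data is assumed to be partitioned across a few databases/hosts, each partition gets its own connection (and hence its own backend process) driven from its own thread, and the partial aggregates are merged in the application. The host names, database names, table, columns and the psycopg2 driver are all hypothetical choices for illustration, not part of the actual design.

import threading
import psycopg2  # assumed DB-API driver; any PostgreSQL driver would do

# Hypothetical partitions: one database per machine.
PARTITIONS = [
    {"host": "dbhost1", "dbname": "logs_2002q1"},
    {"host": "dbhost2", "dbname": "logs_2002q2"},
    {"host": "dbhost3", "dbname": "logs_2002q3"},
]

# Hypothetical per-partition aggregate over a call-log table.
QUERY = "SELECT exchange_id, count(*) FROM call_log GROUP BY exchange_id"

results = []
lock = threading.Lock()

def fetch(part):
    # One connection == one backend process on that partition's server.
    conn = psycopg2.connect(host=part["host"], dbname=part["dbname"])
    try:
        cur = conn.cursor()
        cur.execute(QUERY)
        rows = cur.fetchall()
        with lock:
            results.extend(rows)   # partial aggregates from this partition
    finally:
        conn.close()

threads = [threading.Thread(target=fetch, args=(p,)) for p in PARTITIONS]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Merge step: re-aggregate the partial counts from all partitions.
merged = {}
for exchange_id, count in results:
    merged[exchange_id] = merged.get(exchange_id, 0) + count

for exchange_id, count in sorted(merged.items()):
    print("%s %d" % (exchange_id, count))

Crude, but it shows that the one-process-per-connection limitation can be worked around when the workload partitions cleanly.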

3) How to do it?

a) Get a 64-bit architecture. Depending upon computational requirements and the projected number of connections, add CPUs. Start with one CPU; I guess that would be enough, given how good PA-RISC CPUs are. (OK, I haven't used anything else, but PA-RISC CPUs look good to me for data manipulation.)

b) Get loads of RAM. 4-8 GB sounds good to me, maybe even more.

On this topic: say I decide to devote 8 GB to Postgres; that means 1,048,576 buffers at the 8 KB page size. Do I need to push the kernel shared memory settings beyond this value? Can I tweak the page size, and if so, would it help for such an installation? (A rough arithmetic sketch follows after this list.)

c) Get *fast* storage. I would rather not elaborate on this point, because the client, being in the telecom business, is supposed to know better; I would offer my inputs based on the information from Scott etc., but I don't think it will come to that.
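
For point b) above, the buffer arithmetic I have in mind, as a rough sketch: the 8 GB target and the 10% headroom are assumptions, and the kernel tunable name differs per OS (e.g. shmmax on HP-UX or Linux). As far as I know the page size (BLCKSZ) is a compile-time constant, so changing it means rebuilding the server.

BLOCK_SIZE = 8 * 1024               # default PostgreSQL page size (8 KB)
TARGET_BUFFER_MEM = 8 * 1024 ** 3   # assumed 8 GB devoted to shared buffers

shared_buffers = TARGET_BUFFER_MEM // BLOCK_SIZE
print("shared_buffers = %d" % shared_buffers)    # 1048576 buffers

# The kernel shared memory limit must cover the buffer pool plus
# PostgreSQL's other shared structures, so allow some headroom (assumed 10%).
min_shared_mem = int(TARGET_BUFFER_MEM * 1.10)
print("kernel shared memory limit >= %d bytes" % min_shared_mem)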

Any suggestions? Modifications? This is a rough plan; I will finalise it this evening or tomorrow morning. (I am from India, BTW, just to let you know my time zone. ;-)) I hope I haven't missed any point I have thought of in the last two days.

Once again, thanks a lot for the help offered. I cannot put into words how on point this thread has been.

Bye
Shridhar

--
DeVries' Dilemma: If you hit two keys on the typewriter, the one you don't want
hits the paper.
