Re: improvement suggestions for performance design

From: Thomas Finneid <tfinneid(at)ifi(dot)uio(dot)no>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: improvement suggestions for performance design
Date: 2007-07-05 22:23:36
Message-ID: 468D6F68.7090903@ifi.uio.no
Lists: pgsql-performance


Heikki Linnakangas wrote:
> ISTM that a properly normalized schema would look something like this:
>
> create table position (
> posX int not null,
> posY int not null,
> primary key (posX, posY)
> );
>
> create table colour (
> posX int not null,
> posY int not null,
> colour varchar(50) not null,
> primary key (posX, posY),
> foreign key (posX, posY) references position (posX, posY)
> );
>
> create table population (
> posX int not null,
> posY int not null,
> population int not null,
> primary key (posX, posY),
> foreign key (posX, posY) references position (posX, posY)
> );

I agree that this is one way it could be done.

> where colour and population are examples of attributes you want to
> store. If you have 100 different attributes, you'll have 100 tables like
> that. That may sound like a lot, but it's not.

In any case, there is no point in having one table per attribute, since
some attributes are logically grouped and can therefore be stored
together in one table. As there are 5-10 groups of attributes, 5-10
tables would be enough.
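
For illustration, one such grouped table could look like the sketch
below (the group name and its columns are made up here, not the real
attributes):

create table visual_attrs (
    posX   int not null,
    posY   int not null,
    colour varchar(50) not null,
    shade  int not null,
    -- one table per logical group of attributes, not one per attribute
    primary key (posX, posY)
);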

>
> This allows you to use proper data types for the attributes, as well as
> constraints and all the other goodies a good relational data model gives
> you
>
> It also allows you to build proper indexes on the attributes. For
> example, if you store populations as text, you're going to have a hard
> time building an index that allows you to query for positions with a
> population between 100-2000 efficiently.

Performing queries on the attribute values is of no interest, so that
does not matter.

> These are all imaginary examples, but what I'm trying to point out here
> is that a proper relational schema allows you to manage and query your
> data much more easily and with more flexibility, allows for future
> extensions.

They have been treating their data this way for the last 20 years, and
there is nothing on the horizon suggesting, to them or to me, that it
will be any different for the next 10 years. So I am not sure I need to
plan for that.

> A normalized schema will also take less space, which means less I/O and
> more performance,

That is what I am trying to find out: whether it is true for this
scenario as well.

> because there's no need to store metadata like the
> data_type, attr_type on every row.

data_type and attr_type are not decorative metadata; they are actively
used as query parameters for each attribute. If they were not there, I
would not be able to perform the queries I need to do.
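
For illustration, a typical lookup is along these lines (the table and
value column are approximations, not the exact schema):

select posX, posY, value
from attribute_values
where attr_type = ? and data_type = ?;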

> For performance reasons, you might
> actually want to not store the position-table at all in the above schema.
>
> An alternative design would be to have a single table, with one column
> per attribute:
>
> create table position (
> posX int not null,
> posY int not null,
> colour varchar(50),
> population int,
> ...
> primary key (posX, posY)
> )
>
> This is more space-efficient, especially if you have a lot of attributes
> on same coordinates. You can easily add and drop columns as needed,
> using ALTER TABLE.
>
>> I am still looking into the design of the tables, but I need to get a
>> proper test harness running before I can start ruling things out. And a
>> part of that, is for example, efficient ways of transferring the insert
>> data from the client to the db, instead of just single command inserts.
>> This is where bulk transfer by arrays probably would be preferable.
>
> Before you start fiddling with functions, I'd suggest that you try
> batching the inserts with the JDBC PreparedStatement batch facility.

I have done that; now I need something to compare it against,
preferably a function written in plpgsql and one in C.
So, any other suggestions on how to efficiently bulk-transfer the data
to the db for insertion?
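
For the plpgsql variant I have in mind something like the sketch below:
a function that takes parallel arrays and loops over them server-side,
so an entire batch travels in a single call (the population table from
the schema above is used just as a placeholder):

create or replace function bulk_insert_population(xs int[],
                                                  ys int[],
                                                  pops int[])
returns void as $$
declare
    i int;
begin
    -- one INSERT per element, but only one client/server round trip
    for i in 1 .. array_upper(xs, 1) loop
        insert into population (posX, posY, population)
        values (xs[i], ys[i], pops[i]);
    end loop;
end;
$$ language plpgsql;

COPY would be the other obvious candidate to compare against, since it
is generally the fastest way to get bulk data into the database.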

> Splitting the inserts into multiple threads in your application sounds
> messy.

Well, it has been tested and shown to make Postgres perform much
better, i.e. 100 000 inserts split between 4 threads completed much
faster than they did with a single thread.

> BTW, I concur with Y Sidhu that with data volumes as high as you have,
> partitioning is a good idea.

Yes, I will be looking into it.
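
As I understand it, on this version of Postgres that means
inheritance-based partitioning: child tables with CHECK constraints,
plus constraint_exclusion so the planner can skip partitions. A minimal
sketch, assuming (purely as an example) that the data splits on posX
ranges:

create table position_p0 (
    check (posX >= 0 and posX < 1000)
) inherits (position);

create table position_p1 (
    check (posX >= 1000 and posX < 2000)
) inherits (position);

-- let the planner prune partitions whose CHECK excludes the query
set constraint_exclusion = on;

Rows then have to be routed to the right child table, either by the
application or by a rule or trigger on the parent.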

regards

thomas
