Re: improvement suggestions for performance design

From: "Kalle Hallivuori" <kato(at)iki(dot)fi>
To: "Thomas Finneid" <tfinneid(at)ifi(dot)uio(dot)no>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: improvement suggestions for performance design
Date: 2007-07-06 11:30:30
Message-ID: c637d8bb0707060430x1656a796y728c88e9e2103ea5@mail.gmail.com

Hi Thomas & all,

2007/7/6, Thomas Finneid <tfinneid(at)ifi(dot)uio(dot)no>:
>
> Heikki Linnakangas wrote:
> > ISTM that a properly normalized schema would look something like this:

[example of tables per attr referencing main table containing only primary key]

> I agree that this is a way it could be done.

Indeed. Another way is PostgreSQL-specific:

-- "position" is a reserved word, so quote it (or pick another name)
create table "position" (
posX int not null,
posY int not null
);

create table population ( population int not null ) inherits ("position");
-- optionally:
create unique index population_idx on population (posX, posY, population);

This leads to each attr table inheriting posX, posY from position; you
never insert anything to position itself, but you can use it to list
all positions that have any attributes in any of the inherited tables
(in that sense it's a view over all its children).
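For example (the temperature table here is hypothetical, just to
show a second attribute group next to population):

create table temperature ( temperature float not null )
    inherits ("position");

-- scans population, temperature and any future children
select distinct posX, posY from "position";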

> In any case, there is no point in having one table per attribute, as
> some attributes are logically grouped and can therefore be grouped
together in the table. Since there are 5-10 groups of attributes, 5-10
> tables would be enough.

This sounds very sensible. This way you send a single table (or
procedure, or prepared statement) name instead of one attr_type per
attribute in the group.

So instead of calling 'your_procedure(type, posX, posY, data_type,
value)' separately for each of the 5 values, you would call
'attrgroup_and_datatype_specific_procedure(posX, posY, value1, value2,
value3, value4, value5)'. Inside the procedure, the insert changes from
'insert into attribute_values values (type, posX, posY, data_type,
value)' to 'insert into attrgroup_and_datatype_specific_table values
(posX, posY, value1, value2, value3, value4, value5)' -- so you save
four inserts, and each value carries on average 2/5 of an extra field
instead of 4. You are allowed to use shorter names for the tables and
procedures ;)
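
A minimal sketch of such a procedure (the table and function names,
and a group of five int attributes, are assumptions for illustration):

create table group1_values (
    posX int not null,
    posY int not null,
    value1 int, value2 int, value3 int, value4 int, value5 int
);

create or replace function insert_group1(
    p_posX int, p_posY int,
    v1 int, v2 int, v3 int, v4 int, v5 int
) returns void as $$
begin
    -- one insert per position instead of five
    insert into group1_values
        values (p_posX, p_posY, v1, v2, v3, v4, v5);
end;
$$ language plpgsql;

-- usage:
select insert_group1(1, 2, 10, 20, 30, 40, 50);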

It should be trivial to hide this separation in the client; you could
even create new tables for new kinds of attribute-datatype combinations
automatically on the fly.
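
Something like this (a sketch only; the fixed five-int column list is
an assumption):

create or replace function ensure_attr_table(t_name text)
returns void as $$
begin
    -- crude existence check; a production version would also
    -- check the schema, not just the relation name
    perform 1 from pg_class where relname = t_name;
    if not found then
        execute 'create table ' || quote_ident(t_name)
             || ' (posX int not null, posY int not null,'
             || ' value1 int, value2 int, value3 int,'
             || ' value4 int, value5 int)';
    end if;
end;
$$ language plpgsql;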

> They have been treating their data this way for the last 20 years, and
> there is nothing on the horizon that tells neither them nor me that it
> will be any different the next 10 years. So I am not sure I need to plan
> for that.

Is it possible that normalization was skipped originally for lack of
resources, lack of knowledge of the nature of the data to be imported,
or lack of dynamism in the original tools (such as creating
type-specific tables on the fly)? Those capabilities would now be
available, or at least worth a second look.

> > A normalized schema will also take less space, which means less I/O and
> > more performance,
>
> That is what I am trying to find out, if it is true for this scenario as
> well.

Well, the current schema pays four extra fields for every single
value, while the grouped schema needs only two per row of 5-10 values:
for a group of five, that's 5 x 4 = 20 extra fields versus 2.

If you happen to store numerical data as text in the value field for
some data_types, you are losing a lot more.

> > because there's no need to store metadata like the
> > data_type, attr_type on every row.
>
> data_type and attr_type are not decorative meta_data, they are actively
> used as query parameters for each attribute, if they where not there I
> would not be able to perform the queries I need to do.

You can still express them as table or column names rather than extra
data per row.

> > Before you start fiddling with functions, I'd suggest that you try
> > batching the inserts with the JDBC PreparedStatement batch facility.
>
> I have done that, now I need to have something to compare it against,
> preferably a function written in plpgsql and one in c.
> So any other suggestions on how to efficiently bulk transfer the data to
> the db for insertion?

COPY is many times faster than INSERT:
http://www.postgresql.org/docs/8.1/interactive/sql-copy.html

If you can't push the data straight into the final table with COPY,
push it into a temporary table and then work through it with a
database procedure.
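
For example (the staging table and file path are illustrative; note
that COPY from a server-side file needs appropriate permissions):

create temporary table staging (
    posX int, posY int,
    value1 int, value2 int, value3 int, value4 int, value5 int
);

-- tab-delimited text is the default format
copy staging from '/path/to/batch.dat';

-- then move the whole batch into the final table in one statement
insert into group1_values select * from staging;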

Shameless plug: If you use Java and miss COPY functionality in the
driver, it's available at

http://kato.iki.fi/sw/db/postgresql/jdbc/copy/

I was able to practically nullify time spent inserting with that.

> Well, it has been tested and showed to make postgres perform much
> better, ie. 100 000 inserts separated between 4 threads performed much
> faster than with a single thread alone.

Sounds interesting. The results still end up in the same table, so are
you sure this didn't consume the same total time at the server end --
and would that even matter to you?

We got the best results with sequential batches of around 10 000 rows
each.

> > BTW, I concur with Y Sidhu that with data volumes as high as you have,
> > partitioning is a good idea.
>
> Yes, I will be looking into to it.

Depending on the distribution of your data, saving each attribute
group (or datatype, or both) to its own table will take you some way
in the same direction.

If you have no indexes and do no deletes (as seems to be the case for
you), table size might not matter much.

It might make sense in your case, though, to name tables with times,
like attribute_values_YYYYMMDD, and create a new table for each chosen
period, be it month, day or even per batch. (It takes a few seconds to
create a table though.)
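
A minimal sketch (the table names and the daily period are
assumptions):

-- empty parent table; the children hold the actual rows
create table attr_values_parent (
    posX int not null, posY int not null,
    value1 int, value2 int, value3 int, value4 int, value5 int
);

-- one child per period, created as each period begins
create table attr_values_20070706 () inherits (attr_values_parent);

-- queries against the parent automatically cover all children
select count(*) from attr_values_parent;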

To keep viewing the data the way your customer is used to, you can
hide the separation of data into partitions by inheriting each
partition from an identical ancestor table, which then serves as a
view over all its children -- but that's already explained in the
docs. Separation into tables by attribute group you would have to hide
with a view or with procedures, preferably server-side.
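
A sketch of such a view, reconstructing the original
one-row-per-attribute shape (table and column names are illustrative):

create view attribute_values_all as
    select posX, posY, 'population' as attr_type,
           population as value
      from population
    union all
    select posX, posY, 'temperature', temperature
      from temperature;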

Cheers,

--
Kalle Hallivuori +358-41-5053073 http://korpiq.iki.fi/
