Quick Links

Key features for data warehousing

From:	"Reiter, Oliver" <Oliver(dot)Reiter3(at)Dresdner-Bank(dot)com>
To:	"'pgsql-general(at)postgresql(dot)org'" <pgsql-general(at)postgresql(dot)org>
Subject:	Key features for data warehousing
Date:	2003-04-09 11:38:15
Message-ID:	62A6A06E481DD6119E350008C7A47F9D7877FF@ffz00zm1.ffz00e.mail.dresdner.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Hi,

I would love to recommend PG to our management, mostly because
MVCC does not have the severe problems with long transactions in
high-concurrency environments that locking databases have.

Here are the reasons why I can't:

1. The CLUSTER statement allows one-time physical row reordering.
But a real clustered index, available in DB2, Sybase and Informix,
maintains the clustering permanently and uses less space than
an ordinary index (which copies the key columns).
This is on the TODO list, but may be difficult to do with MVCC.
However I wouldn't miss it at all if we could have MDC instead!

2. MDC (Multi-Dimensional Clustering) is great for data warehousing.
For keys from relatively small finite sets, MDC allows efficient range
queries and joins in multiple dimensions, maintaining the cluistering
without explicit reorganisation. Key vectors are not stored repeatedly.

http://www7b.software.ibm.com/dmdd/library/techarticle/0207huras/0207huras.h
tml

http://216.239.39.100/search?q=cache:RGwncC6_dxUC:www-8.ibm.com/software/au/
universe/download/db2/id112.pdf+db2+multidimensional+clustering+organized&hl
=en&ie=UTF-8

3. I believe the row overhead is 40 bytes (recently reduced to 36 or 32 ?).
This is WAY too much for fact tables with many small rows. High-end
storage is pretty expensive, and less rows per page means more I/O.
I believe Sybase and Informix can do with 4, and Interbase / Firebird
(using MVCC) with 16 bytes per row.

Could anyone kindly explain why PostgreSQL needs this much space?
+ We need a pointer to the row (4 bytes) - unless the row size is fix.
+ There is the OID - unless deactivated in the table definition.
(I always wished OIDs were off by default, for better portability.)
+ For MVCC, each row must be associated with a transaction number
number or version - unless it is not involved in any transaction,
which
is usually true for most rows, and after a restart for all. There
must be
a way to recognize obsolete versions in VACUUM or after a crash, but
why isn't a tristate enough (committed/uncommitted/rolled back) ?
+ Why not hold the "version map" in a system table, only for versions
that are valid and not involved in any transactions?
+ Why not hold the "version map" in RAM, swapping or explicitly
paging parts to the disk when necessary?
+ And finally, what is all the rest for?

4. Unsexy but badly needed: Replication.

I would happily drop other alternatives if Replication and MDC were
implemented and the row overhead reduced to 16 bytes or less.

Looking forward to your answers and replies.
Best regards, Oliver

Responses

Re: Key features for data warehousing at 2003-04-09 17:31:56 from Andrew Sullivan
Re: Key features for data warehousing at 2003-04-10 00:38:50 from Martijn van Oosterhout

Browse pgsql-general by date

	From	Date	Subject
Next Message	Richard Huxton	2003-04-09 11:39:14	Re: aaagh... postgres is segfaulting?
Previous Message	Richard Huxton	2003-04-09 11:29:21	Re: constraint performance