Re: SQL - planet redundant data

From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Poul Jensen <flyvholm(at)gfy(dot)ku(dot)dk>
Cc: Chris Travers <chris(at)verkiel(dot)metatrontech(dot)com>, tgl(at)sss(dot)pgh(dot)pa(dot)us, john(at)mitre(dot)org, pgsql-general(at)postgresql(dot)org
Subject: Re: SQL - planet redundant data
Date: 2005-09-18 00:31:11
Message-ID: 20050918003110.GQ7630@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sun, Sep 11, 2005 at 11:00:02PM -0800, Poul Jensen wrote:
> Tom Lane wrote:
>
> >No, tableoid is sort of a virtual column ... it doesn't exist on disk.
> >When you query it you get a value fetched from the internal data
> >structure representing the table.
> >
> >
> So virtual columns are possible - THIS is a way to clear redundant data!
> Is it possible for a user to create a virtual column? If not, this would
> make a big improvement.
>
> What I really need are "partial virtual columns". I'm imagining an
> alternative version of VACUUM ANALYZE that could do the following:
> 1) Order the rows in the table so that for each column, identical values
> are placed next to each other for as far as possible (the row order that
> optimizes one column will probably not be optimal for other columns).
> 2) For each column, identify the stretches that contain only one
> distinct value. Save that value together with ID of start and end row
> and delete stretch.
> It is not obvious how to do a perfect optimization process in 1), at
> least not to me - I'm sure a skilled mathematician would know exactly
> how to do it. But here's a simple approach that would get us part of the
> way:
> 1.1) Grab the column w. most redundancy (fewest distinct values) and
> sort it into groups according to the distinct values.
> 1.2) For each of these groups, grab the column w. next most redundancy
> and sort into groups according to the distinct values.
> And so on. Stop whenever groups become so small that there's nothing to
> gain.
> Such an analysis would make it much less expensive to combine
> same-schema tables, and having everything in the same table is really
> convenient. It would obviously save a lot of storage space, but I
> imagine it would enable more efficient queries too - having to check
> just 3 values instead of the thousands (or even millions) they may
> replace must give a considerable gain.
>
> 'What is the big benefit of not having ordered rows? I imagine it could
> be a disadvantage for dynamic databases, but for a static database like
> mine which won't be modified, except for maybe adding new data once a
> year, I imagine an optimization including row ordering could be highly
> beneficial.

Oracle supports something akin to this. On an Index Organized Table you
can tell it to pull part of the index key out of individual rows. IE:

CREATE TABLE (a, b, c, d, e, f)
PRIMARY KEY(a, b, c, d)
INDEX ORGANIZED COMBINE(a, b)
;

In this case, every time the combination of (a, b) changes, Oracle
stores a special record of some kind that indicates the change, and it
doesn't store a or b with each row. (Note that I'm guessing on syntax,
it's been a while since I've used Oracle).

> ##############################
>
> Jim C. Nasby wrote:
>
> >What you seem to be looking for is a form of partitioning. PostgreSQL
> >doesn't currently support partitioning of this form, but there's work in
> >progress to change that.
> >
> >
> Any idea how far out in the future this is? Would it make the
> optimization process described above (reply to Tom Lane) obsolete? Well,
> maybe my ideas about an optimal solution just illustrate lack of
> knowledge about SQL, but I'm hoping somebody can see what I'm trying to
> suggest.

Well, the firslt important step is slated for 8.1; effective partition
elimination. You can search for more info, especially in the BizGres
archives. Next step is creating actual partitioning syntax to make
managing partitions easier.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Warren Bell 2005-09-18 01:45:35 PDF Documentation?
Previous Message Martijn van Oosterhout 2005-09-17 22:04:47 Re: Duplicate Values or Not?!