Re: Compare rows

From: Greg Spiegelberg <gspiegelberg(at)cranel(dot)com>
To: PgSQL Performance ML <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Compare rows
Date: 2003-10-08 19:10:53
Message-ID: 3F84613D.8040207@cranel.com
Lists: pgsql-performance

Josh Berkus wrote:
> Greg,
>
>
>>The data represents metrics at a point in time on a system for
>>network, disk, memory, bus, controller, and so-on. Rx, Tx, errors,
>>speed, and whatever else can be gathered.
>>
>>We arrived at this one 642 column table after testing the whole
>>process from data gathering, methods of temporarily storing then
>>loading to the database. Initially, 37+ tables were in use but
>>the one big-un has saved us over 3.4 minutes.
>
>
> Hmmm ... if few of those columns are NULL, then you are probably right ...
> this is probably the most normalized design. If, however, many of the columns
> are NULL the majority of the time, then the design you should be using is a
> vertical child table, of the form ( value_type | value ).
>
> Such a vertical child table would also make your comparison between instances
> *much* easier, as it could be executed via a simple 4-table-outer-join and 3
> where clauses. So even if you don't have a lot of NULLs, you probably want
> to consider this.

You lost me on that one. What's a "vertical child table"?
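
For reference, the ( value_type | value ) form described above might be
sketched as follows. This is only an illustration, not the design anyone in
this thread actually runs: the table and column names (sample, sample_value,
system_name, taken_at) and both helper functions are hypothetical, and it
uses Python's built-in sqlite3 module rather than PostgreSQL so it can be run
stand-alone. The comparison uses UNION ALL with GROUP BY instead of the outer
join mentioned above, purely to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per collected sample (all names here are hypothetical).
    CREATE TABLE sample (
        sample_id   INTEGER PRIMARY KEY,
        system_name TEXT NOT NULL,
        taken_at    TEXT NOT NULL
    );
    -- One row per metric actually gathered; a metric that was not
    -- gathered simply has no row, so no NULLs need to be stored.
    CREATE TABLE sample_value (
        sample_id  INTEGER NOT NULL REFERENCES sample(sample_id),
        value_type TEXT NOT NULL,
        value      TEXT NOT NULL,
        PRIMARY KEY (sample_id, value_type)
    );
""")

def store_sample(sample_id, system_name, taken_at, metrics):
    """Store one sample, skipping metrics that have no value."""
    conn.execute("INSERT INTO sample VALUES (?, ?, ?)",
                 (sample_id, system_name, taken_at))
    conn.executemany(
        "INSERT INTO sample_value VALUES (?, ?, ?)",
        [(sample_id, k, str(v)) for k, v in metrics.items() if v is not None],
    )

def differing_types(a, b):
    """Metric names whose value differs, or that exist in only one sample."""
    rows = conn.execute("""
        SELECT value_type FROM (
            SELECT value_type, value FROM sample_value WHERE sample_id = ?
            UNION ALL
            SELECT value_type, value FROM sample_value WHERE sample_id = ?
        )
        GROUP BY value_type
        HAVING COUNT(*) = 1 OR MIN(value) <> MAX(value)
        ORDER BY value_type
    """, (a, b)).fetchall()
    return [r[0] for r in rows]
```

With this layout, two samples are identical when differing_types() returns
an empty list, regardless of how many metrics each one carries.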

Statistically, about 6% of the rows use 200 or more of the columns,
27% of the rows use 80-199 columns, 45% of the rows use 40-79
columns and the remaining 22% of the rows use 39 or fewer of the columns.
That is a lot of NULLs. I never gave that much thought.

To keep queries efficient, hide the NULLs, and simulate the multiple
tables, I have a boatload of indexes, ensure that every query makes use
of an index, and have created 37 views. It's worked pretty well so
far.

>>The reason for my initial question was this. We save changes only.
>>In other words, if system S has row T1 for day D1 and if on day D2
>>we have another row T1 (excluding our time column) we don't want
>>to save it.
>
>
> If re-designing the table per the above is not a possibility, then I'd suggest
> that you locate 3-5 columns that:
> 1) are not NULL for any row;
> 2) combined, serve to identify a tiny subset of rows, i.e. 3% or less of the
> table.

There are always, always, always 7 columns that contain data.

> Then put a multi-column index on those columns, and do your comparison.
> Hopefully the planner should pick up on the availability of the index and scan
> only the rows retrieved by the index. However, there is the distinct
> possibility that the presence of 637 WHERE criteria will confuse the planner,
> causing it to resort to a full table seq scan; in that case, you will want to
> use a subselect to force the issue.

What I'm trying to avoid is a big WHERE (c1,c2,...,c637) <>
(d1,d2,...,d637) clause. Ugly.
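
One way to take some of the ugliness out is to generate the comparison from
a column list instead of writing it by hand. The sketch below is only an
illustration under stated assumptions: KEY_COLUMNS stands in for the
always-populated columns, DATA_COLUMNS for the sparse ones, and the table
name, index name, and already_stored() helper are all hypothetical. It runs
against Python's built-in sqlite3 module, where "col IS ?" is a NULL-safe
equality test; current PostgreSQL offers IS NOT DISTINCT FROM for the same
purpose.

```python
import sqlite3

# Hypothetical stand-ins: the real table has 7 always-populated columns
# and several hundred sparse ones.
KEY_COLUMNS = ("system_name", "device")
DATA_COLUMNS = ("rx", "tx", "errors")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (taken_at TEXT NOT NULL, "
    + ", ".join(f"{c} TEXT NOT NULL" for c in KEY_COLUMNS) + ", "
    + ", ".join(f"{c} INTEGER" for c in DATA_COLUMNS) + ")"
)
# Multi-column index on the always-populated columns, so the lookup only
# has to examine the small subset of rows that share those values.
conn.execute(
    f"CREATE INDEX metrics_key_idx ON metrics ({', '.join(KEY_COLUMNS)})"
)

# Build the WHERE clause once: plain equality on the key columns,
# NULL-safe equality (IS) on the nullable data columns.
MATCH_SQL = (
    "SELECT 1 FROM metrics WHERE "
    + " AND ".join(f"{c} = ?" for c in KEY_COLUMNS)
    + " AND "
    + " AND ".join(f"{c} IS ?" for c in DATA_COLUMNS)
    + " LIMIT 1"
)

def already_stored(row):
    """True if a row with the same values (timestamp excluded) exists."""
    params = [row[c] for c in KEY_COLUMNS] + [row.get(c) for c in DATA_COLUMNS]
    return conn.execute(MATCH_SQL, params).fetchone() is not None
```

The statement is still long once generated, but nobody has to type or
maintain it, and the timestamp column is left out of the comparison by
construction.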

> Or, as Joe Conway suggested, you could figure out some kind of value hash that
> uniquely identifies your rows.

I've given that some thought and, though appealing, I don't think I'd care
to spend the CPU cycles to do it. The best way I can figure to accomplish
it would be to generate an MD5 on each row without the timestamp and
store it in another column, create an index on the MD5 column, and generate
an MD5 on each line I want to insert. Makes for a simple WHERE...
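
A minimal sketch of that MD5 approach, with several assumptions labeled: the
three-column metrics table, the row_md5 column, the index name, and the
insert_if_new() helper are hypothetical, and "already saved" is taken to
mean "some stored row for the same system has the same hash", which may not
be the exact rule wanted. It uses Python's hashlib and sqlite3 modules so it
runs stand-alone.

```python
import hashlib
import sqlite3

DATA_COLUMNS = ("rx", "tx", "errors")  # hypothetical stand-ins

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metrics (
        system_name TEXT NOT NULL,
        taken_at    TEXT NOT NULL,
        rx INTEGER, tx INTEGER, errors INTEGER,
        row_md5     TEXT NOT NULL
    );
    CREATE INDEX metrics_md5_idx ON metrics (system_name, row_md5);
""")

def row_md5(row):
    """MD5 over the data columns in a fixed order, ignoring the timestamp."""
    parts = []
    for col in DATA_COLUMNS:
        value = row.get(col)
        # Tag NULLs so that NULL and the string "None" cannot hash alike.
        parts.append("N" if value is None else "V" + str(value))
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()

def insert_if_new(system_name, taken_at, row):
    """Insert the row unless the same system already has an identical hash."""
    digest = row_md5(row)
    exists = conn.execute(
        "SELECT 1 FROM metrics WHERE system_name = ? AND row_md5 = ? LIMIT 1",
        (system_name, digest),
    ).fetchone()
    if exists:
        return False
    conn.execute(
        "INSERT INTO metrics VALUES (?, ?, ?, ?, ?, ?)",
        (system_name, taken_at,
         row.get("rx"), row.get("tx"), row.get("errors"), digest),
    )
    return True
```

The hash is computed once per incoming line, so the CPU cost is paid on
insert only, and the lookup becomes a single indexed equality test. If hash
collisions are a concern, a hit on the hash could be followed by a full
column comparison of just the matching row.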

Okay. I'll give it a whirl. What's one more column, right?

Greg

--
Greg Spiegelberg
Sr. Product Development Engineer
Cranel, Incorporated.
Phone: 614.318.4314
Fax: 614.431.8388
Email: gspiegelberg(at)Cranel(dot)com
Cranel. Technology. Integrity. Focus.
