Re: Matching unique primary keys

From: Richard Huxton <dev(at)archonet(dot)com>
To: Kevin Old <kold(at)carolina(dot)rr(dot)com>
Cc: pgsql <pgsql-general(at)postgresql(dot)org>
Subject: Re: Matching unique primary keys
Date: 2002-10-23 19:02:44
Message-ID: 200210231902.44191.dev@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wednesday 23 Oct 2002 4:39 pm, Kevin Old wrote:
> Richard,
>
> Thanks for your reply. I realized how confusing my post sounds, so I
> will attempt to clear things up.

Not confusing, but perhaps a little short. It's a tricky balancing act though
- 8K of problem and people will skip past.

> > > I have a table of data that is very large (502 columns) and whenever I
> > > query it, I only query on up to 4 columns and on average only 20 fields
> > > are returned. I've come up with a way to increase the speed of the
> > > queries. I could could put the 20 fields that are searched and
> > > returned into a separate table and then link them by a unique id.
> > > Sounds easy......but I'm stuck on something.
> > >
> > > I am dealing with about 1 million records per day.

> On average I store 50,000 - 150,000 records per hour.
>
> The type of data I am storing is cellular telephone records. I receive
> a datafeed of records in pipe delimited format, which I am to make
> searchable to users via a web interface. While I have this working, the
> search time for a few records over 2 hours is around 35 seconds.

Yuck - are these aggregate queries ("show me the totals per hour") or detail
queries ("give me a list of all calls between 0800 and 1130")?

> We are running a Sun 450 with 4 800Mhz processors and 2GB of RAM.

Good - someone will know enough about that setup to comment usefully. Me,
sorry no.

> > I'd think it's unlikely that your table should (in design terms) be over
> > 500 columns wide. If lots of these values are defaults or nulls then
> > that's a sign that things need to be split up.

> I agree with the design terms that a table should reach 500+ columns.
> But, they track a ton of data about every cell phone call. Yes, a lot
> of the fields are zeros, but they track things like electronics, etc and
> will only show data if something is wrong.

If 99% of your data is saying nothing, don't record it. Split it out to a
separate table and only record a row where there's something useful. If you
had to do it by hand, watching a big bank of lights and writing things down
in a book you'd only bother if they were red.

> I have done testing with
> placing the 20 coulmns in a separate table and searching on that table
> is lightening fast on the same search as above that took 35 seconds. It
> was so fast I had to have my code run the query 100 times in order to
> get a vaild Benchmark time from the Perl module!!

Right - so you've got your solution then.

> My main problem is the best way to track the data once it is broken up
> into separate tables. Possible solutions I have are:
>
> 1) Copy the 3 "unique fields" into every table (as you mention above).
> My only question with this is how am I to do a join on a multicolumn
> primary key? The only thing I can think of would be something like:
>
> SELECT *
> FROM data_index, data_tbl
> WHERE data_index.id1 = data_tbl.id1
> AND data_index.id2 = data_tbl.id2
> AND data_index.id3 = data_tbl.id3
>
> Seems inefficient to me to do it this way because of the 3 joins. Maybe
> I'm wrong?

No - that's what you do. It's slower than joining on a single int4, but the
big advantage is that each table then means something by itself. The
"secondary" tables can be split out/exported/whatever and you can see at a
glance what they refer too.

Having said that, 3 columns means more typing and plenty of people replace
multi-column keys with an auto-incrementing ID. Keep the unique index on the
3 columns though.

> 2) Put a sequence on the data_index (20 fields) and the data_tbl (all
> other fields), storing one hour of data per table. The sequence would
> be a int4 on each table. The appropriate data would be inserted in
> their respective tables and the sequences would be kept on the same
> number.
>
> I have a few questions on the sequence issue. My data is only inserted
> once. No updates. I want to be able to store and search as much data
> as I can.
>
> There are 3 ways I see to store this data:
>
> 1) I can have a master index table where the 20 fields will be stored
> for all hours. This will become a large table quickly, but will only be
> 20 fields long.
>
> This will allow me to query one table for everything, but when data from
> the the other 480 fields is needed I'd have to somehow link the data
> from the master index table to the appropriate data table....could get
> messy.

Not necessarily - if there are patterns to queries, you can create a view so
it looks like a simple query to the application. SQL is forgiving on
whitespace too, so laying out a query over several lines with indents helps a
lot too.

BTW - 480 fields still seems very wide - why not split it into half a dozen
tables based on what the fields mean?

> 2) I can have a table to keep up with the date, time, hourly index
> table, and hourly data table. I can use this as an aid in building my
> unioned query. The tables are named with date and hour in them.
> Assuming the table looks like:
>
> date date,
> time time,
> hr_idx_tbl char(16),
> hr_data_tbl char(16)
>
> Building a [union] query could be like:
>
> SELECT hr_idx_tbl
> FROM master_tbl_idx
> WHERE CONCAT(DATE," ",TIME) BETWEEN ("2002/10/23 04:00:00" AND
> "2002/10/23 09:00:00")
>
> The results would be index tables that I'd need to query. I'd then
> generate a select for each table, and then union them together.

If this is historical data, I'd go for two tables - "recent" and "old" - most
queries will be on the recent table (because that's how you're defining
recent) and if you need to go further back there's only two tables to union.
Of course, your situation may vary.

> 3) I can do the above, but store the data in tables by day instead of
> hours.

You'll have to test. It's going do depend on if people query for specific
hours or for totals for January.

> Ok, I think I've explained it as much as I can.
>
> On other note, how would I keep the sequences the same in both the index
> table and the data table? In a transaction, can I insert the 480 fields
> using the nextval() for the data table index field and then have another
> insert for the 20 fields and use currval() for the index table index
> field? Any other suggestions on how I'd do this?

That's the way. Note that you could do several of these in one transaction
which would help with performance on inserts.

BTW - when I mentioned splitting the tables, I didn't mean into 20+480 instead
of 500, I meant 20+40+23+56... etc. according to the *meaning* of the data.
If much of the data really isn't worth recording much of the time, I'd
probably have:

CREATE TABLE (
id1 int4,
id2 whatever,
id3 int4,
item_name varchar(64),
item_value varchar(16)
);

That means you don't have any data stored you don't need.

> With all that said, I'd appreciate any help you can offer.
>
> Any help is appreciated!

Hope it has been.

PS - another idea you may want to look into is pre-calculating totals per-hour
(or whatever) if that's the sort of thing your users are interested in.

PPS - it might be you don't want a database - if all you're doing is looking
up a few rows based on simple criteria, a flat-file system might be better.
--
Richard Huxton

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Richard Ray 2002-10-23 19:38:56 LDAP authentication
Previous Message Doug McNaught 2002-10-23 18:57:52 Re: Rollback deletes