Double checking my logic?

From: "Ted Byers" <r(dot)ted(dot)byers(at)rogers(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Double checking my logic?
Date: 2006-03-20 16:25:30
Message-ID: 04bb01c64c3a$e853ad90$6601a8c0@RnDworkstation
Lists: pgsql-general

I have to deal with the following situation.

I get a daily feed of data for several tables. This data is retrieved from the supplier by ftp in the form of CSV text files. I know I can load the files using "load data". The relational structure between the files and tables is simple: one file contains data for a lookup table. But the data in the lookup table is dynamic, potentially changing from one day to the next (though in practice, the rate of change is likely to be very slow).

I will also need to maintain a history in such a way that any and all changes in the data can be fully audited. This means that, for auditing purposes, I will need to be able to extract valid time for each item in the lookup table. The only likely change in the data is that occasionally items will be added to or removed from the lookup table. None of the data already present in the database will be editable (data validation happens before I get the feed).

The only analysis of which I have been told is done daily, as the feed is received. At present, no one looks at earlier data (although I expect that will change as I create a better structure for the database to support audits).
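(For what it's worth, "load data" is MySQL syntax; the PostgreSQL equivalent is COPY, which has handled CSV directly since 8.0. A minimal sketch, with table, column, and file names invented for illustration:

```sql
-- Server-side load: the file path is resolved on the database server.
-- Table and file names here are placeholders for whatever the feed supplies.
COPY products_feed (product_id, product_name)
FROM '/data/feed/products.csv'
WITH CSV HEADER;

-- From a client machine without server filesystem access, psql's \copy
-- does the same thing but reads the file on the client side:
-- \copy products_feed FROM 'products.csv' WITH CSV HEADER
```

The \copy form matters for your automation question later, since a cron-driven script need not run on the database host.)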

I am trying to create a flexible design so that refactoring will be simple if and when the assumptions or practices need to be changed.

I know how to handle all this using a brute force approach, but I expect that approach will be painfully slow. So here is what I am planning.

1) create a suite of tables corresponding to the files in the feed, with the addition of a date/time stamp containing the date and time on which the data being processed was received.
2) create a suite of temporary tables corresponding to the tables created in step 1 (but without the date/time stamp).
3) load the data into the temporary tables
4) analyse the data while it is in the temporary tables, storing the result of the analysis in new tables
5) copy the data into the permanent tables, and add the date and time stamp for the data (this date/time stamp is not present in the files retrieved).
6) free the temporary tables
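Steps 2 through 6 above could be sketched roughly as follows for a single feed table; all names are placeholders, and the real schema would repeat this per file in the feed:

```sql
-- Step 2: staging table matching the feed layout (no timestamp column).
-- TEMPORARY tables live only for the session, which suits a daily script.
CREATE TEMPORARY TABLE staging_products (
    product_id   integer,
    product_name text
);

-- Step 3: bulk-load today's file.
COPY staging_products FROM '/data/feed/products.csv' WITH CSV HEADER;

-- (Step 4: run the daily analysis against staging_products here.)

-- Step 5: copy into the permanent table, stamping the receipt time once
-- for the whole batch rather than storing it per row in the feed file.
INSERT INTO products_history (product_id, product_name, received_at)
SELECT product_id, product_name, now()
FROM staging_products;

-- Step 6: temporary tables vanish at session end anyway, but an explicit
-- DROP keeps a long-lived scripted session tidy.
DROP TABLE staging_products;
```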

Now, this second-last step is brute force, adequate for all but one of the tables: the lookup table. If I stick with the brute force approach, the lookup table will waste a significant amount of space. This won't be much initially, but it is guaranteed to get worse as time passes, and I'd expect the lookup performance to degrade as the amount of data in the lookup table increases.

Each record in the lookup table represents a product, and both the name and the product will have valid time intervals that may not be related. The name may change because the product has been renamed for whatever reason (and we don't care why), or the name may disappear altogether because the product has been discontinued. We can distinguish the two cases because each product has an ID that remains valid while the product exists, and the ID won't be in the data at all if the product is discontinued.

I am considering creating an additional table just to look up product names, but with two date and time stamps. The first would represent the first time the product name appears in the data, and the last would represent the last time the product name is present in the data. The first of these would be written only once, on the first day for which we have data.

I am torn between updating the last of these every day, until the name disappears, or leaving it null until the name disappears. Leaving it null would save space, but updating it with the current date and time should save time, since it would not be necessary to execute a complex conditional on every product ever included in the database. If we update it only for those items in today's data, those that were terminated before today will not have their date/time stamp updated, so the two fields will always represent the time interval for which the name is valid. A similar logic applies to the product ID.
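The update-only-what-is-present variant I describe above could look something like this; the staging table and all column names are assumptions carried over from the earlier sketch:

```sql
-- Two-timestamp validity table for product names.
CREATE TABLE product_names (
    product_id   integer,
    product_name text,
    first_seen   timestamp NOT NULL,
    last_seen    timestamp NOT NULL,
    PRIMARY KEY (product_id, product_name)
);

-- Daily maintenance: advance last_seen only for names present in today's
-- feed. Discontinued names simply stop being touched, so (first_seen,
-- last_seen) converges on the valid-time interval with no conditional
-- scan over every product ever recorded.
UPDATE product_names pn
SET    last_seen = now()
FROM   staging_products sp
WHERE  pn.product_id   = sp.product_id
AND    pn.product_name = sp.product_name;

-- Names appearing for the first time get both stamps set to today.
INSERT INTO product_names (product_id, product_name, first_seen, last_seen)
SELECT sp.product_id, sp.product_name, now(), now()
FROM   staging_products sp
WHERE  NOT EXISTS (
    SELECT 1 FROM product_names pn
    WHERE  pn.product_id   = sp.product_id
    AND    pn.product_name = sp.product_name
);
```

A table of the same shape keyed on product_id alone would cover the product-ID intervals.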

Have I overlooked anything that is obvious to you? Any gotchas I should be aware of?

What opportunities for improving performance do you see?

The ultimate objective is to have this fully automated from a shell script, configured by our sysop to execute at a specific time every day, Monday through Friday. Can all of the steps I describe above be implemented using ANSI standard SQL, and thus put into a stored procedure, or should I look at doing some of it in a Perl script or Java application? I suppose I will have to have at least a basic Java application, or Perl script, if only to connect to the database and invoke any functions I have created to do this.
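(One note on that last point: psql itself can serve as the connecting client, so a separate Java or Perl layer is not strictly required. A hypothetical wrapper, with database, path, and function names all invented for illustration:

```shell
#!/bin/sh
# Hypothetical daily-feed wrapper; ON_ERROR_STOP aborts on the first failure.
psql -d feeds -v ON_ERROR_STOP=1 <<'SQL'
\copy staging_products FROM '/data/feed/products.csv' WITH CSV HEADER
SELECT process_daily_feed();  -- assumed PL/pgSQL function wrapping steps 4-6
SQL

# crontab entry, weekdays at 06:00:
#   0 6 * * 1-5 /usr/local/bin/load_feed.sh
```

Note that COPY itself is a PostgreSQL extension rather than ANSI SQL, and the batch logic would live in a PL/pgSQL function rather than a standard stored procedure, so strict ANSI portability is probably out of reach either way.)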

Any suggestions would be appreciated.

Thanks,

Ted

R.E. (Ted) Byers, Ph.D., Ed.D.
R & D Decision Support Solutions
http://www.randddecisionsupportsolutions.com/
