From: | "Richard Huxton" <dev(at)archonet(dot)com> |
---|---|
To: | "Tony Grant" <tony(at)animaproductions(dot)com>, "Mitch Vincent" <mvincent(at)cablespeed(dot)com> |
Cc: | <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: html to postgres... |
Date: | 2001-07-16 16:26:48 |
Message-ID: | 001801c10e14$1df1b880$1001a8c0@archonet.com |
Lists: | pgsql-general |
From: "Tony Grant" <tony(at)animaproductions(dot)com>
> On 16 Jul 2001 11:07:55 -0400, Mitch Vincent wrote:
> > You could put the entire HTML page directly into a text type field in
> > PG..... That would give you limited flexibility as far as searching and
> > indexing goes but you didn't mention any specifics of what you were
> > attempting to do by having the pages in a database....
>
> Yes I was vague - the heat is coming back...
>
> These are film and director pages in a movie site. I am looking at
> HTML->XML tools then with a parser I should be able to create a tab
> delimited text file.
Did something similar myself a while ago.
Assuming the pages were all generated from a template originally, I found
the following approach the simplest:
1. Construct your database structure and create some test data.
2. Create the output system (db=>xml=>html, or whatever).
3. Build a (set of) perl script(s) to parse the HTML and strip the data out (if
   your pages are anything like mine, they're not *identical* formats, so you'll
   end up needing something custom-built).
4. Push the data into PostgreSQL.
5. Publish the website from the database.
6. Run a "diff" of the old and new pages.
7. Tweak the system as required and repeat until satisfied everything works.
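As a rough illustration of the parse-and-strip step, here is a minimal sketch. It is written in Python rather than perl purely for illustration, and it assumes something the thread does not state: that the template marks each value with a class attribute. The field names (title, director, year), the class FilmPageParser and the helper functions are all hypothetical. It writes one tab-delimited line per page, which is the default text format PostgreSQL's COPY reads.

```python
from html.parser import HTMLParser

# Hypothetical field markers: assumes the template tags each value with a class.
FIELDS = ("title", "director", "year")


class FilmPageParser(HTMLParser):
    """Collect the text inside elements whose class names one of FIELDS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.record = {}
        self._current = None  # field currently being captured
        self._tag = None      # tag name that opened the capture
        self._depth = 0       # nesting depth of that same tag name

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            # Already capturing: only track nesting of the opening tag name.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in FIELDS:
            if name in classes:
                self._current, self._tag, self._depth = name, tag, 1
                self.record.setdefault(name, "")
                break

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data


def extract_record(html):
    """Return a dict of the fields found in one page."""
    parser = FilmPageParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so values cannot contain tabs or newlines.
    return {k: " ".join(v.split()) for k, v in parser.record.items()}


def to_tsv_line(record):
    """One tab-delimited line; missing fields become empty strings."""
    return "\t".join(record.get(name, "") for name in FIELDS)
```

Pages that don't quite follow the template will show up as empty columns, which is exactly where the custom tweaks end up going.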
The key problem I found was that unless the pages were generated from a
database to start with, they all seemed to have minor changes. The only way
I could be satisfied I'd not missed data was to publish and compare. The
first couple of "diff"s were scary, and I ended up cutting and pasting a few
pieces manually, but I got everything out.
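The publish-and-compare check can be sketched along these lines, again in Python and only as an illustration. The function names and the old/ and new/ labels are hypothetical, and stripping whitespace before comparing is an assumption about what counts as a harmless difference; the plain "diff" command does the same job.

```python
import difflib


def normalize(html):
    # Strip each line and drop blank ones so pure whitespace changes don't show up.
    return [line.strip() for line in html.splitlines() if line.strip()]


def page_diff(old_html, new_html, name="page.html"):
    """Return unified-diff lines between the old page and the republished one."""
    return list(difflib.unified_diff(
        normalize(old_html),
        normalize(new_html),
        fromfile="old/" + name,
        tofile="new/" + name,
        lineterm="",
    ))
```

An empty result means the regenerated page matches the original apart from whitespace; anything else is a candidate for missed data or a template quirk.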
HTH
- Richard Huxton