| From: | Patrick JACQUOT <patrick(dot)jacquot(at)anpe(dot)fr> | 
|---|---|
| To: | Janning Vygen <vygen(at)gmx(dot)de> | 
| Cc: | pgsql-sql(at)postgresql(dot)org, ogjunk-pgjedan(at)yahoo(dot)com | 
| Subject: | Re: Need help: Find dirty rows, Update, Delete SQL | 
| Date: | 2006-02-20 11:24:32 | 
| Message-ID: | 43F9A6F0.7030701@anpe.fr | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-sql | 
Janning Vygen wrote:
>Am Samstag, 18. Februar 2006 18:41 schrieb ogjunk-pgjedan(at)yahoo(dot)com:
>  
>
>>Hello,
>>
>>I need a bit of help with some SQL.
>>I have two tables, call them Page and Bookmark.
>>Each row in Page can have many Bookmarks pointing to it, and
>>they are joined via a FK (Page.id = Bookmark.page_id).
>>
>>Page has a 'url' column: Page.url, which has a unique index on it.
>>
>>My Page.url column got a little dirty, and I need to clean it up,
>>and that's what I need help with.
>>
>>Here is an example of dirtiness:
>>
>>Page:
>>
>>id=1 url = 'http://example.com/'
>>id=2 url = 'http://example.com/#'     -- dirty
>>id=3 url = 'http://example.com/#foo'  -- dirty
>>
>>The last two rows are dirty.  Normally I normalize URLs before
>>inserting them, but these got in, and now I need to clean them.
>>
>>The problem is that rows in Bookmark table may point to dirty
>>rows in Page, so I can't just remove the dirty rows, and I can't
>>just update 'url' column in Page to 'http://example.com/',
>>because that column is unique.
>>
>>Is there some fancy SQL that I can use them to find the dirty
>>rows in page (... where url like '%#%') and then find rows in
>>Bookmark table that point to them, then point those rows to
>>good rows in Page (e.g. id=1 row above), and finally remove the
>>dirty rows from Page?
>>    
>>
>
>try this. But please check if it really does its job. I just wrote it down in 
>a minute or two. There will be an easier way or nicer written SQL but a sit 
>is just a one time operation you shoudn't care too much. One more hint: you 
>should add a CHECK clause to your page_url like  "page_url text NOT NULL 
>UNIQUE CHECK (page_url !~ '#')"
>
>here is my test code
>
>CREATE TABLE pages (
>  page_id SERIAL PRIMARY KEY,
>  page_url text NOT NULL UNIQUE
>);
>
>CREATE TABLE bookmarks (
>  bm_id SERIAL PRIMARY KEY,
>  bm_text text not null,
>  page_id int4 NOT NULL REFERENCES pages (page_id)
>);
>
>INSERT INTO pages (page_url) VALUES ('http://example.com/');
>INSERT INTO pages (page_url) VALUES ('http://example.com/#');
>INSERT INTO pages (page_url) VALUES ('http://example.com/#foo');
>INSERT INTO pages (page_url) VALUES ('http://example2.com/#foo');
>INSERT INTO pages (page_url) VALUES ('http://example3.com/#foobar');
>
>insert into bookmarks (bm_text, page_id) values ('test1', 1);
>insert into bookmarks (bm_text, page_id) values ('test2', 1);
>insert into bookmarks (bm_text, page_id) values ('test3', 2);
>insert into bookmarks (bm_text, page_id) values ('test4', 2);
>insert into bookmarks (bm_text, page_id) values ('test5', 3);
>insert into bookmarks (bm_text, page_id) values ('test6', 3);
>insert into bookmarks (bm_text, page_id) values ('test7', 4);
>
>BEGIN;
>UPDATE bookmarks set page_id = pages2.page_id
>FROM 
>  pages AS pages1, 
>  pages AS pages2
>WHERE 
>  pages1.page_id = bookmarks.page_id
>  AND pages2.page_url = split_part(pages1.page_url, '#', 1)
>;
>
>DELETE FROM pages WHERE page_id IN (
>  SELECT 
>    pages1.page_id
>  FROM
>    pages AS pages1 
>    JOIN pages AS pages2 ON ( 
>      pages1.page_id != pages2.page_id 
>      AND pages2.page_url = split_part(pages1.page_url, '#', 1)
>    )
>  WHERE position('#' in pages1.page_url) > 0
>  AND pages1.page_id NOT IN (SELECT page_id FROM bookmarks)
>);
>;
>
>UPDATE pages SET page_url = split_part(page_url, '#', 1) 
>WHERE position('#' in pages.page_url) > 0
>;
>select * from bookmarks;
>select * from pages;
>COMMIT;
>
>
>kind regards,
>janning
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>
>  
>
why do you consider as dirty perfectly honest URLs as 
http://example.com#foo ?
Such a construct points to a specific part (foo) of a specific document 
(http://example.com)
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Bruno Wolff III | 2006-02-20 17:33:26 | Re: ORDER BY CASE ... | 
| Previous Message | Patrick JACQUOT | 2006-02-20 11:02:18 | Re: SELECT on a to-be-determined table |