Re: SELECT question

From: Michelle Konzack <linux4michelle(at)freenet(dot)de>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: SELECT question
Date: 2007-08-18 21:40:19
Message-ID: 20070818214018.GA10759@freenet.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Am 2007-08-17 12:53:41, schrieb Michael Glaesemann:
>
> On Aug 17, 2007, at 7:27 , Michelle Konzack wrote:
>
> >*********************************************************************
> >* Do not Cc: me, because I am on THIS list, if I write here. *
>
> You might want to consider changing your mailing list subscription
> settings to "eliminatecc", e.g., send email to
> majordomo(at)postgresql(dot)org (not the list address!) with body
>
> set pgsql-general eliminatecc
>
> This should prevent the mailing list from sending you a second copy.

Which mean, my "INBOX.ML_pgsql.general/" will never receive
messages and break all threads where someone send me CC's...

> I think what you want is something like:
>
> SELECT DISTINCT ON (website_reference) website_reference,
> download_date, file_path
> FROM indextable
> WHERE download_date <= ? -- whatever date you're interested in
> ORDER BY website_reference, download_date DESC;
>
> This should return the most recent website_reference and its
> download_date that's earlier than the download_date specified in the
> WHERE clause.
>
> DISTINCT ON is a (very helpful) PostgreSQL extension. You can get
> similar results using a subquery;

I have never used "DISTINCT ON" (it was not known to me)
and was trying subqueries... :-/

> SELECT website_reference, download_date, file_path
> FROM indextable
> NATURAL JOIN (
> SELECT website_reference, max(download_date) as download_date
> FROM indextable
> WHERE download_date <= ?
> GROUP BY website_reference
> ) most_recent_versions;
>
> This may return more than one row per website_reference if the
> website_reference has more than on file_path for a particular
> download_date.
>
> Does this help? If not, could you give a bit more of a concrete example?

I have an Enterprise which do researches :-) and I have a local cache
of more then 150.000.000 URL's and its content (~8 TByte)... (I have
hit over 2000 md5 collisons and now using sha384) Also I get per day
nearly 100.000 new files...

OK, HTML pages are downloaded and go into the first table like

indextable FULL_URL, SHA384SUM

and the second table

content SERNUM (uniq), SHA384SUM (pri), LOCAL_PATH

the saved file get as the name the SHA384 name

If I open an HTML-URL with a specific date, it is parsed and the URL's
inline are adapted to make it work from my database, e.g.

http://www.postgresql.org/index.html

will become

http://webcache/show.php?date=123456789&url=http://www.postgresql.org/index.html

inline elements and already downloaded other links will bekome

http://webcache/show.php?date=123456789&url=<original_url>

Thanks to the PostgreSQL developers that they have created "tablespace"
and "table partitioning" since searching in 150.000.000 ROW's is the
hell.

> (Is is just me or have there been a lot of queries that can be solved
> using DISTINCT ON recently?)

I do not know... Since when does "DISTINCT ON" exist?

Greetings
Michelle Konzack
Systemadministrator
Tamay Dogan Network
Debian GNU/Linux Consultant

--
Linux-User #280138 with the Linux Counter, http://counter.li.org/
##################### Debian GNU/Linux Consultant #####################
Michelle Konzack Apt. 917 ICQ #328449886
50, rue de Soultz MSN LinuxMichi
0033/6/61925193 67100 Strasbourg/France IRC #Debian (irc.icq.com)

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message dustov 2007-08-18 22:21:02 PGError: input out of range
Previous Message Karsten Hilbert 2007-08-18 21:39:56 Re: SUBSTRING performance for large BYTEA