Re: Picking 25 samples of every domain

From: Benoit Izac <benoit(at)izac(dot)org>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: Picking 25 samples of every domain
Date: 2014-10-10 19:13:14
Message-ID: 87iojrrb5h.fsf@izac.org
Lists: pgsql-novice

On 09/10/2014 at 23:25, Gary Warner wrote:

> I have a set of Postgres tables that are related to URLs found in
> email. The tables are BIG. 40-50 million records per day. We are
> using them for some research into tricks spammers use to confound
> blacklists. When we parse the URLs, we pull out the "domain" portion of
> each URL and store it in a field called "top_domain". The full URL is
> available as "link".
>
> Through various forms of randomization, customization, and wild-carding,
> a domain may have as many as 1 million URLs per day. I am needing a
> query that would grab a sample number of URLs per domain (let's say 25
> for conversation) . . . something that in pseudo-code might look like
>
> for each top_domain in urltable do
> select top_domain, link limit 25;

CREATE TYPE urltable_type AS (top_domain text, link text);

CREATE OR REPLACE FUNCTION urltable_sample(integer)
RETURNS SETOF urltable_type AS $$
DECLARE
    td text;
BEGIN
    FOR td IN SELECT DISTINCT top_domain FROM urltable
    LOOP
        -- format() with %L quotes the domain value safely; a plain
        -- string concatenation breaks on apostrophes in the data.
        RETURN QUERY EXECUTE format(
            'SELECT top_domain::text, link::text
             FROM urltable
             WHERE top_domain = %L
             LIMIT %s', td, $1);
    END LOOP;
    RETURN;
END
$$ LANGUAGE plpgsql;

SELECT top_domain, link FROM urltable_sample(25);
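
For tables this large, a single set-based query may well beat looping over each
domain. A sketch using the row_number() window function (assuming the same
urltable with top_domain and link columns; as with LIMIT above, the rows kept
within each domain are arbitrary unless you add an ORDER BY inside the OVER
clause):

```sql
-- Number the rows within each domain, then keep up to 25 of them,
-- all in one pass over the table.
SELECT top_domain, link
FROM (
    SELECT top_domain, link,
           row_number() OVER (PARTITION BY top_domain) AS rn
    FROM urltable
) AS numbered
WHERE rn <= 25;
```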

> Thoughts on the fastest way to do a query like that?

Not really, but I'm not an expert.

--
Benoit Izac
