From: | Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> |
---|---|
To: | Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> |
Cc: | Allan Kamau <allank(at)sanbi(dot)ac(dot)za>, pgsql-sql(at)postgresql(dot)org |
Subject: | Re: Removing redundant itemsets |
Date: | 2008-03-31 11:29:22 |
Message-ID: | 47F0CB12.5010506@postnewspapers.com.au |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
Craig Ringer wrote:
> Allan Kamau wrote:
>> Hi all,
>> I have a list of purchases (market basket) and I would like to select
>> non redundant longest possible patterns by eliminating
>> (creating/populating other table to contain only non redandant itemsets)
>> purchases having item lists which are fully included in at least one
>> other purchase.
>
> Here's a possibly slow and surely ugly solution (I think it's right,
> though I haven't done more than passing testing):
>
>
>
> CREATE VIEW togo_as_arr AS
> SELECT a.tid,
> ARRAY(SELECT item FROM togo b WHERE b.tid = a.tid ORDER BY item)
> AS items
> FROM togo a GROUP BY tid;
>
> SELECT arr_a.tid AS redundant_tid, arr_b.tid AS contained_by
> FROM togo_as_arr arr_a CROSS JOIN togo_as_arr arr_b
> WHERE arr_a.tid <> arr_b.tid AND arr_a.items <@ arr_b.items;
Alternately:
-- Helps with the massively repeated subquery below
CREATE INDEX togo_by_tid_and_item ON togo(tid,item);
-- Find any `a' for which `item_from_a_is_in_b' is
-- true for all items in `a'
SELECT a_tid AS is_redundant, b_tid AS contained_by
FROM (
-- For every item in every pair of purchases,
-- determine whether the item in purchase `a'
-- was also in purchase `b'.
SELECT
a.tid AS a_tid,
b.tid AS b_tid,
a.item AS item,
EXISTS(
-- Was this item from `a' also in the `b' purchase?
SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item
) AS item_from_a_is_in_b
FROM togo a INNER JOIN togo b ON (a.tid <> b.tid)
GROUP BY a.tid, b.tid, a.item) AS item_containment
GROUP BY a_tid, b_tid
HAVING every(item_from_a_is_in_b);
... which avoids the array building, but is actually considerably slower
on the trivial test data. That's not too surprising given that this
approach requires a subquery:
SELECT 1 FROM togo x WHERE x.tid = b.tid AND x.item = a.item
for EVERY item to be tested. Twice, actually, as each record appears in
both `a' and `b' positions.
I'd be very interested to see what happened on real world test data,
especially compared to doing the array accumulation based query off a
temporary table instead of a view.
I suspect it'll depend on the average number of items per purchase -
lots of items per purchase and the array building cost will dominate.
That's really just a guess, though.
I'm sure there's a properly smart way to do this that I just can't
figure out, but this is the best I've come up with so far.
--
Craig Ringer
From | Date | Subject | |
---|---|---|---|
Next Message | Craig Ringer | 2008-03-31 11:48:34 | Re: Removing redundant itemsets |
Previous Message | Craig Ringer | 2008-03-31 10:53:28 | Re: Removing redundant itemsets |