Avoiding double-counting in aggregates with more than one join?

From: Paul Jungwirth <pj(at)illuminatedcomputing(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Avoiding double-counting in aggregates with more than one join?
Date: 2016-11-18 17:16:23
Message-ID: 86b9ec78-925c-1935-bc9d-6bad4ceb1f40@illuminatedcomputing.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi All,

I've noticed in the past that doing aggregates while joining to more
than one table can sometimes give you unintended results. For example,
suppose I have three tables: products, sales, and resupplies. In sales I
track what I sell, and in resupplies I track my own purchases to
increase inventory. Both have a foreign key to products. Now I want to
run a report showing the total dollars sold for each product versus the
total dollars spent for each product. I could try this:

SELECT p.id,
SUM(s.price * s.qty) AS total_sold,
SUM(r.price * r.qty) AS total_spent
FROM products p
LEFT OUTER JOIN sales s
ON s.product_id = p.id
LEFT OUTER JOIN resupplies r
ON r.product_id = p.id
GROUP BY p.id
;

That seems pretty safe, but actually I get bad answers,
for example if some product has this data:

sales
-----
sold 1 @ $2/ea

resupplies
----------
bought 1 @ $1/eq
bought 2 @ $1/ea

Then pre-grouping I have this:

p.id | s.qty | s.price | r.qty | r.price
-----+-------+---------+-------+--------
1 | 1 | $2 | 1 | $1
1 | 1 | $2 | 2 | $1

You can see the problem is that I'm going to double-count my sales.
What I really want is this:

p.id | s.qty | s.price | r.qty | r.price
-----+-------+---------+-------+--------
1 | 1 | $2 | 1 | $1
1 | | | 2 | $1

In the past I've always handled these situations by aggregating each
table separately
and only then joining things together:

WITH
s AS (
SELECT product_id,
SUM(price * qty) AS total_sold
FROM sales
GROUP BY product_id) s
),
r AS (
SELECT product_id,
SUM(price * qty) AS total_spent
FROM resupplies
GROUP BY product_id) r
)
SELECT p.id,
COALESCE(s.total_sold, 0),
COALESCE(r.total_spent, 0)
FROM products p
LEFT OUTER JOIN s
ON s.product_id = p.id
LEFT OUTER JOIN r
ON r.product_id = p.id
;

Since I've guaranteed that each CTE includes at most one row per product,
this is safe from double-counting errors.

But is there a better way? My approach feels verbose
and harder to read. Also you have to type COALESCE a lot. :-)
Is there some different way of doing things I haven't thought of yet?

Also I wonder about the performance merging all these subqueries together.
Would the final merging be any faster if I had an ORDER BY in each CTE?

It seems like this pattern comes up a lot;
what have others done about it?

Thanks,
Paul

Responses

Browse pgsql-general by date

  From Date Subject
Next Message David G. Johnston 2016-11-18 18:42:03 Re: Avoiding double-counting in aggregates with more than one join?
Previous Message William Ivanski 2016-11-18 15:58:03 Re: Trim performance on 9.5