Re: Problem with duplicate rows when FULL OUTER JOIN'ing 3 derived tables

From: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Problem with duplicate rows when FULL OUTER JOIN'ing 3 derived tables
Date: 2014-06-12 20:36:23
Message-ID: CAKFQuwYbHmp5sr-oTn0SEaTq-AcuA3H3gA1AOSDSviGXrRR_qA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On Thu, Jun 12, 2014 at 4:15 PM, Andreas Joseph Krogh-2 [via PostgreSQL] <
ml-node+s1045698n5807082h66(at)n5(dot)nabble(dot)com> wrote:

> På torsdag 12. juni 2014 kl. 21:52:22, skrev David G Johnston <[hidden
> email] <http://user/SendEmail.jtp?type=node&node=5807082&i=0>>:
>
> On Thu, Jun 12, 2014 at 3:49 PM, David Johnston <[hidden email]
> <http://user/SendEmail.jtp?type=node&node=5807077&i=0>> wrote:
>
>>
>>
>>> SELECT
>>> coalesce(apl.company_name, bns.company_name, pns.company_name,
>>> prs.company_name) AS company_name,
>>> coalesce(apl.project_name, bns.project_name, pns.project_name,
>>> prs.project_name) AS project_name,
>>> apl.qty AS
>>> num_apples,
>>> bns.qty AS
>>> num_bananas,
>>> pns.qty AS
>>> num_pineapples,
>>> prs.qty AS
>>> num_pears
>>> FROM
>>>
>>> company_apples AS apl
>>>
>>> FULL OUTER JOIN company_bananas AS bns
>>> ON apl.company_id = bns.company_id
>>> AND (apl.project_id IS NULL AND bns.project_id IS NULL OR
>>> apl.project_id = bns.project_id)
>>>
>>> FULL OUTER JOIN company_pineapples AS pns
>>> ON apl.company_id = pns.company_id
>>> AND (apl.project_id IS NULL AND pns.project_id IS NULL OR
>>> apl.project_id = pns.project_id)
>>>
>>> FULL OUTER JOIN company_pears AS prs
>>> ON apl.company_id = prs.company_id
>>> AND (apl.project_id IS NULL AND prs.project_id IS NULL OR
>>> apl.project_id = prs.project_id)
>>>
>>> ORDER BY company_name ASC, project_name ASC NULLS FIRST;
>>>
>>>
>>> This gives the result:
>>>
>>> *#* *company_name* *project_name* *num_apples* *num_bananas*
>>> *num_pineapples* *num_pears* 1 C1 NULL 2 2 10 NULL 2 C1 P1 5 12 NULL
>>> NULL 3 C1 P2 2 NULL NULL NULL 4 C2 NULL 3 NULL 10 NULL 5 C2 P1 3 NULL
>>> NULL NULL 6 C2 P2 3 NULL NULL NULL 7 C3 NULL NULL 8 NULL NULL 8 C3
>>> NULL NULL NULL NULL 7
>>> As you see, there are two rows for C3, which should have been 1, with
>>> num_bananas=8 and num_pears=7. There has to be something wrong with my FULL
>>> OUTER JOINs but I don't know what....
>>>
>>>
>>>
>>
>> ​Row #7 is from the right side of the outer join between apples and
>> bananas
>> Row #8 is from the right side of the outer join between apples and pears
>>
>> Since you are only using apple as the source of valid company-project
>> pairs only those records will be "correct" per your definition.
>>
>> Even if you do not use a CROSS JOIN between company/product you have to
>> obtain a master list of ​valid company-project pairs from ALL of the target
>> tables. You can then LEFT JOIN that master against each of the target
>> tables and be sure that you have a valid master record to attach to.
>>
>> The direct way to do this is:
>>
>> SELECT DISTINCT company_id, project_id FROM apples
>> UNION DISTINCT
>> SELECT DISTINCT company_id, project_id FROM pears
>> [and so forth]
>>
>> David J.
>>
>>
> ​Though I guess you could also simply chain together the FULL OUTER
> join:
>
> FROM ( ( (apple OUTER bananas)​ AS ab OUTER pears ) AS abp OUTER
> pineapples ) AS abpp
>
> David J.
>
>
> By OUTER, do you mean FULL OUTER JOIN here?
>

​Yes

> I'm unsure how to write the correct ON-clause of my FULL OUTER JOINs. Do
> you know how?
>

>

​You ON-clause was just fine...

> There will be more derived tables to FULL OUTER JOIN with so I need
> something robust, just don't know how to do it.
>
>

​WITH
a_src (companyid, projectid, a_count) AS ( SELECT companyid,
COALESCE(projectid,'N/A'), a_count FROM company_a)
, b_src​ (companyid, projectid, b_count) AS ( SELECT companyid,
COALESCE(projectid,'N/A'), b_count FROM company_b)
, left_master AS ( SELECT DISTINCT companyid, projectid FROM a_src UNION
DISTINCT SELECT DISTINCT companyid, projectid FROM b_src)
SELECT companyid, projectid, a_count, b_count
FROM left_master
LEFT JOIN a_src USING (companyid, projectid)
LEFT JOIN b_src USING (companyid, projectid)
;

If it is too slow to derive left_master you can consider adding triggers to
the company_product tables to maintain a separate table of known
combinations.

*The chaining version:*

​WITH
a_src (companyid, projectid, a_count) AS ( SELECT companyid,
COALESCE(projectid,'N/A'), a_count FROM company_a)
, b_src​ (companyid, projectid, b_count) AS ( SELECT companyid,
COALESCE(projectid,'N/A'), b_count FROM company_b)
, c_src​ (companyid, projectid, c_count) AS ( SELECT companyid,
COALESCE(projectid,'N/A'), c_count FROM company_c)
SELECT companyid, projectid, a_count, b_count, c_count
FROM (a_src FULL JOIN b_src USING (companyid, projectid)) ab_src FULL JOIN
c_src USING (companyid, projectid)) abc_src
;

Though you could also test whether the following is faster:

a_raw FULL JOIN b_raw ON ((a_raw.companyid, COALESCE(a_raw.projectid,
'N/A')) = (b_raw.companyid, COALESCE(b_raw.projectid, 'N/A')))

*Alternatively...go vertical:*

WITH
a_src (companyid, projectid, item_count, item_type) AS ( SELECT companyid,
COALESCE(projectid,'N/A'), a_count, 'A' FROM company_a)
, b_src​ (companyid, projectid, item_count, item_type) AS ( SELECT
companyid, COALESCE(projectid,'N/A'), b_count, 'B' FROM company_b)

SELECT *
FROM a_src

UNION ALL

SELECT *
FROM b_src
;

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Problem-with-duplicate-rows-when-FULL-OUTER-JOIN-ing-3-derived-tables-tp5806970p5807088.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Andreas Joseph Krogh 2014-06-12 21:28:33 Re: Re: Problem with duplicate rows when FULL OUTER JOIN'ing 3 derived tables
Previous Message Andreas Joseph Krogh 2014-06-12 20:07:46 Re: Re: Problem with duplicate rows when FULL OUTER JOIN'ing 3 derived tables