Re: UNION versus SUB SELECT

From: Robert DiFalco <robert(dot)difalco(at)gmail(dot)com>
To: Elliot <yields(dot)falsehood(at)gmail(dot)com>
Cc: desmodemone <desmodemone(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: UNION versus SUB SELECT
Date: 2013-11-21 22:04:02
Message-ID: CAAXGW-wz8Xu8TcxwFpopct2=L7MHU48Zfo__epcUE7js5bkfSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hmmmm...I'm not sure why the buffers option didn't work for me, maybe the
heroku psql is out of date. No, the query gets slower with a high load of
data and runs pretty often.

I just created a small test dataset for this. When I have a larger one I
will post new explain plans but the timings seem pretty consistent
regardless of the results returns (usually only 2-200) even when there are
millions of records in "contacts", "phone_numbers", and "email_addresses".

In this case doesn't the correlated query have to do more work and access
more columns than the subselect approach?

On Thu, Nov 21, 2013 at 1:22 PM, Elliot <yields(dot)falsehood(at)gmail(dot)com> wrote:

> On 2013-11-21 16:12, Robert DiFalco wrote:
>
> Sorry I couldn't get buffers to work but here is the explain analyze
> verbose:
>
> dft1fjfv106r48=> explain analyze verbose select c.*
>
> from contacts c
>
> where c.id IN (
>
> select
> p.contact_id
>
> from phone_numbers p
>
> where (p.national = 5038904993 and p.e164 = '
> +15038904993'))
> or c.id IN (
>
> select
> e.contact_id
>
> from email_addresses e
>
> where e.email = 'robert(dot)xxxxx(at)gmail(dot)com')
>
> ;
> QUERY
> PLAN
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on public.contacts c (cost=8.12..75.73 rows=1988 width=95)
> (actual time=0.410..0.410 rows=0 loops=1)
> Output: c.id, c.owner_id, c.user_id, c.device_id, c.last_call,
> c.record_id, c.dtype, c.blocked, c.details_hash, c.fname, c.lname, c.fb_id
> Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
> Rows Removed by Filter: 2849
> SubPlan 1
> -> Index Scan using idx_phone_address on public.phone_numbers p
> (cost=0.06..4.06 rows=1 width=8) (actual time=0.015..0.015 rows=0 loops=1)
> Output: p.contact_id
> Index Cond: ((p."national" = 5038904993::bigint) AND
> ((p.e164)::text = '+15038904993'::text))
> SubPlan 2
> -> Index Scan using idx_email_address on public.email_addresses e
> (cost=0.06..4.06 rows=1 width=8) (actual time=0.018..0.018 rows=0 loops=1)
> Output: e.contact_id
> Index Cond: ((e.email)::text = 'robert(dot)xxxxx(at)gmail(dot)com'::text)
> Total runtime: 0.489 ms
> (13 rows)
>
> dft1fjfv106r48=> explain analyze verbose select c.*
>
> from contacts c
>
> where exists(
>
> select 1
>
> from
> phone_numbers p
>
> where (p.national = 5038904993 and p.e164 = '+15038904993') and
> p.contact_id = c.id)
> or EXISTS(
>
> select 1
>
> from
> email_addresses e
>
> where e.email = 'robert(dot)xxxxx(at)gmail(dot)com' and e.contact_id = c.id)
>
> ;
>
> QUERY PLAN
>
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on public.contacts c (cost=0.00..21596.38 rows=1988 width=95)
> (actual time=0.479..0.479 rows=0 loops=1)
> Output: c.id, c.owner_id, c.user_id, c.device_id, c.last_call,
> c.record_id, c.dtype, c.blocked, c.details_hash, c.fname, c.lname, c.fb_id
> Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR
> (alternatives: SubPlan 3 or hashed SubPlan 4))
> Rows Removed by Filter: 2849
> SubPlan 1
> -> Index Scan using idx_phone_address on public.phone_numbers p
> (cost=0.06..4.06 rows=1 width=0) (never executed)
> Index Cond: ((p."national" = 5038904993::bigint) AND
> ((p.e164)::text = '+15038904993'::text))
> Filter: (p.contact_id = c.id)
> SubPlan 2
> -> Index Scan using idx_phone_address on public.phone_numbers p_1
> (cost=0.06..4.06 rows=1 width=8) (actual time=0.010..0.010 rows=0 loops=1)
> Output: p_1.contact_id
> Index Cond: ((p_1."national" = 5038904993::bigint) AND
> ((p_1.e164)::text = '+15038904993'::text))
> SubPlan 3
> -> Index Scan using idx_email_address on public.email_addresses e
> (cost=0.06..4.06 rows=1 width=0) (never executed)
> Index Cond: ((e.email)::text = 'robert(dot)xxxxx(at)gmail(dot)com'::text)
> Filter: (e.contact_id = c.id)
> SubPlan 4
> -> Index Scan using idx_email_address on public.email_addresses e_1
> (cost=0.06..4.06 rows=1 width=8) (actual time=0.016..0.016 rows=0 loops=1)
> Output: e_1.contact_id
> Index Cond: ((e_1.email)::text = 'robert(dot)xxxxx(at)gmail(dot)com
> '::text)
> Total runtime: 0.559 ms
> (21 rows)
>
> dft1fjfv106r48=> explain analyze verbose select * from contacts where id
> IN (
> (select c.id
>
> from contacts c
>
> join phone_numbers p on
> c.id = p.contact_id and p.national = 5038904993 and p.e164 = '+15038904993')
> union
> (select c.id
>
> from contacts c
>
> join email_addresses e on c.id = e.contact_id
> and e.email = 'robert(dot)xxxxx(at)gmail(dot)com'));
>
>
> QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=16.31..24.39 rows=2 width=95) (actual
> time=0.060..0.060 rows=0 loops=1)
> Output: contacts.id, contacts.owner_id, contacts.user_id,
> contacts.device_id, contacts.last_call, contacts.record_id, contacts.dtype,
> contacts.blocked, contacts.details_hash, contacts.fname, contacts.lname,
> contacts.fb_id
> -> Unique (cost=16.26..16.26 rows=2 width=8) (actual
> time=0.057..0.057 rows=0 loops=1)
> Output: c.id
> -> Sort (cost=16.26..16.26 rows=2 width=8) (actual
> time=0.055..0.055 rows=0 loops=1)
> Output: c.id
> Sort Key: c.id
> Sort Method: quicksort Memory: 25kB
> -> Append (cost=0.11..16.25 rows=2 width=8) (actual
> time=0.034..0.034 rows=0 loops=1)
> -> Nested Loop (cost=0.11..8.12 rows=1 width=8)
> (actual time=0.013..0.013 rows=0 loops=1)
> Output: c.id
> -> Index Scan using idx_phone_address on
> public.phone_numbers p (cost=0.06..4.06 rows=1 width=8) (actual
> time=0.011..0.011 rows=0 loops=1)
> Output: p.id, p.contact_id,
> p."national", p.e164, p.raw_number
> Index Cond: ((p."national" =
> 5038904993::bigint) AND ((p.e164)::text = '+15038904993'::text))
> -> Index Only Scan using
> idx_contacts_pkey_owner on public.contacts c (cost=0.06..4.06 rows=1
> width=8) (never executed)
> Output: c.id, c.owner_id, c.user_id
> Index Cond: (c.id = p.contact_id)
> Heap Fetches: 0
> -> Nested Loop (cost=0.11..8.12 rows=1 width=8)
> (actual time=0.018..0.018 rows=0 loops=1)
> Output: c_1.id
> -> Index Scan using idx_email_address on
> public.email_addresses e (cost=0.06..4.06 rows=1 width=8) (actual
> time=0.016..0.016 rows=0 loops=1)
> Output: e.id, e.contact_id, e.email
> Index Cond: ((e.email)::text = '
> robert(dot)xxxxx(at)gmail(dot)com'::text)
> -> Index Only Scan using
> idx_contacts_pkey_owner on public.contacts c_1 (cost=0.06..4.06 rows=1
> width=8) (never executed)
> Output: c_1.id, c_1.owner_id, c_1.user_id
> Index Cond: (c_1.id = e.contact_id)
> Heap Fetches: 0
> -> Index Scan using idx_contacts_pkey_owner on public.contacts
> (cost=0.06..4.06 rows=1 width=95) (never executed)
> Output: contacts.id, contacts.owner_id, contacts.user_id,
> contacts.device_id, contacts.last_call, contacts.record_id, contacts.dtype,
> contacts.blocked, contacts.details_hash, contacts.fname, contacts.lname,
> contacts.fb_id
> Index Cond: (contacts.id = c.id)
> Total runtime: 0.332 ms
> (31 rows)
>
>
>
> The buffers option is 9.0+ and is used like "explain (analyze, verbose,
> buffers) select 1".
> To your original question, the union output there runs slightly faster
> than the "in" approach, although this may not be a good example - your
> inputs don't return any data, so this might not be realistic - and those
> numbers are so low that the difference might just be noise.
> Are you tuning a <0.5ms-runtime query? Or is this just curiosity? FWIW I
> tend to write queries like this using an exists check first, then if that's
> still not good enough (all things like proper indexing taken in to account)
> I'll try an in check, then finally a union if that's still not good enough.
>
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message bricklen 2013-11-21 22:58:55 Re: UNION versus SUB SELECT
Previous Message Robert DiFalco 2013-11-21 21:12:36 Re: UNION versus SUB SELECT