Re: Plan for relatively simple query seems to be very inefficient

From: "Dave Held" <dave(dot)held(at)arrayservicesgrp(dot)com>
To: "performance pgsql" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Plan for relatively simple query seems to be very inefficient
Date: 2005-04-06 17:45:31
Message-ID: 49E94D0CFCD4DB43AFBA928DDD20C8F9026184A0@asg002.asg.local
Lists: pgsql-performance

> -----Original Message-----
> From: Arjen van der Meijden
> [mailto:acmmailing(at)vulcanus(dot)its(dot)tudelft(dot)nl]
> Sent: Wednesday, April 06, 2005 11:53 AM
> To: performance pgsql
> Subject: [PERFORM] Plan for relatively simple query seems to be very
> inefficient
>
> [...]
> SELECT COUNT(*) FROM
> data_main AS dm,
> postcodes AS p
> WHERE dm.range BETWEEN p.range_from AND p.range_till
> [...]
> Aggregate (cost=332586.85..332586.85 rows=1 width=0) (actual time=22712.038..22712.039 rows=1 loops=1)
>   -> Nested Loop (cost=3.76..328945.96 rows=1456356 width=0) (actual time=0.054..22600.826 rows=82688 loops=1)

I'm still a noob at reading EXPLAIN ANALYZE, but it seems to me
that your statistics are throwing off the planner here. It
estimates 1.4M rows and gets 82K, so it's off by a factor of
about 20. Have you considered running VACUUM ANALYZE or upping
your statistics target?
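
Something like this, for instance (ANALYZE and per-column statistics
targets are standard PostgreSQL; the value 100 below is just an example,
not a recommendation):

-- Refresh the planner's statistics for both tables:
ANALYZE data_main;
ANALYZE postcodes;

-- If plain ANALYZE isn't enough, collect finer-grained statistics
-- on the join column and re-analyze:
ALTER TABLE data_main ALTER COLUMN range SET STATISTICS 100;
ANALYZE data_main;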

> [...]
> When I do something completely bogus, which results in joining
> each record from data_main to a single record from postcodes,
> it's still not very fast, but acceptable:
> [...]
> Aggregate (cost=10076.98..10076.98 rows=1 width=0) (actual time=1456.016..1456.017 rows=1 loops=1)
>   -> Merge Join (cost=8636.81..9913.13 rows=65537 width=0) (actual time=1058.105..1358.571 rows=81920 loops=1)

Looks like Merge Join is faster than the Nested Loop for this
query. If you notice, the row counts are a lot closer to the
estimates, too. This is probably a "good" plan.

> [...]
> Doing something similarly bogus, but with fewer results, is
> much faster, even though it should have basically the same
> plan:
>
> SELECT COUNT(*) FROM
> data_main AS dm,
> postcodes AS p
> WHERE dm.range = p.postcode_id
> [...]
> Aggregate (cost=2138.63..2138.63 rows=1 width=0) (actual time=180.667..180.668 rows=1 loops=1)
>   -> Hash Join (cost=4.00..2087.02 rows=20642 width=0) (actual time=180.645..180.645 rows=0 loops=1)

This one I don't understand at all. Clearly, the Hash Join is
the way to go, but the estimates are way off (20642 rows
estimated vs. 0 actual), which probably explains why this plan
isn't chosen in the first place.
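
If you want to see what the planner actually believes about those
columns, the pg_stats system view is worth a look (standard PostgreSQL;
the column names below are pg_stats' own):

-- Inspect the planner's per-column statistics for both tables:
SELECT tablename, attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename IN ('data_main', 'postcodes');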

>        Hash Cond: ("outer".range = "inner".postcode_id)
>        -> Seq Scan on data_main dm (cost=0.00..1262.20 rows=81920 width=2) (actual time=0.005..105.548 rows=81920 loops=1)
>        -> Hash (cost=3.60..3.60 rows=160 width=2) (actual time=0.592..0.592 rows=0 loops=1)
>           -> Seq Scan on postcodes p (cost=0.00..3.60 rows=160 width=2) (actual time=0.025..0.349 rows=160 loops=1)
> Total runtime: 180.807 ms
> (7 rows)
> [...]

My completely amateur guess is that the planner can use a Merge
Join or a Hash Join on your contrived queries because the join
condition is a simple equality (operator =), which those join
methods require. The BETWEEN clause is a range condition, which
can't be hashed or merged, so it forces the Nested Loop. You can
see that here:

-> Seq Scan on postcodes p (cost=0.00..3.60 rows=160 width=4) (actual time=0.010..0.396 rows=160 loops=1)
vs. here:

-> Index Scan using postcodes_pkey on postcodes p (cost=0.00..5.76 rows=160 width=2) (actual time=0.034..0.507 rows=160 loops=1)

So the first query forces a SeqScan on postcodes, while the
second can do an IndexScan.
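
One quick experiment to test that guess (enable_nestloop is a standard,
session-local planner setting): turn off nested loops and see what plan
the range join falls back to, and whether it's any faster:

-- Session-local experiment: forbid nested loops for this session.
SET enable_nestloop = off;

EXPLAIN ANALYZE
SELECT COUNT(*)
FROM data_main AS dm, postcodes AS p
WHERE dm.range BETWEEN p.range_from AND p.range_till;

-- Restore the default afterwards.
RESET enable_nestloop;

If the planner then picks something even worse, that at least confirms
a nested loop really is the best it can do for this kind of range join.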

--
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129
