Re: Differences in identical queries

From: Rob Schall <rschall(at)callone(dot)net>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Differences in identical queries
Date: 2007-03-02 18:11:56
Message-ID: 45E868EC.3050101@callone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

The final answer wound up being....
ALTER TABLE current ALTER COLUMN orignum SET STATISTICS 100;
and then an ANALYZE current;

Then the queries started running like a champ (split seconds, not minutes).

Thanks Richard!
Rob

Richard Huxton wrote:
> Rob Schall wrote:
>> When I reanalyzed the anitmp table with just the 4 entries (2 with
>> istf=true and 2 with istf=false), both queries then ran the same
>> way/time.
>> So it would appear, if you want to do a join or a subselect (IN), then
>> the number of items if will be comparing it to must be less than a
>> certain number.
>
> I'm still not sure how you're concluding that.
>
> > In my case, the full amount of false's that fail is
>> actually 40. So if you have a table of 2 million items (in current) and
>> want to join it to a table of 40 items, then it must do the longer
>> hashed join version and not the nested loop.
>>
>> Below are the results with the anitmp table with 42 items. 40 set as
>> false, and 2 set as true. Is there a way to rewrite my query to have it
>> run as quick as the others?
>
>> calldetail=> EXPLAIN ANALYZE SELECT current.* FROM current JOIN anitmp
>> ON current.orignum=anitmp.ani AND istf=false;
>> QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------------
>>
>> Hash Join (cost=1.63..1031833.26 rows=256563 width=108) (actual
>> time=1889.469..155380.749 rows=653 loops=1)
>
> OK - so here's the root of the problem. The planner thinks it'll get
> back 256,563 rows but actually gets 653. If you actually got more than
> 200,000 rows back then a seq-scan on current might well make sense.
>
>> Hash Cond: ("outer".orignum = "inner".ani)
>> -> Seq Scan on current (cost=0.00..920527.00 rows=10873900
>> width=108) (actual time=670.402..136192.991 rows=10681150 loops=1)
>> -> Hash (cost=1.52..1.52 rows=41 width=8) (actual time=0.187..0.187
>> rows=0 loops=1)
>> -> Seq Scan on anitmp (cost=0.00..1.52 rows=41 width=8)
>> (actual time=0.014..0.108 rows=40 loops=1)
>> Filter: (istf = false)
>
> Hmm - what sort of distribution of values do you have in "orignum" -
> telephone numbers, so presumably they're quite distinct.
>
> I'd be tempted to up the statistics on that column, reanalyse both
> tables and see what happens.
> ALTER TABLE current ALTER COLUMN orignum SET STATISTICS=100;
> You can set values up to 1000, start at 100 and step up. Not checked
> the syntax on that last statement btw.
>
> You can see the before and after effects by looking at:
> SELECT * FROM pg_stats WHERE tablename='current' AND attname='orignum';
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2007-03-02 18:18:17 Re: Re : pg_dump, serial
Previous Message Laurent ROCHE 2007-03-02 17:54:47 Re : pg_dump, serial