Re: Two slightly different queries produce same results,

From: Andrew Schmidt <andrew(at)lifescale(dot)com>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Two slightly different queries produce same results,
Date: 2005-11-11 15:14:18
Message-ID: 4374B54A.4090407@lifescale.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Richard,

While in the process of responding to your email I found out what was
doing it. It was the grouping of the team_players team_id instead of
the team team_id.

Here's some table numbers:

team => 31720 rows,
team_players => 464896 rows,
player => 948 rows
player_updates => 5414 rows

The query:
SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
FROM
team t JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;

It's grouping by the team_players tp.team_id, change the group to
t.team_id and bam! 5x faster.

thanks for the help,

-- Andrew

*******
This was my response to you Richard before I found out what it was.
*******
I removed position from the query, but still got the same query times.
I summed up the pl.player_id just to make sure it was reading the player
table.

EXPLAIN ANALYSE
SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
FROM
team t JOIN team_players tp ON tp.team_id = t.team_id
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;

QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=8584.45..9364.83 rows=22390 width=12) (actual
time=894.364..1140.884 rows=313 loops=1)
-> Sort (cost=8584.45..8695.59 rows=44453 width=12) (actual
time=893.373..1013.989 rows=40928 loops=1)
Sort Key: tp.team_id
-> Hash Join (cost=174.52..4701.14 rows=44453 width=12)
(actual time=50.433..328.788 rows=40928 loops=1)
Hash Cond: ("outer".player_id = "inner".player_id)
-> Hash Join (cost=30.85..3840.72 rows=5731 width=12)
(actual time=7.279..110.578 rows=5005 loops=1)
Hash Cond: ("outer".player_id = "inner".player_id)
-> Nested Loop (cost=0.00..3723.90 rows=5731
width=8) (actual time=0.106..69.528 rows=5005 loops=1)
-> Index Scan using team_pool_id on team t
(cost=0.00..397.90 rows=391 width=4) (actual time=0.058..2.313 rows=313
loops=1)
Index Cond: (pool_id = 21699)
-> Index Scan using team_id_asdas on
team_players tp (cost=0.00..8.24 rows=21 width=8) (actual
time=0.027..0.122 rows=16 loops=313)
Index Cond: (tp.team_id = "outer".team_id)
-> Hash (cost=28.48..28.48 rows=948 width=4)
(actual time=7.121..7.121 rows=948 loops=1)
-> Seq Scan on player pl (cost=0.00..28.48
rows=948 width=4) (actual time=0.042..3.930 rows=948 loops=1)
-> Hash (cost=130.14..130.14 rows=5414 width=8) (actual
time=43.070..43.070 rows=5414 loops=1)
-> Seq Scan on player_updates pu
(cost=0.00..130.14 rows=5414 width=8) (actual time=0.013..22.410
rows=5414 loops=1)
Total runtime: 1144.059 ms
(17 rows)

---
VS fast query

EXPLAIN ANALYSE
SELECT tp.team_id,SUM(pl.player_id),SUM(goals) as total_goals
FROM
team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id = 21699
JOIN player_updates pu ON pu.player_id = tp.player_id
JOIN player pl ON pl.player_id = pu.player_id
WHERE t.pool_id = 21699
GROUP BY tp.team_id;


QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=270.31..278.80 rows=566 width=12) (actual
time=409.428..410.551 rows=313 loops=1)
-> Hash Join (cost=89.66..266.06 rows=566 width=12) (actual
time=120.341..278.310 rows=40928 loops=1)
Hash Cond: ("outer".player_id = "inner".player_id)
-> Seq Scan on player_updates pu (cost=0.00..130.14 rows=5414
width=8) (actual time=0.019..16.833 rows=5414 loops=1)
-> Hash (cost=89.48..89.48 rows=73 width=12) (actual
time=120.209..120.209 rows=5005 loops=1)
-> Hash Join (cost=53.16..89.48 rows=73 width=12)
(actual time=78.672..102.434 rows=5005 loops=1)
Hash Cond: ("outer".player_id = "inner".player_id)
-> Seq Scan on player pl (cost=0.00..28.48
rows=948 width=4) (actual time=0.040..3.092 rows=948 loops=1)
-> Hash (cost=52.97..52.97 rows=73 width=8)
(actual time=78.577..78.577 rows=5005 loops=1)
-> Nested Loop (cost=0.00..52.97 rows=73
width=8) (actual time=0.100..57.729 rows=5005 loops=1)
-> Index Scan using team_pool_id on
team t (cost=0.00..10.44 rows=5 width=4) (actual time=0.050..1.591
rows=313 loops=1)
Index Cond: ((pool_id = 21699)
AND (pool_id = 21699))
-> Index Scan using team_id_asdas on
team_players tp (cost=0.00..8.24 rows=21 width=8) (actual
time=0.021..0.099 rows=16 loops=313)
Index Cond: (tp.team_id =
"outer".team_id)
Total runtime: 411.629 ms
(15 rows)

Time: 417.726 ms

I ran ANALYZE on each of the tables, just to be sure, but there was no
changes.

Also, here's some table info:

team => 31720 rows,
team_players => 464896 rows,
player => 948 rows
player_updates => 5414 rows

thanks,

-- Andrew

>> The explain analyze:
>> -> Index Scan using team_pool_id on
>> team t (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556
>> rows=313 loops=1)
>> Index Cond: ((pool_id = 21699)
>> AND (pool_id = 21699))
>
>
> Here it sees two conditions, and PG is expecting that will restrict
> things further. It's clearly not expecting the same condition twice,
> but is presumably treating it something like "a>-4 AND a<4". This
> actually makes the row estimate wrong.
>
> So - if PG is getting this bit wrong why is query 2 faster?
> Looking at the row estimates for case 1 we can see that the final
> cases are badly wrong.
>
> > GroupAggregate (cost=8742.52..9853.85 rows=44453 width=13) (actual
> > time=1186.973..1432.548 rows=626 loops=1)
>
> Here it's got the number of rows badly wrong - if you actually ended
> up with 44,000 rows then maybe this would be the best plan. This is
> the last step though, so what happens before?
>
> > -> Sort (cost=8742.52..8853.66 rows=44453 width=13) (actual
> > time=1186.237..1309.562 rows=40928 loops=1)
> > Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)
>
> We're about right here. The problem must be in the last step, and
> looking at the row estimates it seems to think that no summarising is
> actually going to take place.
>
> I think the "position = 'G'" bit is confusing it. Well, we can test that:
>
> 1. Just group by "pl.position" - remove the test. Do the estimated and
> actual rows match now?
>
> 2. Rewrite the position='G' as a CASE structure
>
> 3. Do the position='G' part in a sub-query and join to that.
>

> The results of those should show whether this is actually the problem.
> --
> Richard Huxton
> Archonet Ltd
>
> .
>

--
_____________________________________________
G l o b a l D i a g n o s t i c s I n c.
Andrew Schmidt t.416-304-0049 x206
aschmidt(at)lifescale(dot)com f.866-697-8726
_____________________________________________

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Jaime Casanova 2005-11-11 15:26:13 Re: Best way to use indexes for partial match at
Previous Message Tom Lane 2005-11-11 14:57:16 Re: return next