Re: Two slightly different queries produce same results,

From: Richard Huxton <dev(at)archonet(dot)com>
To: Andrew Schmidt <andrew(at)lifescale(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Two slightly different queries produce same results,
Date: 2005-11-11 09:05:37
Message-ID: 43745EE1.1010208@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Andrew Schmidt wrote:
> However, I've run into a problem where one query took about twice as
> long as innodb. Some investigation and playing around with the query, I
> ended up with a slightly different query but was about 3 times as fast
> as innodb (and 5 times faster than the original query). I didn't add
> any indices and the EXPLAIN ANALYZE doesn't show any new indices being
> used with the new query. The order of which tables are read first was
> changed in the query planner and also it uses HashAggregate vs
> GroupAggregate.. not entirely sure what that means.

It's a different method of aggregating data (calculating your sum()).
The HashAggregate puts values into buckets and works best when you have
a fairly small number of different values (so each value can get its own
bucket). I think - bear in mind I'm just another user, so I'm not
familiar with the internals.

The different plans are because PG thinks it will get different numbers
of rows...

> SELECT tp.team_id, pl.position = 'G' AS is_goalie, SUM(goals) AS
> total_goals
> FROM
> team t JOIN team_players tp ON tp.team_id = t.team_id
> JOIN player_updates pu ON pu.player_id = tp.player_id
> JOIN player pl ON pl.player_id = pu.player_id
> WHERE t.pool_id = 21699
> GROUP BY tp.team_id,is_goalie;

> Explain analyze of this query:

> -> Index Scan using team_pool_id on team t
> (cost=0.00..397.90 rows=391 width=4) (actual time=0.061..2.129 rows=313
> loops=1)
> Index Cond: (pool_id = 21699)

With this index scan you have one condition and PG expects 391 matching
rows (it gets 313, so that's a pretty good estimate).

> SELECT tp.team_id,pl.position = 'G' AS is_goalie,SUM(goals) AS total_goals
> FROM
> team t JOIN team_players tp ON tp.team_id = t.team_id AND t.pool_id =
> 21699 <--- ** added pool_id
> JOIN player_updates pu ON pu.player_id = tp.player_id
> JOIN player pl ON pl.player_id = pu.player_id
> WHERE t.pool_id = 21699
> GROUP BY tp.team_id,is_goalie;
>
> The explain analyze:
> -> Index Scan using team_pool_id on
> team t (cost=0.00..10.44 rows=5 width=4) (actual time=0.049..1.556
> rows=313 loops=1)
> Index Cond: ((pool_id = 21699) AND
> (pool_id = 21699))

Here it sees two conditions, and PG is expecting that will restrict
things further. It's clearly not expecting the same condition twice, but
is presumably treating it something like "a>-4 AND a<4". This actually
makes the row estimate wrong.

So - if PG is getting this bit wrong why is query 2 faster?
Looking at the row estimates for case 1 we can see that the final cases
are badly wrong.

> GroupAggregate (cost=8742.52..9853.85 rows=44453 width=13) (actual
> time=1186.973..1432.548 rows=626 loops=1)

Here it's got the number of rows badly wrong - if you actually ended up
with 44,000 rows then maybe this would be the best plan. This is the
last step though, so what happens before?

> -> Sort (cost=8742.52..8853.66 rows=44453 width=13) (actual
> time=1186.237..1309.562 rows=40928 loops=1)
> Sort Key: tp.team_id, (pl."position" = 'G'::bpchar)

We're about right here. The problem must be in the last step, and
looking at the row estimates it seems to think that no summarising is
actually going to take place.

I think the "position = 'G'" bit is confusing it. Well, we can test that:

1. Just group by "pl.position" - remove the test. Do the estimated and
actual rows match now?

2. Rewrite the position='G' as a CASE structure

3. Do the position='G' part in a sub-query and join to that.

The results of those should show whether this is actually the problem.
--
Richard Huxton
Archonet Ltd

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andrus 2005-11-11 09:31:37 Re: Best way to use indexes for partial match at
Previous Message Guido Neitzer 2005-11-11 08:33:53 Re: Ordering and unicode