Quick Links

Rewriting DISTINCT and losing performance

From:	"Chuck D(dot)" <pgsql-performance(at)nullmx(dot)com>
To:	pgsql-performance(at)postgresql(dot)org
Subject:	Rewriting DISTINCT and losing performance
Date:	2007-05-21 04:28:30
Message-ID:	200705202228.32129.pgsql-performance@nullmx.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Hi all,

I know we've covered this before but I'm having trouble with it today.

I have some geographic data in tables that I'm working with. I have a
country, state and city table. I was selecting the country_name out of the
country table but discovered that some countries (like Antarctica) didn't
have cities in the city table.

I resolved to query the country table for only country_name's which had
country_id's in the city table - meaning the country had cities listed.

The problem was I had a couple different sources (in separate tables) with
some extraneous column data so I chose to consolidate the city tables from
the different sources and column data that I don't need because I don't have
the hardware to support it.

That was the end of my query time.

Here's the original table and query:

explain analyze
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
(select country_id FROM geo.world_city)
;

QUERY
PLAN
-----------------------------------------------------------------------------
--------------------------------------------------------------------------
Nested Loop IN Join (cost=0.00..167.97 rows=155 width=15) (actual
time=85.502..3479.449 rows=231 loops=1)
-> Seq Scan on country (cost=0.00..6.44 rows=244 width=15) (actual
time=0.089..0.658 rows=244 loops=1)
-> Index Scan using idx_world_city_country_id on world_city
(cost=0.00..8185.05 rows=12602 width=2) (actual time=14.250..14.250 rows=1
loops=244)
Index Cond: (country.country_id = world_city.country_id)
Total runtime: 3479.921 ms

Odd that it took 3 seconds because every previous run has been much quicker.
The next run was:

QUERY
PLAN
-----------------------------------------------------------------------------
------------------------------------------------------------------------
Nested Loop IN Join (cost=0.00..167.97 rows=155 width=15) (actual
time=0.087..6.967 rows=231 loops=1)
-> Seq Scan on country (cost=0.00..6.44 rows=244 width=15) (actual
time=0.028..0.158 rows=244 loops=1)
-> Index Scan using idx_world_city_country_id on world_city
(cost=0.00..8185.05 rows=12602 width=2) (actual time=0.026..0.026 rows=1
loops=244)
Index Cond: (country.country_id = world_city.country_id)
Total runtime: 7.132 ms
(5 rows)

But that was irrelevant. I created a new table and eliminated the data and
it looks like this:

explain analyze
SELECT country_id, country_name
FROM geo.country
WHERE country_id IN
(select country_id FROM geo.city)
;

-- won't complete in a reasonable amount of time.

This one won't use the country_id index. The two tables have almost the same
number of rows:

cmi=# select count(*) from geo.world_city;
count
---------
1953314
(1 row)

cmi=# select count(*) from geo.city;
count
---------
2122712
(1 row)

I tried to force it and didn't see any improvement. I've vacuummed,
analyzed, clustered. Can someone help me to get only the countries who have
cities in the city table in a reasonable amount of time?

-------------------------------------------------------

Responses

Re: Rewriting DISTINCT and losing performance at 2007-05-21 09:14:14 from Josh Berkus
Re: Rewriting DISTINCT and losing performance at 2007-05-21 11:40:21 from Richard Huxton

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Ralph Mason	2007-05-21 05:17:42	Re: Ever Increasing IOWAIT
Previous Message	Tom Lane	2007-05-20 21:55:50	Re: Ever Increasing IOWAIT