Odd misprediction

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Odd misprediction
Date: 2011-09-16 14:50:40
Message-ID: CAGTBQpb0v7_jFnw-ohpWG7SJh=LHEeK9do2M7BKMUTFtZTLoug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

It's not an issue for me (it's not really impacting performance), but
since it was odd I thought I might ask.

I have this supermegaquery:

SELECT
t.date AS status_date, lu.id AS memberid, lu.username AS
username, u.url AS url, ub.url_pattern AS urlpattern, lu.email AS
email,
lu.birth_date AS birthdate, lu.creation_date AS creationdate,
s.name AS state, co.name AS country,
opd.survey_id AS originalSurvey, c.name AS city , lu.confirmed
AS confirmed , pd.name AS action , sd.duration AS loi
FROM tracks t
LEFT JOIN surveyduration_v sd
ON sd.member_id = t.member_id
AND sd.survey_id = 5936
INNER JOIN all_users_v lu
ON lu.id = t.member_id
AND lu.panel_source_id = 1
LEFT JOIN track_status ts
ON ts.id = t.track_status_id
LEFT JOIN partners p
ON p.id = t.partner_id
LEFT JOIN urls u
ON u.id = t.url_id
AND u.survey_id = 5936
LEFT JOIN url_batchs ub
ON u.url_batch_id = ub.id
LEFT JOIN states s
ON lu.state_id = s.id
LEFT JOIN cities c
ON lu.city_id = c.id
LEFT JOIN countries co
ON lu.country_id = co.id
LEFT JOIN partner_deliveries pd
ON pd.id = t.partner_delivery_id
AND t.partner_id IS NOT NULL
LEFT JOIN partner_deliveries opd
ON opd.id = pd.originator_id
WHERE t.survey_id = 5936
AND t.track_status_id IN (5)

With the views

CREATE OR REPLACE VIEW surveyduration_v AS
SELECT date_part('epoch'::text, t.date - tl2.date) / 60::double
precision AS duration, t.member_id, t.survey_id
FROM tracks t
JOIN track_logs tl2 ON t.id = tl2.track_id
WHERE tl2.track_status_id = 8 AND t.track_status_id = 7;

CREATE OR REPLACE VIEW all_users_v AS
SELECT 1 AS panel_source_id, livra_users.id,
livra_users.birth_date, livra_users.creation_date, livra_users.email,
livra_users.first_name, livra_users.last_name, livra_users.username,
livra_users.image_link, livra_users.confirmed,
livra_users.is_panelist, livra_users.unregistered, livra_users.reason,
livra_users.privacy, livra_users.sex, livra_users.site,
livra_users.country_id, livra_users.state_id, livra_users.city_id,
livra_users.last_activity_date, livra_users.partner_id,
livra_users.survey_id, livra_users.panelist_update,
livra_users.panelist_percentage
FROM livra_users
UNION ALL
SELECT 2 AS panel_source_id, - external_users.id AS id,
NULL::timestamp without time zone AS birth_date,
external_users.creation_date, external_users.email, NULL::character
varying AS first_name, NULL::character varying AS last_name,
external_users.username, NULL::character varying AS image_link, true
AS confirmed, external_users.is_panelist, false AS unregistered,
NULL::integer AS reason, 0 AS privacy, NULL::integer AS sex,
external_users.site, external_users.country_id, NULL::integer AS
state_id, NULL::integer AS city_id, NULL::timestamp without time zone
AS last_activity_date, NULL::integer AS partner_id,
external_users.survey_id, NULL::bigint AS panelist_update,
NULL::smallint AS panelist_percentage
FROM external_users;

Server is 9.0.3 running on linux

The BIG tables are tracks, track_logs and urls, all > 30M rows.

One detail that could be related is that tracks.member_id is an
undeclared (denoramlized) foreign key to livra_users.

The resulting plan is:

"Hash Left Join (cost=51417.93..974563.27 rows=2241518 width=1276)"
" Hash Cond: ("*SELECT* 1".country_id = co.id)"
" -> Hash Left Join (cost=51415.40..941722.50 rows=2241518 width=1271)"
" Hash Cond: ("*SELECT* 1".state_id = s.id)"
" -> Hash Left Join (cost=51373.45..910859.68 rows=2241518 width=1263)"
" Hash Cond: (t.partner_delivery_id = pd.id)"
" Join Filter: (t.partner_id IS NOT NULL)"
" -> Hash Left Join (cost=32280.78..854175.26
rows=2241518 width=1256)"
" Hash Cond: ("*SELECT* 1".city_id = c.id)"
" -> Hash Join (cost=24183.20..792841.63
rows=2241518 width=1249)"
" Hash Cond: ("*SELECT* 1".id = t.member_id)"
" -> Append (cost=0.00..148254.38
rows=3008749 width=168)"
" -> Subquery Scan on "*SELECT* 1"
(cost=0.00..140223.96 rows=3008748 width=168)"
" -> Seq Scan on livra_users
(cost=0.00..110136.48 rows=3008748 width=168)"
" -> Subquery Scan on "*SELECT* 2"
(cost=0.00..8030.42 rows=1 width=60)"
" -> Result (cost=0.00..8030.41
rows=1 width=60)"
" One-Time Filter: false"
" -> Seq Scan on
external_users (cost=0.00..8030.41 rows=1 width=60)"
" -> Hash (cost=24181.34..24181.34 rows=149
width=188)"
" -> Hash Left Join
(cost=21650.42..24181.34 rows=149 width=188)"
" Hash Cond: (u.url_batch_id = ub.id)"
" -> Nested Loop Left Join
(cost=20828.08..23355.84 rows=149 width=115)"
" -> Merge Left Join
(cost=20828.08..20841.04 rows=149 width=44)"
" Merge Cond:
(t.member_id = t.member_id)"
" -> Sort
(cost=435.90..436.27 rows=149 width=32)"
" Sort Key: t.member_id"
" -> Index
Scan using idx_tracks_survey_id_track_status_id on tracks t
(cost=0.00..430.52 rows=149 width=32)"
" Index
Cond: ((survey_id = 5936) AND (track_status_id = 5))"
" -> Sort
(cost=20392.18..20398.28 rows=2440 width=20)"
" Sort Key: t.member_id"
" -> Nested
Loop (cost=0.00..20254.90 rows=2440 width=20)"
" ->
Index Scan using idx_tracks_survey_id_track_status_id on tracks t
(cost=0.00..2010.03 rows=712 width=20)"
"
Index Cond: ((survey_id = 5936) AND (track_status_id = 7))"
" ->
Index Scan using idx_track_logs_track_id on track_logs tl2
(cost=0.00..25.59 rows=3 width=16)"
"
Index Cond: (tl2.track_id = t.id)"
"
Filter: (tl2.track_status_id = 8)"
" -> Index Scan using
urls_pkey on urls u (cost=0.00..16.87 rows=1 width=87)"
" Index Cond: (u.id =
t.url_id)"
" Filter: (u.survey_id = 5936)"
" -> Hash (cost=637.15..637.15
rows=14815 width=81)"
" -> Seq Scan on
url_batchs ub (cost=0.00..637.15 rows=14815 width=81)"
" -> Hash (cost=4578.37..4578.37 rows=281537 width=15)"
" -> Seq Scan on cities c
(cost=0.00..4578.37 rows=281537 width=15)"
" -> Hash (cost=18270.17..18270.17 rows=65799 width=19)"
" -> Hash Left Join (cost=8842.48..18270.17
rows=65799 width=19)"
" Hash Cond: (pd.originator_id = opd.id)"
" -> Seq Scan on partner_deliveries pd
(cost=0.00..8019.99 rows=65799 width=19)"
" -> Hash (cost=8019.99..8019.99 rows=65799 width=8)"
" -> Seq Scan on partner_deliveries
opd (cost=0.00..8019.99 rows=65799 width=8)"
" -> Hash (cost=24.20..24.20 rows=1420 width=16)"
" -> Seq Scan on states s (cost=0.00..24.20 rows=1420 width=16)"
" -> Hash (cost=1.68..1.68 rows=68 width=13)"
" -> Seq Scan on countries co (cost=0.00..1.68 rows=68 width=13)"

The curious bit is the rowcount (2241518) which is grossly
misestimated. It gets to that rowcount when joining the all_users_v
view with tracks, both partial results are estimated at ~150 rows
(more or less on target), it's a join of int PK to int column, so I
cannot imagine how that join could result in 2M rows, what is pg
thinking to get to that number?

Even a full cross product couldn't get that high.

Performance isn't impacted, the plan, even with the misestimation, is
near optimal. But I can imagine this kind of misestimation wreaking
havoc in other situations.

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Stefan Keller 2011-09-16 23:15:26 Re: Hash index use presently(?) discouraged since 2005: revive or bury it?
Previous Message Merlin Moncure 2011-09-16 04:20:43 Re: Hash index use presently(?) discouraged since 2005: revive or bury it?