poor execution plan because column dependence

From: Václav Ovsík <vaclav(dot)ovsik(at)i(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: poor execution plan because column dependence
Date: 2011-04-12 23:23:43
Message-ID: 20110412232343.GA15966@bobek.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,
I have done migration of the Request Tracker 3.8.9
(http://requesttracker.wikia.com/wiki/HomePage) from Mysql to
PostgreSQL in testing environment.
The RT schema used can be viewed at
https://github.com/bestpractical/rt/blob/3.8-trunk/etc/schema.Pg.
I have added full text search on table Attachments based on trigrams
(and still experimenting with it), but is is not interesting for the
problem (the problem is not caused by it directly).
The full text search alone works quite good. A user testing a new RT instance
reported a poor performance problem with a bit more complex query (more
conditions resulting in table joins).
Queries are constructed by module DBIx::SearchBuilder.
The problematic query logged:

rt=# EXPLAIN ANALYZE SELECT DISTINCT main.* FROM Tickets main JOIN Transactions Transactions_1 ON ( Transactions_1.ObjectId = main.id ) JOIN Attachments Attachments_2 ON ( Attachments_2.TransactionId = Transactions_1.id ) WHERE (Transactions_1.ObjectType = 'RT::Ticket') AND (main.Status != 'deleted') AND (main.Status = 'resolved' AND main.LastUpdated > '2008-12-31 23:00:00' AND main.Created > '2005-12-31 23:00:00' AND main.Queue = '15' AND ( Attachments_2.trigrams @@ text_to_trgm_tsquery('uir') AND Attachments_2.Content ILIKE '%uir%' ) ) AND (main.Type = 'ticket') AND (main.EffectiveId = main.id) ORDER BY main.id ASC;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=23928.60..23928.67 rows=1 width=162) (actual time=5201.139..5207.965 rows=649 loops=1)
-> Sort (cost=23928.60..23928.61 rows=1 width=162) (actual time=5201.137..5201.983 rows=5280 loops=1)
Sort Key: main.effectiveid, main.issuestatement, main.resolution, main.owner, main.subject, main.initialpriority, main.finalpriority, main.priority, main.timeestimated, main.timeworked, main.timeleft, main.told, main.starts, main.started, main.due, main.resolved, main.lastupdatedby, main.lastupdated, main.creator, main.created, main.disabled
Sort Method: quicksort Memory: 1598kB
-> Nested Loop (cost=0.00..23928.59 rows=1 width=162) (actual time=10.060..5120.834 rows=5280 loops=1)
-> Nested Loop (cost=0.00..10222.38 rows=1734 width=166) (actual time=8.702..1328.970 rows=417711 loops=1)
-> Seq Scan on tickets main (cost=0.00..5687.88 rows=85 width=162) (actual time=8.258..94.012 rows=25410 loops=1)
Filter: (((status)::text <> 'deleted'::text) AND (lastupdated > '2008-12-31 23:00:00'::timestamp without time zone) AND (created > '2005-12-31 23:00:00'::timestamp without time zone) AND (effectiveid = id) AND (queue = 15) AND ((type)::text = 'ticket'::text) AND ((status)::text = 'resolved'::text))
-> Index Scan using transactions1 on transactions transactions_1 (cost=0.00..53.01 rows=27 width=8) (actual time=0.030..0.039 rows=16 loops=25410)
Index Cond: (((transactions_1.objecttype)::text = 'RT::Ticket'::text) AND (transactions_1.objectid = main.effectiveid))
-> Index Scan using attachments2 on attachments attachments_2 (cost=0.00..7.89 rows=1 width=4) (actual time=0.008..0.009 rows=0 loops=417711)
Index Cond: (attachments_2.transactionid = transactions_1.id)
Filter: ((attachments_2.trigrams @@ '''uir'''::tsquery) AND (attachments_2.content ~~* '%uir%'::text))
Total runtime: 5208.149 ms
(14 rows)

The above times are for already cached data (repeated query).
I think the execution plan is poor. Better would be to filter table attachments
at first and then join the rest. The reason is a bad estimate on number of rows
returned from table tickets (85 estimated -> 25410 in the reality).
Eliminating sub-condition...

rt=# explain analyze select * from tickets where effectiveid = id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Seq Scan on tickets (cost=0.00..4097.40 rows=530 width=162) (actual time=0.019..38.130 rows=101869 loops=1)
Filter: (effectiveid = id)
Total runtime: 54.318 ms
(3 rows)

Estimated 530 rows, but reality is 101869 rows.

The problem is the strong dependance between id and effectiveid. The RT
documentation says:

EffectiveId:
By default, a ticket's EffectiveId is the same as its ID. RT supports the
ability to merge tickets together. When you merge a ticket into
another one, RT sets the first ticket's EffectiveId to the second
ticket's ID. RT uses this data to quickly look up which ticket
you're really talking about when you reference a merged ticket.

I googled the page http://wiki.postgresql.org/wiki/Cross_Columns_Stats

Maybe I identified the already documented problem. What I can do with this
situation? Some workaround?

Thanks in advance for any suggestions.
Best Regards
--
Zito

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Bob Lunney 2011-04-13 00:14:29 Re: poor execution plan because column dependence
Previous Message Tomas Vondra 2011-04-12 22:36:58 Re: Performance