Re: Slow index scan on B-Tree index over timestamp field

From: desmodemone <desmodemone(at)gmail(dot)com>
To: Caio Casimiro <casimiro(dot)listas(at)gmail(dot)com>
Cc: Elliot <yields(dot)falsehood(at)gmail(dot)com>, Igor Neyman <ineyman(at)perceptron(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow index scan on B-Tree index over timestamp field
Date: 2013-11-04 22:59:27
Message-ID: CAEs9oFk6-eyyFdQ4t+cavE_NpvjrKJNzHj4W8LXGUV535okoQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,
I think you could try with an index on tweet table columns
"user_id, creation_time" [in this order , because the first argument is for
the equality predicate and the second with the range scan predicate, the
index tweet_user_id_creation_time_index is not ok because it has the
reverse order ] so the Hash Join between relationship and tweet will
become in theory a netsted loop and so the filter relationship.followed_id
= t.user_id will be pushed on the new index search condition with also
the creation_time > .. and creation_time < ... . In this manner you will
reduce the random i/o of the scanning of 1759645 rows from tweet that are
filter later now in hash join to 1679.

I hope it will work, if not, I hope you could attach the DDL of the table (
with constraints and indexes) to better understand the problem.

Bye

2013/11/4 Caio Casimiro <casimiro(dot)listas(at)gmail(dot)com>

> Hi Elliot, thank you for your answer.
>
> I tried this query but it still suffer with index scan on
> tweet_creation_time_index:
>
> "Sort (cost=4899904.57..4899913.19 rows=3447 width=20) (actual
> time=37560.938..37562.503 rows=1640 loops=1)"
> " Sort Key: tt.tweet_id"
> " Sort Method: quicksort Memory: 97kB"
> " Buffers: shared hit=1849 read=32788"
> " -> Nested Loop (cost=105592.06..4899702.04 rows=3447 width=20)
> (actual time=19151.036..37555.227 rows=1640 loops=1)"
> " Buffers: shared hit=1849 read=32788"
> " -> Hash Join (cost=105574.10..116461.68 rows=1679 width=8)
> (actual time=19099.848..19127.606 rows=597 loops=1)"
> " Hash Cond: (relationship.followed_id = t.user_id)"
> " Buffers: shared hit=3 read=31870"
> " -> Index Only Scan using relationship_id on relationship
> (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721
> rows=106 loops=1)"
> " Index Cond: (follower_id = 335093362)"
> " Heap Fetches: 0"
> " Buffers: shared hit=2 read=3"
> " -> Hash (cost=83308.25..83308.25 rows=1781234 width=16)
> (actual time=19031.916..19031.916 rows=1759645 loops=1)"
> " Buckets: 262144 Batches: 1 Memory Usage: 61863kB"
> " Buffers: shared hit=1 read=31867"
> " -> Index Scan using tweet_creation_time_index on
> tweet t (cost=0.57..83308.25 rows=1781234 width=16) (actual
> time=48.595..13759.768 rows=1759645 loops=1)"
> " Index Cond: ((creation_time >= '2013-05-05
> 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06
> 00:00:00-03'::timestamp with time zone))"
> " Buffers: shared hit=1 read=31867"
> " -> Bitmap Heap Scan on tweet_topic tt (cost=17.96..2841.63
> rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)"
> " Recheck Cond: (tweet_id = t.id)"
> " Buffers: shared hit=1846 read=918"
> " -> Bitmap Index Scan on tweet_topic_pk (cost=0.00..17.78
> rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)"
> " Index Cond: (tweet_id = t.id)"
> " Buffers: shared hit=1763 read=632"
>
> You said that I would need B-Tree indexes on the fields that I want the
> planner to use index only scan, and I think I have them already on the
> tweet table:
>
> "tweet_ios_index" btree (id, user_id, creation_time)
>
> Shouldn't the tweet_ios_index be enough to make the scan over
> tweet_creation_time_index be a index only scan? And, more important, would
> it be really faster?
>
> Thank you very much,
> Caio
>
>
> On Mon, Nov 4, 2013 at 7:22 PM, Elliot <yields(dot)falsehood(at)gmail(dot)com> wrote:
>
>> On 2013-11-04 16:10, Caio Casimiro wrote:
>>
>> Hi Neyman, thank you for your answer.
>>
>> Unfortunately this query runs almost at the same time:
>>
>> Sort (cost=4877693.98..4877702.60 rows=3449 width=20) (actual
>> time=25820.291..25821.845 rows=1640 loops=1)
>> Sort Key: tt.tweet_id
>> Sort Method: quicksort Memory: 97kB
>> Buffers: shared hit=1849 read=32788
>> -> Nested Loop (cost=247.58..4877491.32 rows=3449 width=20) (actual
>> time=486.839..25814.120 rows=1640 loops=1)
>> Buffers: shared hit=1849 read=32788
>> -> Hash Semi Join (cost=229.62..88553.23 rows=1681 width=8)
>> (actual time=431.654..13209.159 rows=597 loops=1)
>> Hash Cond: (t.user_id = relationship.followed_id)
>> Buffers: shared hit=3 read=31870
>> -> Index Scan using tweet_creation_time_index on tweet t
>> (cost=0.57..83308.25 rows=1781234 width=16) (actual
>> time=130.144..10037.764 rows=1759645 loops=1)
>> Index Cond: ((creation_time >= '2013-05-05
>> 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06
>> 00:00:00-03'::timestamp with time zone))
>> Buffers: shared hit=1 read=31867
>> -> Hash (cost=227.12..227.12 rows=154 width=8) (actual
>> time=94.365..94.365 rows=106 loops=1)
>> Buckets: 1024 Batches: 1 Memory Usage: 3kB
>> Buffers: shared hit=2 read=3
>> -> Index Only Scan using relationship_id on
>> relationship (cost=0.42..227.12 rows=154 width=8) (actual
>> time=74.540..94.101 rows=106 loops=1)
>> Index Cond: (follower_id = 335093362)
>> Heap Fetches: 0
>> Buffers: shared hit=2 read=3
>> -> Bitmap Heap Scan on tweet_topic tt (cost=17.96..2841.63
>> rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)
>> Recheck Cond: (tweet_id = t.id)
>> Buffers: shared hit=1846 read=918
>> -> Bitmap Index Scan on tweet_topic_pk (cost=0.00..17.78
>> rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)
>> Index Cond: (tweet_id = t.id)
>> Buffers: shared hit=1763 read=632
>> Total runtime: 25823.386 ms
>>
>> I have noticed that in both queries the index scan on
>> tweet_creation_time_index is very expensive. Is there anything I can do to
>> make the planner choose a index only scan?
>>
>>
>> Yes, because that part of the query is kicking back so many rows, many
>> of which are totally unnecessary anyway - you're first getting all the
>> tweets in a particular time range, then limiting them down to just users
>> that are followed. Here's clarification on the approach I mentioned
>> earlier. All you should really need are basic (btree) indexes on your
>> different keys (tweet_topic.tweet_id, tweet.id, tweet.user_id,
>> relationship.follower_id, relationship.followed_id). I also changed the
>> left join to an inner join as somebody pointed out that your logic amounted
>> to reducing the match to an inner join anyway.
>>
>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>> FROM tweet_topic AS tt
>> JOIN tweet AS t
>> ON tt.tweet_id = t.id
>> join relationship
>> on t.user_id = relationship.followed_id
>>
>> WHERE creation_time BETWEEN 'D1' AND 'D2'
>> AND relationship.follower_id = N
>> ORDER BY tt.tweet_id
>> ;
>>
>>
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Caio Casimiro 2013-11-04 23:24:07 Re: Slow index scan on B-Tree index over timestamp field
Previous Message Caio Casimiro 2013-11-04 22:10:21 Re: Slow index scan on B-Tree index over timestamp field