Re: Slow index scan on B-Tree index over timestamp field

From: Caio Casimiro <casimiro(dot)listas(at)gmail(dot)com>
To: Igor Neyman <ineyman(at)perceptron(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow index scan on B-Tree index over timestamp field
Date: 2013-11-04 21:10:15
Message-ID: CAK42QYGvZrZ9tjZrdw1xV4JgNikod+mFBhAB5Kr1NECOxL0Tmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi Neyman, thank you for your answer.

Unfortunately this query runs almost at the same time:

Sort (cost=4877693.98..4877702.60 rows=3449 width=20) (actual
time=25820.291..25821.845 rows=1640 loops=1)
Sort Key: tt.tweet_id
Sort Method: quicksort Memory: 97kB
Buffers: shared hit=1849 read=32788
-> Nested Loop (cost=247.58..4877491.32 rows=3449 width=20) (actual
time=486.839..25814.120 rows=1640 loops=1)
Buffers: shared hit=1849 read=32788
-> Hash Semi Join (cost=229.62..88553.23 rows=1681 width=8)
(actual time=431.654..13209.159 rows=597 loops=1)
Hash Cond: (t.user_id = relationship.followed_id)
Buffers: shared hit=3 read=31870
-> Index Scan using tweet_creation_time_index on tweet t
(cost=0.57..83308.25 rows=1781234 width=16) (actual
time=130.144..10037.764 rows=1759645 loops=1)
Index Cond: ((creation_time >= '2013-05-05
00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06
00:00:00-03'::timestamp with time zone))
Buffers: shared hit=1 read=31867
-> Hash (cost=227.12..227.12 rows=154 width=8) (actual
time=94.365..94.365 rows=106 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 3kB
Buffers: shared hit=2 read=3
-> Index Only Scan using relationship_id on
relationship (cost=0.42..227.12 rows=154 width=8) (actual
time=74.540..94.101 rows=106 loops=1)
Index Cond: (follower_id = 335093362)
Heap Fetches: 0
Buffers: shared hit=2 read=3
-> Bitmap Heap Scan on tweet_topic tt (cost=17.96..2841.63
rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)
Recheck Cond: (tweet_id = t.id)
Buffers: shared hit=1846 read=918
-> Bitmap Index Scan on tweet_topic_pk (cost=0.00..17.78
rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)
Index Cond: (tweet_id = t.id)
Buffers: shared hit=1763 read=632
Total runtime: 25823.386 ms

I have noticed that in both queries the index scan on
tweet_creation_time_index is very expensive. Is there anything I can do to
make the planner choose a index only scan?

Thank you,
Caio

On Mon, Nov 4, 2013 at 6:52 PM, Igor Neyman <ineyman(at)perceptron(dot)com> wrote:

>
>
> From: pgsql-performance-owner(at)postgresql(dot)org [mailto:
> pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of Caio Casimiro
> Sent: Monday, November 04, 2013 3:44 PM
> To: Jeff Janes
> Cc: pgsql-performance(at)postgresql(dot)org
> Subject: Re: [PERFORM] Slow index scan on B-Tree index over timestamp field
>
> Thank you very much for your answers guys!
>
> On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro(dot)listas(at)gmail(dot)com
> > wrote:
> Hello all,
>
> I have one query running at ~ 7 seconds and I would like to know if it's
> possible to make it run faster, once this query runs lots of time in my
> experiment.
>
>
> Do you mean you want it to be fast because it runs many times, or that you
> want it to become fast after it runs many times (i.e. once the data is
> fully cached)? The plan you show takes 24 seconds, not 7 seconds.
>
> I want it to be fast because it runs many times. I have an experiment that
> evaluates recommendation algorithms for a set of twitter users. This query
> returns recommendation candidates so it is called a lot of times for
> different users and time intervals.
>
>
>
> Basically the query return the topics of tweets published by users that
> the user N follows and that are published between D1 and D2.
>
> Query:
>
> SELECT tt.tweet_id, tt.topic, tt.topic_value
> FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id =
> t.id
> WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
> (SELECT followed_id FROM relationship WHERE follower_id = N)
> ORDER BY tt.tweet_id;
>
>
> I don't know if this affects the plan at all, but it is silly to do a left
> join to "tweet" when the WHERE clause has conditions that can't be
> satisfied with a null row. Also, you could try changing the IN-list to an
> EXISTS subquery.
>
> I'm sorry the ignorance, but I don't understand the issue with the left
> join, could you explain more?
> ...........................................
> Thank you very much again!
> Caio
>
>
> Just try the following:
>
> SELECT tt.tweet_id, tt.topic, tt.topic_value
> FROM tweet_topic AS tt JOIN tweet AS t ON (tt.tweet_id = t.id
> AND t.creation_time
> BETWEEN 'D1' AND 'D2' AND t.user_id in
> (SELECT followed_id FROM
> relationship WHERE follower_id = N))
> ORDER BY tt.tweet_id;
>
> And see if it helps with performance.
>
> Regards,
> Igor Neyman
>
>

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Elliot 2013-11-04 21:22:47 Re: Slow index scan on B-Tree index over timestamp field
Previous Message Igor Neyman 2013-11-04 20:52:57 Re: Slow index scan on B-Tree index over timestamp field