Re: Slow index scan on B-Tree index over timestamp field

From: Caio Casimiro <casimiro(dot)listas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Slow index scan on B-Tree index over timestamp field
Date: 2013-11-04 20:50:25
Message-ID: CAK42QYH42T1R8HmgTXEfoc_ZLpi5D3tcSX92+FvFD6tD71JA-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I should also say that table tweet has more than 400 millions hows and
table tweet_topic has estimated more than 800 millions rows.

Thanks again,
Caio

On Mon, Nov 4, 2013 at 6:44 PM, Caio Casimiro <casimiro(dot)listas(at)gmail(dot)com>wrote:

> Thank you very much for your answers guys!
>
>
> On Mon, Nov 4, 2013 at 5:15 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> On Sun, Nov 3, 2013 at 2:05 PM, Caio Casimiro <casimiro(dot)listas(at)gmail(dot)com>
>> wrote:
>>
>>> Hello all,
>>>
>>> I have one query running at ~ 7 seconds and I would like to know if it's
>>> possible to make it run faster, once this query runs lots of time in my
>>> experiment.
>>>
>>
>>
>> Do you mean you want it to be fast because it runs many times, or that
>> you want it to become fast after it runs many times (i.e. once the data is
>> fully cached)? The plan you show takes 24 seconds, not 7 seconds.
>>
>
> I want it to be fast because it runs many times. I have an experiment that
> evaluates recommendation algorithms for a set of twitter users. This query
> returns recommendation candidates so it is called a lot of times for
> different users and time intervals.
>
>
>>
>>
>>>
>>> Basically the query return the topics of tweets published by users that
>>> the user N follows and that are published between D1 and D2.
>>>
>>> *Query*:
>>>
>>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>>> FROM tweet_topic AS tt LEFT JOIN tweet AS t ON tt.tweet_id =
>>> t.id
>>> WHERE creation_time BETWEEN 'D1' AND 'D2' AND user_id in
>>> (SELECT followed_id FROM relationship WHERE follower_id = N)
>>> ORDER BY tt.tweet_id;
>>>
>>
>>
>> I don't know if this affects the plan at all, but it is silly to do a
>> left join to "tweet" when the WHERE clause has conditions that can't be
>> satisfied with a null row. Also, you could try changing the IN-list to an
>> EXISTS subquery.
>>
>
> I'm sorry the ignorance, but I don't understand the issue with the left
> join, could you explain more?
>
>
>> Is there some patterns to D1 and D2 that could help the caching? For
>> example, are they both usually in the just-recent past?
>>
> The only pattern is that it is always a one day interval, e.g. D1 =
> '2013-05-01' and D2 = '2013-05-02'.
>
>>
>>
>> Indexes:
>>> "tweet_plk" PRIMARY KEY, btree (id) CLUSTER
>>> "tweet_creation_time_index" btree (creation_time)
>>> "tweet_id_index" hash (id)
>>> "tweet_ios_index" btree (id, user_id, creation_time)
>>> "tweet_retweeted_idx" hash (retweeted)
>>> "tweet_user_id_creation_time_index" btree (creation_time, user_id)
>>> "tweet_user_id_index" hash (user_id)
>>>
>>
>>
>> Are all of those indexes important? If your table is heavily
>> updated/inserted, which I assume it is, maintaining those indexes is going
>> to take up precious RAM that could probably be better used elsewhere.
>>
>
> Probably not. But once this database is read only, the quantity of index
> grew following my desperation. =)
>
>>
>> Cheers,
>>
>> Jeff
>>
>
> Thank you very much again!
> Caio
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Igor Neyman 2013-11-04 20:52:57 Re: Slow index scan on B-Tree index over timestamp field
Previous Message Caio Casimiro 2013-11-04 20:44:05 Re: Slow index scan on B-Tree index over timestamp field