Re: slow plan on join when adding where clause

From: "L(dot) Loewe" <lloewe(at)hotmail(dot)com>
To: pgsql-novice(at)postgresql(dot)org
Subject: Re: slow plan on join when adding where clause
Date: 2010-04-08 18:47:10
Message-ID: op.vauzww0lb4gxma@oakcreek2
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Nope. Still slow with sensor_id = 1.
The problem appears to be unrelated to the indexes but the the planners
assumption that
sensor_id=1 will return about 500K rows, which is correct and
start<ti<stop will return 5000K rows, which is way off of the actual 800.

My current solution is to create a (temporary) table with all sensors and
use it for sensor based queries - it works but requires some maintenance
which I wanted to avoid.

regards

On Thu, 08 Apr 2010 10:40:14 -0600, Mladen Gogala <mgogala(at)vmsinfo(dot)com>
wrote:

> How about creating index on both columns and drop the separate index?
> The composite index on ti and sensor_id could still be used for the
> original query but would probably give you a killer performance when the
> columns are used together.
>
> L. Loewe wrote:
>> Thanks.
>>
>> That did change the plan a bit but it's still not doing it the way
>> it does it when searching all sensors - and it's still a lot slower.
>>
>>
>> Nested Loop (cost=2.36..1482129.06 rows=2629241 width=68)
>> (actual time=30983.301..31827.299 rows=772 loops=1)
>> Join Filter: ((main.ti > events.start) AND (main.ti < events.stop))
>> -> Seq Scan on main (cost=0.00..890547.50 rows=190832 width=40)
>> (actual time=54.095..18136.153 rows=492150 loops=1)
>> Filter: ((sensor_id + 0) = 1)
>> -> Materialize (cost=2.36..3.60 rows=124 width=28)
>> (actual time=0.000..0.009 rows=124 loops=492150)
>> -> Seq Scan on events (cost=0.00..2.24 rows=124 width=28)
>> (actual time=0.010..0.021 rows=124 loops=1)
>> Total runtime: 31827.435 ms
>>
>> I'm not sure the index is the problem - seems more like the way it
>> assembles the data.
>> If I read this right the planner takes the rows matching sensor_id=1
>> first and then joins the result with the time slices of events.
>>
>> This'd be probably a good idea if events had a lot of rows or the time
>> slices were large.
>> But with the data at hand it is a lot faster to take the rows
>> of events first and then match each one to main.ti (which is what
>> the planner does without the where clause).
>>
>> It's understandable that the planner cannot properly figure out
>> how many rows these start-stop slices will select, however since
>> it appears to make an assumption I was looking for a way to
>> tell it which value to assume...
>>
>> Regards
>>
>>
>>
>>
>> On Tue, 30 Mar 2010 12:06:46 -0600, Mladen Gogala <mgogala(at)vmsinfo(dot)com>
>> wrote:
>>
>>> There is an old trick which can help you here. Try doing this:
>>> SELECT * FROM events, main WHERE main.ti > events.start and
>>> main.ti < events.stop and
>>> sensor_id+0=1;
>>>
>>> That will stop planner from merging two indexes and using bitmap.
>>>
>>> L. Loewe wrote:
>>>> Hi all
>>>>
>>>>
>>>> I have a fairly large table ("main" ~50M rows) containing a timestamp
>>>> (indexed),
>>>> a sensor_id (indexed) and some sensor data ranging over a few years.
>>>>
>>>> The second table ("events" <500 rows) contains a set of interesting
>>>> events with
>>>> an events.id and two timestamps: events.start and events.stop plus
>>>> some
>>>> additional data.
>>>>
>>>> Now I want to join these tables to get the sensor data for those
>>>> events.
>>>>
>>>> The interval between start and stop is quite short for each event
>>>> (usually a
>>>> couple of minutes) so that there aren't too many rows from table
>>>> "main"
>>>> matching this criteria (~1K) for each event:
>>>>
>>>> SELECT * FROM events, main WHERE main.ti > events.start and main.ti <
>>>> events.stop;
>>>>
>>>> EXPLAIN ANALYZE gives
>>>>
>>>> Nested Loop (cost=0.00..27621542.27 rows=524505120 width=68)
>>>> (actual time=0.038..42.314 rows=69209 loops=1)
>>>> -> Seq Scan on events (cost=0.00..2.24 rows=124 width=28)
>>>> (actual time=0.006..0.025 rows=124 loops=1)
>>>> -> Index Scan using idx_main_ti on main (cost=0.00..159306.16
>>>> rows=4229880 width=40)
>>>> (actual time=0.016..0.178 rows=558 loops=124)
>>>> Index Cond: ((main.ti > events.start) AND (main.ti <
>>>> events.stop))
>>>> Total runtime: 47.682 ms
>>>>
>>>> So far so good, however if I add a sensor_id constraint the planner
>>>> chooses a
>>>> different approach:
>>>>
>>>>
>>>> SELECT * FROM events, main WHERE main.ti > events.start and
>>>> main.ti < events.stop and
>>>> sensor_id=1;
>>>>
>>>> Nested Loop (cost=7309.32..1422246.30 rows=4795865 width=68)
>>>> (actual time=23427.599..23886.276 rows=772 loops=1)
>>>> Join Filter: ((main.ti > events.start) AND (main.ti < events.stop))
>>>> -> Bitmap Heap Scan on main (cost=7306.96..343174.23 rows=348087
>>>> width=40)
>>>> (actual time=3771.719..9508.728 rows=490984 loops=1)
>>>> Recheck Cond: (sensor_id= 1)
>>>> -> Bitmap Index Scan on idx_main_sens (cost=0.00..7219.94
>>>> rows=348087 width=0)
>>>> (actual time=3769.075..3769.075 rows=491102 loops=1)
>>>> Index Cond: (sensor_id= 1)
>>>> -> Materialize (cost=2.36..3.60 rows=124 width=28)
>>>> (actual time=0.000..0.010 rows=124 loops=490984)
>>>> -> Seq Scan on events (cost=0.00..2.24 rows=124 width=28)
>>>> (actual time=0.005..0.021 rows=124 loops=1)
>>>> Total runtime: 23886.494 ms
>>>>
>>>>
>>>> Apparently the planner assumes that sensor_id=1 will return fewer
>>>> rows than
>>>> the time constraint while the opposite is true:
>>>> sensor_id=1 -> ~ 500K , time window -> ~ 1K.
>>>>
>>>> Is there a way to hint the planner to use plan 1 even with the
>>>> sensor_id=1
>>>> clause or am I doing something fundamentally wrong here?
>>>>
>>>> Thanks
>>>>
>>>
>>
>
>

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

In response to

Browse pgsql-novice by date

  From Date Subject
Next Message Mladen Gogala 2010-04-08 18:49:16 Re: FW: Postgres cpu & memory usage
Previous Message Mladen Gogala 2010-04-08 16:40:14 Re: slow plan on join when adding where clause