Re: star schema and the optimizer

From: Marc Cousin <cousinmarc(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: star schema and the optimizer
Date: 2015-02-28 08:03:52
Message-ID: 54F17668.902@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 27/02/2015 20:01, Marc Cousin wrote:
> On 27/02/2015 19:45, Tom Lane wrote:
>>> I wrote:
>>>> I had actually thought that we'd fixed this type of problem in recent
>>>> versions, and that you should be able to get a plan that would look
>>>> like
>>
>>>> Nestloop
>>>> -> scan dim1
>>>> -> Nestloop
>>>> -> scan dim2
>>>> -> indexscan fact table using dim1.a and dim2.b
>>
>> After closer study, I think this is an oversight in commit
>> e2fa76d80ba571d4de8992de6386536867250474, which quoth
>>
>> +It can be useful for the parameter value to be passed down through
>> +intermediate layers of joins, for example:
>> +
>> + NestLoop
>> + -> Seq Scan on A
>> + Hash Join
>> + Join Condition: B.Y = C.W
>> + -> Seq Scan on B
>> + -> Index Scan using C_Z_IDX on C
>> + Index Condition: C.Z = A.X
>> +
>> +If all joins are plain inner joins then this is unnecessary, because
>> +it's always possible to reorder the joins so that a parameter is used
>> +immediately below the nestloop node that provides it. But in the
>> +presence of outer joins, join reordering may not be possible, and then
>> +this option can be critical. Before version 9.2, Postgres used ad-hoc
>>
>> This reasoning overlooked the fact that if we need parameters from
>> more than one relation, and there's no way to join those relations
>> to each other directly, then we have to allow passing the dim1 parameter
>> down through the join to dim2.
>>
>> The attached patch seems to fix it (modulo the need for some updates
>> in the README, and maybe a regression test). Could you see if this
>> produces satisfactory plans for you?
>
>
> From what I see, it's just perfect. I'll give it a more thorough look a
> bit later, but it seems to be exactly what I was waiting for.
>
> Thanks a lot.
>
> Regards

I gave it another look this morning. It works very well with the initial test schema. The situation is much improved for me.

I still have one issue: I've extended the test to more than 2 dimensions.

marc=# explain analyze SELECT * FROM dim1,dim2,dim3,dim4,facts WHERE facts.dim1=dim1.a and facts.dim2=dim2.a and facts.dim3=dim3.a and facts.dim4=dim4.a and dim1.b between 10 and 60 AND dim2.b between 30 and 50 and dim3.b=8526 and dim4.b between 70 and 90;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=15.00..957710.99 rows=1 width=200) (actual time=793.247..793.247 rows=0 loops=1)
-> Nested Loop (cost=14.71..957704.75 rows=1 width=192) (actual time=793.245..793.245 rows=0 loops=1)
Join Filter: (facts.dim3 = dim3.a)
Rows Removed by Join Filter: 942
-> Index Scan using idxdim32 on dim3 (cost=0.29..3300.29 rows=10 width=8) (actual time=49.498..67.923 rows=1 loops=1)
Filter: (b = 8526)
Rows Removed by Filter: 99999
-> Materialize (cost=14.41..954262.56 rows=962 width=184) (actual time=0.552..724.988 rows=942 loops=1)
-> Nested Loop (cost=14.41..954257.75 rows=962 width=184) (actual time=0.542..723.380 rows=942 loops=1)
-> Index Scan using idxdim22 on dim2 (cost=0.29..3648.29 rows=192 width=16) (actual time=0.074..47.445 rows=229 loops=1)
Filter: ((b >= 30) AND (b <= 50))
Rows Removed by Filter: 99771
-> Nested Loop (cost=14.12..4946.08 rows=501 width=168) (actual time=2.575..2.946 rows=4 loops=229)
-> Bitmap Heap Scan on dim1 (cost=13.43..573.60 rows=501 width=16) (actual time=0.170..0.647 rows=509 loops=229)
Recheck Cond: ((b >= 10) AND (b <= 60))
Heap Blocks: exact=78547
-> Bitmap Index Scan on idxdim11 (cost=0.00..13.30 rows=501 width=0) (actual time=0.112..0.112 rows=509 loops=229)
Index Cond: ((b >= 10) AND (b <= 60))
-> Index Scan using idxfacts on facts (cost=0.69..8.72 rows=1 width=152) (actual time=0.004..0.004 rows=0 loops=116561)
Index Cond: ((dim1 = dim1.a) AND (dim2 = dim2.a))
-> Index Scan using idxdim42 on dim4 (cost=0.29..6.24 rows=1 width=8) (never executed)
Index Cond: (a = facts.dim4)
Filter: ((b >= 70) AND (b <= 90))
Planning time: 8.092 ms
Execution time: 793.621 ms
(25 lignes)

marc=# \d facts
Table « public.facts »
Colonne | Type | Modificateurs
---------+--------+---------------
dim1 | bigint |
dim2 | bigint |
dim3 | bigint |
dim4 | bigint |
dim5 | bigint |
dim6 | bigint |
dim7 | bigint |
dim8 | bigint |
dim9 | bigint |
dim10 | bigint |
data1 | text |
data2 | text |
data3 | text |
data4 | text |
data5 | text |
data6 | text |
Index :
"idxfacts" btree (dim1, dim2, dim3, dim4, dim5, dim6, dim7, dim8, dim9, dim10)

I hoped that the dim3=dim3.a condition could be used in the index scan on facts (I have chosen the 8526 value because it only has one matching row). Maybe that's not possible, or not efficient, but I thought I'd rather ask.

Regards.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Paolo Losi 2015-02-28 09:08:30 pushing order by + limit to union subqueries
Previous Message Vadim Gribanov 2015-02-28 07:34:20 Re: Docs about shared memory