From: | Nicolas Paris <niparisco(at)gmail(dot)com> |
---|---|
To: | |
Cc: | pgsql-performance <pgsql-performance(at)postgresql(dot)org> |
Subject: | Re: PG 9.3 materialized view VS Views, indexes, shared memory |
Date: | 2015-03-09 13:17:48 |
Message-ID: | CA+ssMORdU8WZWHzPtEYXrcGzb9fnY3QnV+zoU29-T1yPv=6VmQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
(sorry for top-posting, gmail does not help.)
Thanks to your advice Jim, I have done an other test :
No partial indexes, just a partial index on boolean columns does the job.
(I get same perfs as MV)
CREATE INDEX ..ON (BoolColumnX) WHERE BoolColumnX IS TRUE
Then VIEW =
SELECT colA....colZ
FROM huge_table
WHERE BoolColumnX IS TRUE
Then this only index is used 800times (for each bool col) and saves place
as it does'nt indexes NULL values, and does no replicate. subsets. Moreover
the huge indexes are allways loaded in cache memory.
Nicolas PARIS
2015-03-06 21:26 GMT+01:00 Nicolas Paris <niparisco(at)gmail(dot)com>:
> According to this link
> http://postgresql.nabble.com/NULL-saves-disk-space-td4344106.html
> NULL values do not take place if only one other column are null for that
> row.
> Boolean takes 1 byte wheras smallint 2bytes.
> Then the space problem is not anymore a problem with boolean columns 95%
> empty
>
> One thing that is really great with postgresql is transaction for drop
> table cascade, that allow te restore all stuf index, views on a rollback if
> problem in loading appears.
> I hope using one transaction to drop/load many table is not a performance
> issue ?
>
> Nicolas PARIS
>
> 2015-03-06 11:25 GMT+01:00 Nicolas Paris <niparisco(at)gmail(dot)com>:
>
>> Thanks Jim,
>>
>> My suggestion is to test using nothing but plain views and plain indexes
>>> on the base table
>>
>> Actualy the way I choose subset rows for views is complicated in terms of
>> query. Then using simple views without partial indexes is terrible in terms
>> of performance (I have tested that).
>>
>> You mean the materialized views, right?
>>
>> Well I have tested matviews, views without partial indexes, views with
>> hashjoin on a key, ..
>>
>>
>>
>>> I think you'll ultimately be unhappy trying to go down this route, for
>>> the reasons you mention, plus the very large amount of extra space you'll
>>> be using. 800 booleans is 800 extra bytes for every row in your fact table.
>>> That's a lot. Even if you used a bitmap instead (which means you have to
>>> mess around with tracking which bit means what and probably other problems
>>> as well) you're still looking at 100 bytes per row. That's nothing to
>>> sneeze at.
>>
>>
>> Since each subset is about 5% (this number is decreasing when number of
>> views increase) of the fact table, most boolean rows are null. This means
>> 5% of 800 extra bytes, right ? I have choosen smallint, because of bulk
>> load decrease csv size (true VS 1).
>>
>> For the views that are too slow, look at what the expensive part of the
>>> view and materialize *only that*.
>>
>> It would be great if I could, but all will be automatic, then It will be
>> difficult to apply such rules that demands human analyse, and manual
>> database modification, for one subset
>>
>>
>> Hope I have well anderstand you
>>
>>
>>
>> Nicolas PARIS
>>
>> 2015-03-06 10:32 GMT+01:00 Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>:
>>
>>> On 3/6/15 2:16 AM, Nicolas Paris wrote:
>>>
>>>> If you have that many different views I doubt you want that many
>>>> indexes anyway.
>>>>
>>>> It's a datawarehouse, then each view is used by many user for each
>>>> query.
>>>> Those views must be subset of the huge material table. All indexes are
>>>> needed
>>>>
>>>
>>> Yes, but they don't have to be partial.
>>>
>>> Have you tried just hitting the base table and indexes directly,
>>>> either through plain views or just direct SQL?
>>>>
>>>> I have tried each. The performances are worst querying on a subset
>>>> (the views) than querying on whole huge table when using the huge
>>>> indexes
>>>>
>>>
>>> You mean the materialized views, right? If so, that makes sense: Instead
>>> of having all your users hitting one common set of data (your fact table)
>>> you had them hitting a bunch of other data (the mat views). But you still
>>> had other stuff hitting the fact table. So now you were dealing with a lot
>>> more data than if you just stuck to the single fact table.
>>>
>>> => this is the solution I am implementing. (800 is not true, but in 10
>>>> years it maybe will be)
>>>>
>>>
>>> In 10 years we'll all be using quantum computers anyway... ;P
>>>
>>> Actually, I have added a boolean column on the huge table for each
>>>> views. This is the way each view is a subset of huge table (Create View
>>>> as Select * FROM hugeTable WHERE columnX is true --etc 800 times). Then
>>>> I create 800partials indexes on that column(create index...WHERE columnX
>>>> is TRUE), for each view.
>>>> This works great as the query planer chooses the partials indexes when
>>>> querying the little subset of the terrific table (potential 20bilion
>>>> rows)
>>>>
>>>> This is better than material views for some reasons :
>>>> - saves places on hard drive (columnX is boolean +same indexes - data
>>>> for MatViews)
>>>> - saves time generating materialised views
>>>>
>>>
>>> But this isn't better than the mat views because of a bunch of booleans;
>>> it's better because it means less stain on the disk cache.
>>>
>>> This is quite more complicated because in the project, the number of
>>>> view is increasing, and dynamic then :
>>>> - then adding new mat views is simple
>>>> - adding new views => adding new column on the huge table. It can take
>>>> long time to update boolean for each tuple. Then I need to truncate/bulk
>>>> load all data each time I add a new View. Other problem is dynamic
>>>> number column table was a bit tricky to implement in an ETL soft such
>>>> Talend, but the benefits are I hope great.
>>>>
>>>
>>> I think you'll ultimately be unhappy trying to go down this route, for
>>> the reasons you mention, plus the very large amount of extra space you'll
>>> be using. 800 booleans is 800 extra bytes for every row in your fact table.
>>> That's a lot. Even if you used a bitmap instead (which means you have to
>>> mess around with tracking which bit means what and probably other problems
>>> as well) you're still looking at 100 bytes per row. That's nothing to
>>> sneeze at.
>>>
>>> My suggestion is to test using nothing but plain views and plain indexes
>>> on the base table. I expect that some of those views will not perform
>>> adequately, but many (or most) of them will be fine. For the views that are
>>> too slow, look at what the expensive part of the view and materialize *only
>>> that*. I suspect you'll find that when you do that you'll discover that
>>> several views are slow because of the same thing, so if you materialize
>>> that one thing one time you can then use it to speed up several views.
>>>
>>> Using that approach means you'll have a lot less data that you have to
>>> read.
>>>
>>> --
>>> Jim Nasby, Data Architect, Blue Treble Consulting
>>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>>>
>>
>>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Jim Nasby | 2015-03-10 08:31:29 | Re: PG 9.3 materialized view VS Views, indexes, shared memory |
Previous Message | Guillaume Lelarge | 2015-03-08 09:09:15 | Re: CREATE INDEX uses INDEX ? |