From: Nicolas Paris <niparisco(at)gmail(dot)com>
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: PG 9.3 materialized view VS Views, indexes, shared memory
Date: 2015-03-06 20:26:27
Message-ID: CA+ssMOT54r9Q349ZBRhuVg6n-3-cp0TpG8Piy5qCTLfumO_Yog@mail.gmail.com
Lists: pgsql-performance
According to this link:
http://postgresql.nabble.com/NULL-saves-disk-space-td4344106.html
NULL values take no extra disk space once at least one other column in the
row is NULL (the row's null bitmap already exists).
A boolean takes 1 byte whereas a smallint takes 2 bytes.
So disk space is no longer a problem with boolean columns that are 95%
NULL.
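For reference, the flag-column pattern discussed in this thread could be sketched as follows (table, column, and index names are illustrative, not the actual schema):

```sql
-- Hypothetical names: the real fact table and flag columns come from the ETL.
-- One nullable boolean flag per view; NULL for the ~95% of rows outside the subset.
ALTER TABLE hugetable ADD COLUMN in_subset_42 boolean;

-- Partial index: only the flagged rows are indexed, so it stays small.
CREATE INDEX idx_subset_42 ON hugetable (some_key)
  WHERE in_subset_42 IS TRUE;

-- Plain view over the flagged subset; the planner can pick the partial index
-- because the view's WHERE clause matches the index predicate.
CREATE VIEW subset_42 AS
  SELECT * FROM hugetable
  WHERE in_subset_42 IS TRUE;
```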
One thing that is really great with PostgreSQL is transactional DDL: running
DROP TABLE ... CASCADE inside a transaction allows restoring all the
dependent stuff (indexes, views) with a ROLLBACK if a problem appears
during loading.
I hope using one transaction to drop/load many tables is not a performance
issue?
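A minimal sketch of that transactional drop/reload (names and paths are illustrative):

```sql
BEGIN;
DROP TABLE hugetable CASCADE;      -- also drops dependent views and partial indexes
CREATE TABLE hugetable ( ... );    -- recreate the fact table, then bulk load:
-- COPY hugetable FROM '/path/to/data.csv' CSV;
-- ... recreate views and partial indexes ...
COMMIT;                            -- on any error, ROLLBACK restores everything
```

Because DDL is transactional in PostgreSQL, a failure anywhere before COMMIT leaves the original table, views, and indexes untouched.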
Nicolas PARIS
2015-03-06 11:25 GMT+01:00 Nicolas Paris <niparisco(at)gmail(dot)com>:
> Thanks Jim,
>
> My suggestion is to test using nothing but plain views and plain indexes
>> on the base table
>
> Actually the way I choose subset rows for views is complicated in terms of
> query. Then using simple views without partial indexes is terrible in terms
> of performance (I have tested that).
>
> You mean the materialized views, right?
>
> Well I have tested matviews, views without partial indexes, views with
> hashjoin on a key, ..
>
>
>
>> I think you'll ultimately be unhappy trying to go down this route, for
>> the reasons you mention, plus the very large amount of extra space you'll
>> be using. 800 booleans is 800 extra bytes for every row in your fact table.
>> That's a lot. Even if you used a bitmap instead (which means you have to
>> mess around with tracking which bit means what and probably other problems
>> as well) you're still looking at 100 bytes per row. That's nothing to
>> sneeze at.
>
>
> Since each subset is about 5% (this number is decreasing when number of
> views increase) of the fact table, most boolean rows are null. This means
> 5% of 800 extra bytes, right ? I have chosen smallint, because it
> decreases the CSV size on bulk load ("true" vs "1").
>
> For the views that are too slow, look at what the expensive part of the
>> view is and materialize *only that*.
>
> It would be great if I could, but all of this will be automatic, so it
> will be difficult to apply such rules, which demand human analysis and
> manual database modification, for one subset.
>
>
> Hope I have understood you well
>
>
>
> Nicolas PARIS
>
> 2015-03-06 10:32 GMT+01:00 Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>:
>
>> On 3/6/15 2:16 AM, Nicolas Paris wrote:
>>
>>> If you have that many different views I doubt you want that many
>>> indexes anyway.
>>>
>>> It's a datawarehouse, then each view is used by many user for each
>>> query.
>>> Those views must be subset of the huge material table. All indexes are
>>> needed
>>>
>>
>> Yes, but they don't have to be partial.
>>
>> Have you tried just hitting the base table and indexes directly,
>>> either through plain views or just direct SQL?
>>>
>>> I have tried each. The performances are worst querying on a subset
>>> (the views) than querying on whole huge table when using the huge indexes
>>>
>>
>> You mean the materialized views, right? If so, that makes sense: Instead
>> of having all your users hitting one common set of data (your fact table)
>> you had them hitting a bunch of other data (the mat views). But you still
>> had other stuff hitting the fact table. So now you were dealing with a lot
>> more data than if you just stuck to the single fact table.
>>
>> => this is the solution I am implementing. (800 is not true, but in 10
>>> years it maybe will be)
>>>
>>
>> In 10 years we'll all be using quantum computers anyway... ;P
>>
>> Actually, I have added a boolean column on the huge table for each
>>> views. This is the way each view is a subset of huge table (Create View
>>> as Select * FROM hugeTable WHERE columnX is true --etc 800 times). Then
>>> I create 800 partial indexes on that column (create index ... WHERE
>>> columnX is TRUE), for each view.
>>> This works great as the query planner chooses the partial indexes when
>>> querying the little subset of the terrific table (potential 20 billion
>>> rows)
>>>
>>> This is better than material views for some reasons :
>>> - saves places on hard drive (columnX is boolean +same indexes - data
>>> for MatViews)
>>> - saves time generating materialised views
>>>
>>
>> But this isn't better than the mat views because of a bunch of booleans;
>> it's better because it means less strain on the disk cache.
>>
>> This is quite more complicated because in the project, the number of
>>> view is increasing, and dynamic then :
>>> - then adding new mat views is simple
>>> - adding new views => adding new column on the huge table. It can take
>>> long time to update boolean for each tuple. Then I need to truncate/bulk
>>> load all data each time I add a new View. Other problem is dynamic
>>> number column table was a bit tricky to implement in an ETL soft such
>>> Talend, but the benefits are I hope great.
>>>
>>
>> I think you'll ultimately be unhappy trying to go down this route, for
>> the reasons you mention, plus the very large amount of extra space you'll
>> be using. 800 booleans is 800 extra bytes for every row in your fact table.
>> That's a lot. Even if you used a bitmap instead (which means you have to
>> mess around with tracking which bit means what and probably other problems
>> as well) you're still looking at 100 bytes per row. That's nothing to
>> sneeze at.
>>
>> My suggestion is to test using nothing but plain views and plain indexes
>> on the base table. I expect that some of those views will not perform
>> adequately, but many (or most) of them will be fine. For the views that are
>> too slow, look at what the expensive part of the view is and materialize *only
>> that*. I suspect you'll find that when you do that you'll discover that
>> several views are slow because of the same thing, so if you materialize
>> that one thing one time you can then use it to speed up several views.
>>
>> Using that approach means you'll have a lot less data that you have to
>> read.
>>
>> --
>> Jim Nasby, Data Architect, Blue Treble Consulting
>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>>
>
>