Re: PG 9.3 materialized view VS Views, indexes, shared memory

From: Nicolas Paris <niparisco(at)gmail(dot)com>
To:
Cc: pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: PG 9.3 materialized view VS Views, indexes, shared memory
Date: 2015-03-06 20:26:27
Message-ID: CA+ssMOT54r9Q349ZBRhuVg6n-3-cp0TpG8Piy5qCTLfumO_Yog@mail.gmail.com
Lists: pgsql-performance

According to this link
http://postgresql.nabble.com/NULL-saves-disk-space-td4344106.html
NULL values take no extra space as long as at least one other column in the
row is already NULL (the per-row NULL bitmap then already exists).
A boolean takes 1 byte, whereas a smallint takes 2 bytes.
So disk space is no longer a problem with boolean columns that are 95%
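As a back-of-envelope check, here is a sketch of the per-row storage math (an assumption-laden estimate, not an authoritative figure: it takes PostgreSQL's per-row NULL bitmap as ceil(ncols/8) bytes, 1 byte per non-NULL boolean, 2 bytes per non-NULL smallint, and ignores alignment padding):

```python
import math

# Hypothetical numbers from this thread: 800 flag columns (one per view),
# each subset covering ~5% of the rows.
n_cols = 800
fill = 0.05

null_bitmap = math.ceil(n_cols / 8)   # bytes/row once any column is NULL
bool_payload = n_cols * fill * 1      # expected bytes of non-NULL booleans
smallint_payload = n_cols * fill * 2  # same columns stored as smallint

print(null_bitmap)        # 100
print(bool_payload)       # 40.0
print(smallint_payload)   # 80.0
```

Note that with 800 columns the NULL bitmap alone is already about 100 bytes per row, which lines up with the bitmap figure Jim quotes in this thread.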

One thing that is really great with PostgreSQL is transactional DDL for
DROP TABLE ... CASCADE: on a ROLLBACK it restores all the dropped objects
(indexes, views) if a problem appears during loading.
I hope using one transaction to drop/load many tables is not a performance
issue?
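A minimal sketch of that rollback behaviour, using SQLite in place of PostgreSQL so the example is self-contained (both support transactional DDL; the table and index names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
con.execute("CREATE TABLE fact (id INTEGER, flag_v1 BOOLEAN)")
con.execute("CREATE INDEX fact_v1_idx ON fact (id) WHERE flag_v1 = 1")
con.execute("INSERT INTO fact VALUES (1, 1)")

con.execute("BEGIN")
con.execute("DROP TABLE fact")  # also drops the dependent index
con.execute("ROLLBACK")         # simulate a failed reload

# Table, index and data are all back after the rollback.
print(con.execute("SELECT count(*) FROM fact").fetchone()[0])  # 1
```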

Nicolas PARIS

2015-03-06 11:25 GMT+01:00 Nicolas Paris <niparisco(at)gmail(dot)com>:

> Thanks Jim,
>
> My suggestion is to test using nothing but plain views and plain indexes
>> on the base table
>
> Actually, the way I choose the subset rows for each view makes for
> complicated queries, so using simple views without partial indexes is
> terrible in terms of performance (I have tested that).
>
> You mean the materialized views, right?
>
> Well, I have tested matviews, views without partial indexes, views with a
> hash join on a key, ..
>
>
>> I think you'll ultimately be unhappy trying to go down this route, for
>> the reasons you mention, plus the very large amount of extra space you'll
>> be using. 800 booleans is 800 extra bytes for every row in your fact table.
>> That's a lot. Even if you used a bitmap instead (which means you have to
>> mess around with tracking which bit means what and probably other problems
>> as well) you're still looking at 100 bytes per row. That's nothing to
>> sneeze at.
>
>
> Since each subset is about 5% of the fact table (and this share decreases
> as the number of views grows), most of the boolean values are NULL. This
> means 5% of 800 extra bytes, right? I had chosen smallint because it
> shrinks the CSV for bulk loads ('1' vs 'true').
>
> For the views that are too slow, look at what the expensive part of the
>> view is and materialize *only that*.
>
> It would be great if I could, but everything has to be automatic, so it
> will be difficult to apply rules that require human analysis and manual
> database modification for each subset.
>
>
> Hope I have understood you correctly.
>
>
> Nicolas PARIS
>
> 2015-03-06 10:32 GMT+01:00 Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>:
>
>> On 3/6/15 2:16 AM, Nicolas Paris wrote:
>>
>>> If you have that many different views I doubt you want that many
>>> indexes anyway.
>>>
>>> It's a data warehouse, so each view is used by many users for their
>>> queries. Those views must be subsets of the huge fact table. All the
>>> indexes are needed.
>>>
>>
>> Yes, but they don't have to be partial.
>>
>> Have you tried just hitting the base table and indexes directly,
>>> either through plain views or just direct SQL?
>>>
>>> I have tried each. Performance is worse when querying a subset (the
>>> views) than when querying the whole huge table with its big indexes.
>>>
>>
>> You mean the materialized views, right? If so, that makes sense: Instead
>> of having all your users hitting one common set of data (your fact table)
>> you had them hitting a bunch of other data (the mat views). But you still
>> had other stuff hitting the fact table. So now you were dealing with a lot
>> more data than if you just stuck to the single fact table.
>>
>> => this is the solution I am implementing. (800 is not true yet, but in
>>> 10 years maybe it will be)
>>>
>>
>> In 10 years we'll all be using quantum computers anyway... ;P
>>
>> Actually, I have added a boolean column on the huge table for each
>>> view. This is how each view is a subset of the huge table (CREATE VIEW
>>> ... AS SELECT * FROM hugeTable WHERE columnX IS TRUE -- etc., 800 times).
>>> Then I create 800 partial indexes on that column (CREATE INDEX ... WHERE
>>> columnX IS TRUE), one for each view.
>>> This works great, as the query planner chooses the partial indexes when
>>> querying the little subset of this terrific table (potentially 20
>>> billion rows)
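The flag-column scheme described above could be sketched like this (SQLite stands in for PostgreSQL to keep the example runnable; `fact`, `flag_v1` and `v1` are hypothetical names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, payload TEXT, flag_v1 BOOLEAN)")
# Mark ~5% of the rows as members of view v1's subset; the rest stay NULL.
con.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                [(i, f"row{i}", 1 if i % 20 == 0 else None) for i in range(1000)])

# One flag column + one partial index + one plain view per subset.
con.execute("CREATE INDEX fact_v1_idx ON fact (id) WHERE flag_v1 = 1")
con.execute("CREATE VIEW v1 AS SELECT * FROM fact WHERE flag_v1 = 1")

print(con.execute("SELECT count(*) FROM v1").fetchone()[0])  # 50, i.e. 5%
# Whether the planner actually picks the partial index can be checked with:
for row in con.execute("EXPLAIN QUERY PLAN SELECT id FROM v1"):
    print(row)
```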
>>>
>>> This is better than materialized views for a few reasons:
>>> - it saves space on the hard drive (a boolean columnX plus the same
>>> indexes, versus the full data of the matviews)
>>> - it saves the time spent generating the materialized views
>>>
>>
>> But this isn't better than the mat views because of a bunch of booleans;
>> it's better because it means less strain on the disk cache.
>>
>> This is quite a bit more complicated because the number of views in the
>>> project keeps increasing, and is therefore dynamic:
>>> - adding new mat views is simple
>>> - adding new views => adding a new column on the huge table. It can take
>>> a long time to update the boolean for each tuple, so I need to
>>> truncate/bulk load all the data each time I add a new view. Another
>>> problem is that a table with a dynamic number of columns was a bit
>>> tricky to implement in an ETL tool such as Talend, but the benefits
>>> are, I hope, worth it.
>>>
>>
>> I think you'll ultimately be unhappy trying to go down this route, for
>> the reasons you mention, plus the very large amount of extra space you'll
>> be using. 800 booleans is 800 extra bytes for every row in your fact table.
>> That's a lot. Even if you used a bitmap instead (which means you have to
>> mess around with tracking which bit means what and probably other problems
>> as well) you're still looking at 100 bytes per row. That's nothing to
>> sneeze at.
>>
>> My suggestion is to test using nothing but plain views and plain indexes
>> on the base table. I expect that some of those views will not perform
>> adequately, but many (or most) of them will be fine. For the views that are
>> too slow, look at what the expensive part of the view is and materialize
>> *only that*. I suspect you'll find that when you do that you'll discover that
>> several views are slow because of the same thing, so if you materialize
>> that one thing one time you can then use it to speed up several views.
>>
>> Using that approach means you'll have a lot less data that you have to
>> read.
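Jim's "materialize only the expensive part" idea might look like the following sketch (hypothetical schema; SQLite is used so the example is self-contained; in PostgreSQL the middle table would be a matview refreshed at load time):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (id INTEGER, grp TEXT, amount REAL)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                [(i, "a" if i % 2 else "b", float(i)) for i in range(100)])

# Materialize the shared, expensive aggregation exactly once...
con.execute("CREATE TABLE grp_totals AS "
            "SELECT grp, sum(amount) AS total FROM fact GROUP BY grp")
con.execute("CREATE INDEX grp_totals_idx ON grp_totals (grp)")

# ...and let several cheap plain views reuse that one materialization.
con.execute("CREATE VIEW v_a AS SELECT * FROM grp_totals WHERE grp = 'a'")
con.execute("CREATE VIEW v_b AS SELECT * FROM grp_totals WHERE grp = 'b'")

print(con.execute("SELECT total FROM v_a").fetchone()[0])  # 2500.0
```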
>>
>> --
>> Jim Nasby, Data Architect, Blue Treble Consulting
>> Data in Trouble? Get it in Treble! http://BlueTreble.com
>>
>
>
