Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER

From: Erik Jones <erik(at)myemma(dot)com>
To: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
Cc: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER
Date: 2007-09-19 15:17:00
Message-ID: D5862DAF-FD0F-49FF-9338-6B1E4654293F@myemma.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sep 19, 2007, at 9:29 AM, Richard Broersma Jr wrote:

> --- Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
>
>> Then I am confused again about how the fill factor stuff works. Let's
>> say I have a table with four BTREE indexes. Should all of them have a
>> fill factor of about 60 (lower than the default 90, that is) to be
>> effective? Or will it help if I lower the fill factor on only a
>> couple
>> of the most often used ones? The primary key index is very, very
>> rarely updated so I don't need it to have a fill factor.
>>
>> I could try and see these one by one, but that's no better than
>> touching/feeling the database blind-folded. I would love to see some
>> writeup about this whole thing, but it seems hard to come by!
>
> I will try to explain everything that I understand about indexes
> and tables. I am sure that if
> some part of my understanding is incorrect, someone will chime in
> to correct me.
>
> In PostgreSQL, tables are physically distinct from indexes. This
> means that any give table is
> written to disk as a file(s), and indexes are also written to disk
> as a separate file(s). A table
> and index are both saved to disk in segmented block referred to a
> pages (I believe the default
> size is 8K).

Yes, that can be changed at compile time, although I don't think
I've ever heard of any advantages to doing that.

>
> The advantage of the index file is that it is significantly smaller
> in size, so it takes less time
> to sequentially scan and less time to read from disk.
>
> Now when you want to find a record, PostgreSQL will/(may choose to)
> sequentially scan the index
> until it find the record entry that corresponds with your
> criteria. This entry has a table
> cross-reference to the actual page that contains the record that is
> "pointed" at by the index.
> Lastly, the entire table page containing your record is read from
> disk to memory for further query
> processing.
>
> <ASSUMPTION>
> When you insert a record into a table that generates an entry into
> the b-tree index file,
> PostgreSQL will scan the pages of the index file to find the
> correct place and index page to add
> this entry. If the page is already full, PostgreSQL "probably"
> replaces the old full pages with
> two new pages with a distribution of that chunk of the B-tree
> index, and then adds the new entry
> to one of those pages. This operation can become very expensive if
> many new pages need to be
> created from single INSERT/UPDATE statement.
> </ASSUMPTION>

Yes, the point of B-trees is that they have fast lookup times, but
updates can be expensive when you have to re-balance your leaf nodes.

> By using fill factor, you are telling PostgreSQL to automatically
> leave a portion of any newly
> created index page partially blank for future use. When a newly
> created index entry needs to be
> saved, it can be stored in one of the "holes" left in the index page.

That future use is only for updates.

>
> A large fill factor create both advantages and dis-advantages. For
> writing operations, it is a
> big advantage because, a large fill factor will leave alot of holes
> for INSERT and UPDATE
> operations to use. This can help increase the number of UPDATE/
> INSERT per seconds that you server
> can handle since, they index pages do not have to be rebuilt very
> often.

Not quite. Once a page has reached it's fill factor percentage full,
no more inserts will happen on that page, only updates. Also, I
think you have large/small backwards wrt fill factor. If you have a
fill factor of, say, 40% then once a page has reached 40% full no
more inserts will happen (unless some space is reclaimed by vacuum).
So, smaller fill factors == bigger holes. The bigger the fill
factor, the smaller the whole: if you have a fill factor of 90%,
only 10% is reserved for updates of rows on that page.

>
> However, the disadvantage is that, a newly created index with a
> large fill factor has "index
> bloat" designed into it. This mean that the index pages have a
> large portion of holes. So
> PostgreSQL will create more index pages than it would normally in
> order to hold both your index
> and the pre-defined holes that you specified by your fill-factor.
> Larger indexes require more
> time to be read from disk to memory and will require more time to
> sequentially scan to find to
> find the cross-reference table page location of records of
> interest. So the net effect is that
> larger indexes will make SELECT statement slower.
>
> This is my understanding for tables indexes and fill factor. I hope
> it helps.

Again, with the large v. small fill factor point. Using fill factor
seems to be a trade-off between space and update efficiency. Let's
say that after so many (potential) updates you know that each row
will become static, i.e. no more updates will happen. Let's use some
numbers and make them easy to work with. Say each row will be
updated exactly once and you use a fill factor of 50%. Now, say 5K
of fresh index data is written. The first 4K will go into one page
at which point it has hit it's 50% fill factor threshold and the
other 1K has to go into another page. Now, let's say each tuple in
the index is updated, the first page is now at 100% full as all of
the updates for the first page filled up the other 50%. The page is
now full until you vacuum. However, once you vacuum and the first 4K
that was originally inserted is freed, that 4K will never get used
again as the page is back down to 50% so no more inserts can happen
on that page, and we know that each tuple would only be updated that
once and you're left with 50% "bloat". So, you can see that for
tables/indexes that aren't going to see a lot of updates to the same
tuples something as low as 50% is probably a pretty aggressive fill
factor. In fact, I would say that you shouldn't really even bother
playing around with the fill factor unless you know that the table/
index sees a lot of updates. For my example where each tuple sees
only one update, the index default fill factor of 90% is probably fine.

Erik Jones

Software Developer | Emma®
erik(at)myemma(dot)com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Richard Broersma Jr 2007-09-19 15:30:15 Re: For index bloat: VACUUM ANALYZE vs REINDEX/CLUSTER
Previous Message Scott Marlowe 2007-09-19 15:06:36 Re: Q:Aggregrating Weekly Production Data. How do you do it?