Re: index bloat

From: "David Esposito" <pgsql-general(at)esposito(dot)newnetco(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: index bloat
Date: 2005-07-13 20:53:25
Message-ID: 200507132053.j6DKrOs8003872@relay2.nnco.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Sent: Wednesday, July 13, 2005 3:45 PM
>
> Hmm, this is preferentially touching stuff near the right end of the
> index, ie, it's going to bloat the pages associated with higher keys.
> As I understand your usage of these indexes, pages generally only get
> reclaimed off the left end (as records get old enough to be moved to
> archival storage). So if you were to repeat this test for long enough
> for the data to cycle all the way through the table and out again
> (50 days in your real-world usage) then the extra space would be
> evenly distributed and the usage would reach a steady state.

So I guess the test of this would be whether the index usage increases after
the 50 day mark passes ... Our last REINDEX was on June 5th so that should
mean that things would level off on July 25th ...

What you're saying is that when VACUUM runs nightly, it clears out all of
the dead row versions from the indexes ... but since the bulk of my dead row
versions are all clustered around the right side of the index, that means
that those pages become very sparsely populated with data ... as the index
ages (assuming time-zero is the day I did my REINDEX) the index becomes
exclusively populated with those sparse pages ... and the way new keys are
inserted into the index is to always add them to a new page (where the 'new'
page is either a truly new page, or a page that is completely empty), rather
than using up some of the fragmented space within existing pages?

and this must differ from the way that space is allocated for the actual
table record otherwise i would be witnessing the same bloat with the table
entity itself?

out of curiosity, why is it necessary for MVCC to create copies of the index
keys if the value of the indexed field doesn't change with an UPDATE to the
underlying record? (perhaps answering my own question: because the entry in
the index is the key and the rowid that it's pointing to .. so when you
UPDATE the row, you have to create a new index entry to point at the new
rowid)

>
> The testing I've been doing so far involves UPDATEs that touch a
> uniformly distributed subset of the table --- maybe that's the aspect
> that is failing to match your reality. Do you mostly update
> recently-added rows? Can you quantify the effect at all?

This is probably a good point ... The histogram of UPDATE activity to the
table probably looks something like

90% - records created < 24 hours ago
7% - records created 1 - 2 days ago
2% - records created 2 - 7 days ago
1% - records older than 7 days

I don't have a way to easily quantitatively confirm this, but based on what
we see in our reports and what we recommend to customers (as far as when
they can consider a mailing "closed"), we usually recommend considering the
48-hour mark as the end of the vast majority of the activity ..

Thanks again,
Dave

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Richard_D_Levine 2005-07-13 20:53:26 Re: chosing a database name
Previous Message Bob 2005-07-13 20:50:37 Re: To Postgres or not