Re: clustering without locking

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: clustering without locking
Date: 2008-05-03 18:19:14
Message-ID: 481CACA2.9040707@postnewspapers.com.au
Lists: pgsql-general

Tom Lane wrote:
> Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
>> Later on, though, less new space would have to be allocated because more
>> and more of the space allocated earlier to hold moved tuples would be
>> being freed up in useful chunks that could be reused.
>
> I don't see how that works. If the minimum size of the table is X
> pages, ISTM that the first pass has to push everything up to pages above
> X.

What I was suggesting was to essentially cluster the table in small
parts. Rather than two huge passes (move all tuples to free or newly
allocated space at the end of the table; then move them back into their
old locations in order), it'd be done in a series of smaller moves.

After moving a chunk out of the way into free/new space at the end of
the table, data from later in the table would be copied into the freed
space. That space could then be re-used to hold data from the next
chunk that needs to be moved out of the way.

I'm just trying to understand if it can actually work. Sorry if my
attempted explanations are unclear; I'm probably doing a terrible job,
and it's probably actually a stupid idea anyway (if nothing else it
might just be too slow). Nonetheless I'm curious. Maybe I can explain
another way.

Visually:

`0' to `9' : tuples. Desired eventual cluster order is face value.
`.' : Dead/obsoleted tuple not yet marked reusable by VACUUM
` ' : free space

Initial state:

---------
584736120
---------

Begin a transaction and move the first chunk out of the way (2 tuples
in this case, but obviously many more in a real case):

-----------
..473612058
-----------

Use that freed space to store the first ordered tuples:

-----------
014736.2.58
-----------

Commit; then, once the last transaction that can still see the "."
tuples above has completed, mark them as free with VACUUM or similar:

-----------
014736 2 58
-----------
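
For concreteness, here's a rough user-level approximation of one such
pass in SQL, assuming a hypothetical table t being clustered on id.
Plain SQL gives no control over where a new tuple version lands (and a
HOT update may even keep it on the same page), so this only mimics the
move/commit/reclaim rhythm, not the careful placement the real scheme
would need:

  BEGIN;
  -- Rewriting the rows creates new tuple versions, which go into free
  -- space or newly allocated pages at the end of the table; the old
  -- versions are left behind as dead tuples (the "." above).
  UPDATE t SET id = id WHERE id IN (5, 8);
  COMMIT;

  -- Once no transaction can still see the old versions, reclaim them.
  VACUUM t;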

Repeat, but now use the freed space, rather than extending the table,
to store the next set of tuples that must be moved out of the way:

-----------
01..3642758
-----------
-----------
0123.64.758
-----------
-----------
0123 64 758
-----------

During the next pass, someone inserts `9' after tuples have been moved
to make a hole and others have been moved into the hole, but before the
old locations of the moved tuples are marked as free:

-----------
0123 .46758
-----------
-----------
012345.67.8
-----------
------------
012345.67.89 <- INSERT 9
------------
------------
012345 67 89
------------

You'd never land up with this sort of convenient ordering halfway
through in a real case with realistic numbers of tuples, so it'd keep
going, doing small chunks each time, until the whole table had been
processed.
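
Driven from a client script, the whole progressive cluster would just
be that pass repeated over successive key ranges, with a VACUUM between
passes so each pass can reuse the space the previous one freed. A
sketch, again with the hypothetical t and made-up chunk boundaries:

  BEGIN; UPDATE t SET id = id WHERE id >= 0    AND id < 1000; COMMIT;
  VACUUM t;
  BEGIN; UPDATE t SET id = id WHERE id >= 1000 AND id < 2000; COMMIT;
  VACUUM t;
  -- ...and so on, one small key range per pass, until the whole key
  -- range has been walked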

So, the table does grow, and its final state does contain dead space at
the end, but not *too* much of it:

------------
0123456789
------------

If "low" values are inserted late in the progressive cluster they can
just stay at the end of the table. They'll get moved if the user runs a
progressive cluster operation again later. However, since you're ideally
doing this with a non-100% fillfactor, they should land up in roughly
the right place on initial insert rather than at the end of the table,
avoiding the issue (and helping avoid having newly inserted tuples
prevent table truncation by vacuum when the progressive clustering
finishes).

Say a table containing the range 1-9 was being clustered with a 50%
fillfactor and was about halfway through the process:

-------------
1 2 3 47986
-------------

and someone inserts `0'; it should ideally land up in the right place
anyway (a bit of reading suggests that insert tries to respect cluster
order):

-------------
01 2 3 47986
-------------

If you're re-clustering a table with a fillfactor set, you may not
need to extend it at all, because you can use already-allocated free
space to store tuples temporarily while moving them around.
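
For reference, the fillfactor assumed here is the ordinary per-table
storage parameter; a minimal sketch, again with the hypothetical t and
an index t_id_idx on the cluster column:

  -- Keep 50% of each page free for later inserts and updates. This
  -- affects only pages written from now on; existing pages keep their
  -- current layout until they're rewritten.
  ALTER TABLE t SET (fillfactor = 50);

  -- A conventional (exclusive-locking) CLUSTER rewrite would then lay
  -- the table out with that much slack on every page:
  CLUSTER t USING t_id_idx;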

> You can't put any temporary copies in pages <= X because you might
> need that space when it comes time to make the clustering happen. So
> the table is going to bloat to (at least) 2X pages.

Yes, if you perform the whole process in a single huge chunk. What I was
hoping was that it was possible to do it in a series of smaller
operations instead, avoiding the need for such a huge amount of bloat at
the end of the table.

Say you're clustering in 10 steps. You need X*0.1 worth of scratch
space, created by extending the table if there's not already suitable
free space late in the table. Assuming you can reclaim all the space
used for temporary copies of tuples after each pass, your ideal final
table size is X*1.1 + (space used by new inserts), of which X*0.1 is
free space at the end of the table. That free space will probably have
scattered rows from concurrent inserts, though with a non-100%
fillfactor it might not.
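
To put numbers on that: with X = 100,000 pages and 10 steps, each pass
needs about 10,000 pages of scratch space. If that space really is
reclaimed after every pass, the table ends at roughly 110,000 pages
plus concurrent inserts, of which about 10,000 pages are free at the
end, instead of the 200,000-page peak of the single-pass approach.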

It seems like it should work, *if* space used for temporary storage
can be efficiently reclaimed for reuse (or new inserts) after each
pass. Even if it can work, though, I don't know whether it'd actually
be useful, given its effect on indexes and how slow it might land up
being.

--
Craig Ringer
