Re: [HACKERS] mdnblocks is an amazing time sink in huge relations

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations
Date: 1999-10-19 03:10:54
Message-ID: 1689.940302654@sss.pgh.pa.us
Lists: pgsql-hackers

"Hiroshi Inoue" <Inoue(at)tpf(dot)co(dot)jp> writes:
>> a shared cache for system catalog tuples, which might be a win but I'm
>> not sure (I'm worried about contention for the cache, especially if it's
>> protected by just one or a few spinlocks). Anyway, if we did have one
>> then keeping an accurate block count in the relation's pg_class row
>> would be a practical alternative.

> But there would be a problem if we use a shared catalog cache.
> System tuples being updated are visible only to the updating backend,
> while other backends should see only committed tuples.
> On the other hand, an accurate block count should be visible to all
> backends.
> Which tuple of a row should we load into the catalog cache and update?

Good point --- rolling back a transaction would cancel changes to the
pg_class row, but it mustn't cause the relation's file to get truncated
(since there could be tuples of other uncommitted transactions in the
newly added block(s)).

This says that having a block count column in pg_class is the Wrong
Thing; we should get rid of relpages entirely. The Right Thing is a
separate data structure in shared memory that stores the current
physical block count for each active relation. The first backend to
touch a given relation would insert an entry, and then subsequent
extensions/truncations/deletions would need to update it. We already
obtain a special lock when extending a relation, so it seems there'd
be no extra locking cost to maintain a table like this.
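A minimal sketch of what such a shared table might look like. All names here (RelLenEntry, rellen_*) are hypothetical, not actual PostgreSQL code; a real implementation would live in shared memory and use the backend's spinlock primitives, for which a single pthread mutex stands in below:

```c
/* Hypothetical sketch: a shared table mapping relation OIDs to their
 * current physical block counts.  Fixed-size, open-addressed; the
 * first backend to touch a relation inserts its entry. */
#include <pthread.h>
#include <stdint.h>

#define RELLEN_SLOTS 256            /* fixed table size for the sketch */

typedef struct {
    uint32_t relid;                 /* relation OID; 0 = empty slot */
    uint32_t nblocks;               /* current physical block count */
} RelLenEntry;

static RelLenEntry rellen_table[RELLEN_SLOTS];
static pthread_mutex_t rellen_lock = PTHREAD_MUTEX_INITIALIZER;

/* Find or create the entry for a relation (linear probing).
 * Caller must hold rellen_lock. */
static RelLenEntry *rellen_lookup(uint32_t relid)
{
    for (uint32_t i = 0; i < RELLEN_SLOTS; i++) {
        RelLenEntry *e = &rellen_table[(relid + i) % RELLEN_SLOTS];
        if (e->relid == relid)
            return e;
        if (e->relid == 0) {        /* first touch: insert entry */
            e->relid = relid;
            e->nblocks = 0;
            return e;
        }
    }
    return NULL;                    /* table full */
}

/* Called while holding the relation-extension lock: bump the count
 * and return the new length in blocks. */
uint32_t rellen_extend(uint32_t relid, uint32_t nblocks_added)
{
    uint32_t result = 0;
    pthread_mutex_lock(&rellen_lock);
    RelLenEntry *e = rellen_lookup(relid);
    if (e) {
        e->nblocks += nblocks_added;
        result = e->nblocks;
    }
    pthread_mutex_unlock(&rellen_lock);
    return result;
}

/* Read the current block count (0 if never extended). */
uint32_t rellen_get(uint32_t relid)
{
    pthread_mutex_lock(&rellen_lock);
    RelLenEntry *e = rellen_lookup(relid);
    uint32_t n = e ? e->nblocks : 0;
    pthread_mutex_unlock(&rellen_lock);
    return n;
}
```

A single lock suffices for a sketch, but given the contention worry raised earlier in the thread, a real version would likely want per-entry or per-bucket locking.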

Anyone up for actually implementing this ;-) ? I have other things
I want to work on...

>> Well, it seems to me that the first misbehavior (incomplete delete becomes
>> a partial truncate, and you can try again) is a lot better than the
>> second (incomplete delete leaves an undeletable, unrecreatable table).
>> Should I go ahead and make delete/truncate work back-to-front, or do you
>> see a reason why that'd be a bad thing to do?

> I also think back-to-front is better.

OK, I have a couple other little things I want to do in md.c, so I'll
see what I can do about that. Even with a shared-memory relation
length table, back-to-front truncation would be the safest way to
proceed, so we'll want to make this change in any case.
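The back-to-front idea can be sketched as follows. This assumes md.c's segment-file naming (base, base.1, base.2, ...); the function names and error convention are mine, not the actual md.c code. The point is that unlinking from the highest-numbered segment down means an interrupted delete leaves a contiguous prefix of segments, i.e. something that looks like a partial truncate and can simply be retried:

```c
/* Hypothetical sketch of back-to-front segment removal. */
#include <stdio.h>
#include <unistd.h>

/* Build the filename of segment `segno` of relation file `base`,
 * following the base, base.1, base.2, ... convention. */
static void md_segpath(char *buf, size_t buflen, const char *base, int segno)
{
    if (segno == 0)
        snprintf(buf, buflen, "%s", base);
    else
        snprintf(buf, buflen, "%s.%d", base, segno);
}

/* Delete segments [0 .. nsegs-1] from highest to lowest.  Returns -1
 * on success, or the number of the first segment that could not be
 * removed -- in which case all lower segments are still intact and
 * the delete can be retried later. */
int md_unlink_backwards(const char *base, int nsegs)
{
    char path[1024];
    for (int segno = nsegs - 1; segno >= 0; segno--) {
        md_segpath(path, sizeof(path), base, segno);
        if (unlink(path) != 0)
            return segno;       /* retryable: lower prefix still valid */
    }
    return -1;                  /* all segments removed */
}
```

Deleting front-to-back instead would, on failure, strand high-numbered segments behind a missing base file, giving exactly the undeletable, unrecreatable table described above.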

> Deletion is necessary only to avoid consuming disk space.
>
> For example, vacuum could remove the not-yet-deleted files.

Hmm ... interesting idea ... but I can hear the complaints
from users already...

regards, tom lane
