From: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: [PATCHES] Cleaning up unreferenced table files |
Date: | 2005-05-10 20:29:22 |
Message-ID: | Pine.OSF.4.61.0505102211560.368341@kosh.hut.fi |
Lists: | pgsql-hackers pgsql-patches |
On Sun, 8 May 2005, Tom Lane wrote:
> While your original patch is buggy, it's at least fixable and has
> localized, limited impact. I don't think these schemes are safe
> at all --- they put a great deal more weight on the semantics of
> the filesystem than I care to do.
I'm going to try this some more, because I feel that a scheme like this
that doesn't rely on scanning pg_class and the file system would in fact
be safer.
The key is to A) obey the "WAL first" rule, and B) remember information
about file creations over a checkpoint. The problem with my previous
suggestion was that it didn't reliably accomplish either :).
Right now we break the WAL rule because the file creation is recorded
after the file is created. And the record is not flushed.
The trivial way to fix that is to write and flush the xlog record before
actually creating the file. (for a more optimized way to do it, see end of
message). Then we could trust that there aren't any files in the data
directory that don't have a corresponding record in WAL.
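
A minimal sketch of that ordering, written in Python rather than backend C to keep it short. `ToyWal`, `log_and_flush` and `create_relation_file` are hypothetical names used only for illustration and are not PostgreSQL functions; the point is only that the record is durable before the file appears.

```python
import os
import tempfile


class ToyWal:
    """Hypothetical append-only log standing in for the xlog."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def log_and_flush(self, record):
        # Write the record and force it to disk before returning.
        self.f.write(record + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())

    def records(self):
        with open(self.path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]


def create_relation_file(wal, xid, path):
    # WAL first: the creation record is durable before the file exists,
    # so a crash can never leave a file that has no record.
    wal.log_and_flush(f"CREATE {xid} {path}")
    # Mode "x" fails if the file already exists.
    with open(path, "x"):
        pass
```

If the process dies between the two steps, the log names a file that was never created, which replay can treat as nothing to clean up.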
But that's not enough. If a checkpoint occurs after the file is
created, but before the transaction ends, WAL replay starts from that
checkpoint and never sees the file creation record. That's why we need a
mechanism to carry the information over the checkpoint.
We could do that by extending the ForwardFsyncRequest function or by
creating something similar to that. When a backend writes the file
creation WAL record, it also sends a message to the bgwriter that says
"I'm xid 1234, and I have just created file foobar/1234" (while holding
CheckpointStartLock). Bgwriter keeps a list of xid/file pairs like it
keeps a list of pending fsync operations. On checkpoint, the checkpointer
scans the list and removes entries for transactions that have already
ended, and attaches the remaining list to the checkpoint record.
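
The bookkeeping described above could look roughly like this. It is a sketch under assumptions: `PendingCreations` and the `is_in_progress` callback are hypothetical, and locking (the CheckpointStartLock part) is left out entirely.

```python
class PendingCreations:
    """Hypothetical list of xid/file pairs kept by the bgwriter."""

    def __init__(self):
        self._pairs = []  # (xid, path) in arrival order

    def remember(self, xid, path):
        # Called when a backend reports "I'm xid N, I created file P".
        self._pairs.append((xid, path))

    def snapshot_for_checkpoint(self, is_in_progress):
        # Drop entries whose transaction has already ended; what is left
        # is the list that would be attached to the checkpoint record.
        self._pairs = [(x, p) for (x, p) in self._pairs if is_in_progress(x)]
        return list(self._pairs)
```

For example, if xids 1234 and 1235 have each reported a file and only 1235 is still running at checkpoint time, the snapshot contains only the entry for 1235.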
WAL replay would start with the xid/file list in the checkpoint record,
and update it during the replay whenever a file creation or a transaction
commit/rollback record is seen. On a rollback record, files created by
that transaction are deleted. At the end of WAL replay, the files that are
left in the list belong to transactions that implicitly aborted, and can
be deleted.
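
The replay rules above can be sketched as follows. The record layout (tuples of `("create", xid, path)`, `("commit", xid)`, `("abort", xid)`) and the `delete_file` callback are assumptions made for the example, not the real xlog record format.

```python
def replay(checkpoint_pairs, records, delete_file):
    """Replay file-creation bookkeeping from a checkpoint onwards."""
    # Start from the xid/file list stored with the checkpoint.
    pending = {}
    for xid, path in checkpoint_pairs:
        pending.setdefault(xid, []).append(path)

    for rec in records:
        kind, xid = rec[0], rec[1]
        if kind == "create":
            pending.setdefault(xid, []).append(rec[2])
        elif kind == "commit":
            # Committed: the files are legitimate, stop tracking them.
            pending.pop(xid, None)
        elif kind == "abort":
            # Rolled back: remove the files this transaction created.
            for path in pending.pop(xid, []):
                delete_file(path)

    # Anything still listed belongs to a transaction that implicitly
    # aborted in the crash, so its files can go as well.
    for paths in pending.values():
        for path in paths:
            delete_file(path)
```

A transaction that was in the checkpoint list and later commits keeps its files; one with no commit or rollback record by the end of the log loses them.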
If we don't want to extend the checkpoint record, a separate WAL record
works too.
Now, the more optimized way to do A:
Delay the actual file creation until it's first written to. The write
needs to be WAL logged anyway, so we would just piggyback on that.
Implemented this way, I don't think there would be a significant
performance hit from the scheme. We would create more ForwardFsyncRequest
traffic, but not much compared to the block fsync requests we have right
now.
BTW: If we allowed mdopen to create the file if it doesn't exist already,
would we need the current file creation xlog record for anything? (I'm
not suggesting we do that, just trying to get more insight)
- Heikki