From: | Jesper Krogh <jesper(at)krogh(dot)cc> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: mosbench revisited |
Date: | 2011-08-08 15:11:12 |
Message-ID: | 4E3FFC90.5050708@krogh.cc |
Lists: | pgsql-hackers |
On 2011-08-08 15:29, Robert Haas wrote:
> On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine<dimitri(at)2ndquadrant(dot)fr> wrote:
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>> It would be nice if the Linux guys would fix this problem for us, but
>>> I'm not sure whether they will. For those who may be curious, the
>>> problem is in generic_file_llseek() in fs/read-write.c. On a platform
>>> with 8-byte atomic reads, it seems like it ought to be very possible
>>> to read inode->i_size without taking a spinlock. A little Googling
>>> around suggests that some patches along these lines have been proposed
>>> and - for reasons that I don't fully understand - rejected. That now
>>> seems unfortunate. Barring a kernel-level fix, we could try to
>>> implement our own cache to work around this problem. However, any
>>> such cache would need to be darn cheap to check and update (since we
>>> can't assume that relation extension is an infrequent event) and must
>>> somehow avoid having the same sort of mutex contention that's killing the
>>> kernel in this workload.
>> What about making the relation extension much less frequent? It's been
>> talked about before here, that instead of extending 8kB at a time we
>> could (should) extend by much larger chunks. I would go as far as
>> preallocating the whole next segment (1GB) (in the background) as soon
>> as the current is more than half full, or such a policy.
>>
>> Then you have the problem that you can't really use lseek() anymore to
>> guess'timate a relation size, but Tom said in this thread that the
>> planner certainly doesn't need something that accurate. Maybe the
>> reltuples would do? If not, it could be that some adapting of its
>> accuracy could be done?
> I think that pre-extending relations or extending them in larger
> increments is probably a good idea, although I think the AMOUNT of
> preallocation you just proposed would be severe overkill. If we
> extended the relation in 1MB chunks, we'd reduce the number of
> relation extensions by more than 99%, and with far less space wastage
> than the approach you are proposing.
Pre-extending in bigger chunks has other benefits as well, since it
helps the filesystem (if it supports extents) lay out the relation's
data in sequential order on disk.
On a well-filled relation, running filefrag on an ext4 filesystem shows
that data loaded during initial creation ends up in 10-11 extents per
1GB file, whereas a relation filled over time has as many as 128 extents.
I would suggest extending by 5% of the current relation size or by
25-100MB, whichever is smaller. That would still keep the size down on
small relations.
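To make the suggested policy concrete, here is a minimal sketch of how the
extension size could be computed. The function name extension_bytes and the
cap_bytes parameter are hypothetical and not part of PostgreSQL; BLCKSZ is
assumed to be the default 8kB block size, and rounding up to at least one
whole block is my own assumption.

```c
#include <stdint.h>

#define BLCKSZ 8192             /* assumed: default PostgreSQL block size */

/*
 * Hypothetical helper, not an existing PostgreSQL function.
 *
 * Returns the number of bytes to add on the next relation extension:
 * 5% of the current relation size, limited to cap_bytes (somewhere in
 * the 25-100MB range), rounded up to whole blocks and never less than
 * one block.
 */
static uint64_t
extension_bytes(uint64_t rel_bytes, uint64_t cap_bytes)
{
    uint64_t    want = rel_bytes / 20;  /* 5% of the current size */
    uint64_t    blocks;

    if (want > cap_bytes)
        want = cap_bytes;               /* whichever is smaller */

    blocks = (want + BLCKSZ - 1) / BLCKSZ;  /* round up to whole blocks */
    if (blocks == 0)
        blocks = 1;                     /* always extend by something */

    return blocks * BLCKSZ;
}
```

With a 25MB cap, a 100MB relation would grow by 5MB at a time, while a full
1GB segment would hit the cap and grow by 25MB. An empty relation still
grows by a single 8kB block, so small relations stay small.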
--
Jesper