From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-07-05 16:44:42
Message-ID: CAHyXU0ysTUJtZPKRqQctShpysn0QYyhUFDe1b+jcMVt58MqZ0w@mail.gmail.com
Lists: pgsql-hackers
On Thu, Jun 30, 2011 at 3:42 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>> I think the basic problem is that the cache pages are so large. It's
>>>> hard to make them smaller because that increases the cost of accessing
>>>> the cache, as you already noted, but at the same time, making an
>>>> eviction decision on a cache that holds 64K entries based on 100 data
>>>> points seems like waving around a pretty large-caliber weapon in a
>>>> fairly uncontrolled fashion. Maybe it works OK 90% of the time, but
>>>> it's hard to avoid the niggling fear that someone might accidentally
>>>> get shot.
>>>
>>> Right...it's essentially a rolling statistical sampling of transaction
>>> IDs based on past accesses. Going from, say, 100 to 1000 will probably
>>> drop your error margin quite a bit, but I really wonder how much benefit
>>> you can get from going past that.
>>
>> An interesting idea might be to forcibly cause a replacement every 100
>> tuples (or every 1000 tuples) and see if that causes a noticeable
>> performance regression. If it doesn't, that's a good data point.
>
> To test this I disabled the cache check on top of
> HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
> bogus xid (so every tuple scanned was treated as a 'miss'). Scanning 1
> million records in memory over and over went up a few percentage
> points -- converging on about 280ms instead of 270ms. This is with
> the miss array bumped to 1000 entries:
>
> oprofile said:
> regular hinted tuple case:
> 120 10.2041 advance_aggregates
> 107 9.0986 heapgettup_pagemode
> 77 6.5476 ExecProject
> 74 6.2925 heapgetpage
> 72 6.1224 ExecProcNode
> 72 6.1224 ExecScan
> 69 5.8673 advance_transition_function
> 66 5.6122 heap_getnext
> 58 4.9320 HeapTupleSatisfiesMVCC
>
> hinted tuple with pathological cache entry:
> 111 8.9588 advance_aggregates
> 109 8.7974 heapgettup_pagemode
> 85 6.8604 ExecProject
> 81 6.5375 heapgetpage
> 77 6.2147 ExecScan
> 70 5.6497 ExecStoreTuple
> 70 5.6497 HeapTupleSatisfiesMVCC
>
> This was a fairly short profiling run, but the numbers are consistent.
> The replacement is cheap...sorting 1000 ints doesn't take all that
> long, and 100 is even faster.
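
For reference, here's a standalone sketch of the sort cost in question --
just a plain qsort over a miss array of the sizes discussed above. The
array contents and names are illustrative, not taken from the patch:

#include <stdio.h>
#include <stdlib.h>

#define MISS_ARRAY_SIZE 1000        /* the "bumped to 1000" case above */

/* comparator for unsigned int "xids" */
static int
cmp_xid(const void *a, const void *b)
{
    unsigned int xa = *(const unsigned int *) a;
    unsigned int xb = *(const unsigned int *) b;

    return (xa > xb) - (xa < xb);
}

int
main(void)
{
    unsigned int misses[MISS_ARRAY_SIZE];
    int          i;

    /* fill with pseudo-random values standing in for recorded cache misses */
    for (i = 0; i < MISS_ARRAY_SIZE; i++)
        misses[i] = (unsigned int) rand();

    /* this is the sort paid once per replacement cycle */
    qsort(misses, MISS_ARRAY_SIZE, sizeof(unsigned int), cmp_xid);

    printf("sorted %d entries; min=%u max=%u\n",
           MISS_ARRAY_SIZE, misses[0], misses[MISS_ARRAY_SIZE - 1]);
    return 0;
}

Sorting 1000 ints like this should run on the order of tens of
microseconds on current hardware, which lines up with "fairly cheap".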
>
>> I think the sour spot for this whole idea is likely to be the case
>> where you have a workload that is 90% read and 10% write, or something
>> like that. On an all-read workload (like pgbench -S), you're
>> hopefully going to converge to a state where all the hint-bits are
>> set, once VACUUM kicks in. And on an all-write workload I think that
>> you might lose the effect in the noise. Not sure if we have an easy
>> way to set up such a workload with pgbench, though.
>
> It's trivial to test with pgbench -- just use the random number
> facility to generate an id for some table and update where random() >
> .9. However, this does not generate nearly enough 'misses' to
> exercise cache replacement in any meaningful way. My workstation vm
> can punch out about 5k transactions/sec. With only 10% getting a new
> xid and writing back a tuple to the table, only 500 xids are getting
> generated per second. At that rate it takes quite a while to burn
> through the 256k transactions the cache ranges over. Also, this case
> is not bound by scan performance anyway, making profiling it a little
> silly -- HeapTupleSatisfiesMVCC is simply not as big a bottleneck in
> OLTP workloads. Scan-heavy loads are where this matters, and
> scan-heavy loads naturally tend to generate fewer xids per tuple
> scanned.
>
>>>> Just for the sake of argument, let's suppose we made an array with 64K
>>>> elements, each one representing one 64K chunk of the XID space. Each
>>>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>>>> memory... probably not good, but I'm just thinking out loud here.
>>>> Every time we're asked about an XID, we bump the corresponding
>>>> counter. Periodically (say, every 64K probes) we zip through all the
>>>> counters and consider performing a cache replacement. We look at the
>>>> not-already-cached page that has the highest counter value and compare
>>>> it to the counter value for the cached pages. If it is higher and the
>>>> difference exceeds some safety margin (to protect against hysteresis),
>>>> then we do the replacement. I think something like this would allow
>>>> you to have a lot more information available to make a direct
>>>> apples-to-apples comparison at cache replacement time, as compared
>>>> with what you have now.
>>>
>>> Yeah -- what you've done here is basically two things: 1. by mapping
>>> static ranges you get to skip the sort (but not the scan) during the
>>> replacement, and 2. by vastly expanding the sampling size you
>>> eliminate thrashing via noise in the rolling sample. This comes at a
>>> significant memory cost and a loss of some nimbleness in the cache (I'm
>>> not completely sure more aggressive replacement is 'bad') -- although
>>> that could be mitigated by replacing at more frequent intervals.
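
To make the bookkeeping concrete, here's a rough sketch of that
counter-per-XID-range scheme. The sizes, names, and the safety margin
are illustrative guesses, not code from any patch, and the actual cache
page load is elided:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define XID_CHUNK_BITS    16        /* 64K xids per chunk */
#define NUM_CHUNKS        65536     /* 64K chunks => 256kB of uint32 counters */
#define PROBES_PER_SWEEP  65536
#define NUM_CACHE_PAGES   4
#define SAFETY_MARGIN     1024      /* hysteresis guard; value is arbitrary */

static uint32_t chunk_hits[NUM_CHUNKS];
static uint32_t cached_chunk[NUM_CACHE_PAGES];  /* chunk held by each cache page */
static uint32_t probe_count;

static bool
chunk_is_cached(uint32_t chunk)
{
    int i;

    for (i = 0; i < NUM_CACHE_PAGES; i++)
        if (cached_chunk[i] == chunk)
            return true;
    return false;
}

/* called once per xid we are asked about from the satisfies routine */
void
note_xid_probe(uint32_t xid)
{
    chunk_hits[xid >> XID_CHUNK_BITS]++;

    if (++probe_count < PROBES_PER_SWEEP)
        return;
    probe_count = 0;

    /* sweep: find hottest uncached chunk and coldest cached page */
    {
        uint32_t best_chunk = 0, best_hits = 0;
        uint32_t victim_hits = UINT32_MAX;
        int      victim = 0;
        int      i;

        for (i = 0; i < NUM_CHUNKS; i++)
        {
            if (!chunk_is_cached(i) && chunk_hits[i] > best_hits)
            {
                best_chunk = i;
                best_hits = chunk_hits[i];
            }
        }

        for (i = 0; i < NUM_CACHE_PAGES; i++)
        {
            if (chunk_hits[cached_chunk[i]] < victim_hits)
            {
                victim = i;
                victim_hits = chunk_hits[cached_chunk[i]];
            }
        }

        /* replace only if the winner beats the victim by a clear margin */
        if (best_hits > victim_hits + SAFETY_MARGIN)
            cached_chunk[victim] = best_chunk;  /* plus the actual page load */

        memset(chunk_hits, 0, sizeof(chunk_hits));
    }
}

Whether a 64K-counter sweep every 64K probes is cheap enough in practice
is exactly the kind of thing that needs the empirical data mentioned below.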
>>
>> Well, that gets at an interesting question: how often SHOULD we
>> replace cache pages? My gut is that it's 10K-100K and your gut is
>> that it's 100-1000, which is a hundred-fold difference. Seems like we
>> need some kind of empirical data to decide what actually works well in
>> real life.
>
> I think using a smaller 'n' for a sort-based replacement is on solid
> mathematical grounds. Amortized replacement performance is (lg(n) +
> (k/n)) * q, where q is the miss rate and k is the page replacement
> cost. k/n is close to zero for n >= 100, so it's really lg(n) * q.
> From this we can deduce that since lg(10k) is roughly double lg(100),
> the bigger sample would have to cut the miss rate in half just to
> break even, and I just don't think that's going to happen -- it's
> pretty hard to even contrive cases with miss rates > .05 (and from
> there replacement performance is moot).
>
> OTOH, your approach wants n to be larger because presumably the hit
> rate would be better -- amortized replacement performance is k/n * q.
> The value of n that gives the lowest q within reasonable memory
> constraints is what you'd want. hm.
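
To put rough numbers on that break-even argument, here's a throwaway
calculation of both cost models; the miss rate q and replacement cost k
are made-up sample values, only the shape of the curves matters:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double q = 0.05;    /* assumed miss rate */
    double k = 100.0;   /* assumed page replacement cost, arbitrary units */
    double n;

    /* sort-based scheme: (lg(n) + k/n) * q per tuple, amortized */
    for (n = 100; n <= 100000; n *= 10)
        printf("sort-based  n=%6.0f  cost=%.4f\n", n, (log2(n) + k / n) * q);

    /* counter-array scheme: (k/n) * q per tuple, amortized */
    for (n = 100; n <= 100000; n *= 10)
        printf("counter     n=%6.0f  cost=%.4f\n", n, (k / n) * q);

    return 0;
}

With k small enough that k/n vanishes past n = 100, the sort-based cost
is essentially lg(n) * q and roughly doubles between n = 100 and n = 10k,
while the counter-style cost keeps shrinking as n grows -- which is why
that approach wants n as large as memory allows.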
I'm going to experiment with a new implementation incorporating some
of Robert's ideas. Even if you spend a (very) few more cycles in the
satisfies routines getting and setting the cache, it adds peace of
mind to know your replacement is cheap and that degenerate cases won't
be badly penalized -- especially when they are so difficult to test
for. The chance of this getting finished before the end of the
current commit fest is zero, so I'm marking the patch 'returned with
feedback'.
merlin