From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 16:44:13
Message-ID: BANLkTimyKEyR16vT_0uM+PfBPDNqTspozQ@mail.gmail.com
Lists: pgsql-hackers
On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> I think the basic problem is that the cache pages are so large. It's
>> hard to make them smaller because that increases the cost of accessing
>> the cache, as you already noted, but at the same time, making an
>> eviction decision on a cache that holds 64K entries based on 100 data
>> points seems like waving around a pretty large-caliber weapon in a
>> fairly uncontrolled fashion. Maybe it works OK 90% of the time, but
>> it's hard to avoid the niggling fear that someone might accidentally
>> get shot.
>
> Right...it's essentially a rolling statistical sampling of transaction
> IDs based on past accesses. Going from, say, 100 to 1000 will probably
> drop your error margin quite a bit, but I really wonder how much benefit
> you can get from going past that.
An interesting idea might be to forcibly cause a replacement every 100
tuples (or every 1000 tuples) and see if that causes a noticeable
performance regression. If it doesn't, that's a good data point.
I think the sour spot for this whole idea is likely to be the case
where you have a workload that is 90% read and 10% write, or something
like that. On an all-read workload (like pgbench -S), you're
hopefully going to converge to a state where all the hint-bits are
set, once VACUUM kicks in. And on an all-write workload I think that
you might lose the effect in the noise. Not sure if we have an easy
way to set up such a workload with pgbench, though.
>> Just for the sake of argument, let's suppose we made an array with 64K
>> elements, each one representing one 64K chunk of the XID space. Each
>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>> memory... probably not good, but I'm just thinking out loud here.
>> Every time we're asked about an XID, we bump the corresponding
>> counter. Periodically (say, every 64K probes) we zip through all the
>> counters and consider performing a cache replacement. We look at the
>> not-already-cached page that has the highest counter value and compare
>> it to the counter value for the cached pages. If it is higher and the
>> difference exceeds some safety margin (to provide hysteresis against thrashing),
>> then we do the replacement. I think something like this would allow
>> you to have a lot more information available to make a direct
>> apples-to-apples comparison at cache replacement time, as compared
>> with what you have now.
>
> yeah -- what you've done here is basically two things: 1. by mapping
> static ranges you get to skip the sort (but not the scan) during the
> replacement, and 2. by vastly expanding the sampling size you
> eliminate thrashing from noise in the rolling sample. This comes at a
> significant memory cost and loss of some nimbleness in the cache (I'm
> not completely sure more aggressive replacement is 'bad') -- although
> that could be mitigated by replacing at more frequent intervals.
Well, that gets at an interesting question: how often SHOULD we
replace cache pages? My gut is that it's 10K-100K and your gut is
that it's 100-1000, which is a hundred-fold difference. Seems like we
need some kind of empirical data to decide what actually works well in
real life.
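
For concreteness, here's a rough standalone C sketch of the counter-array
replacement decision quoted above. All the names and constants are invented
for illustration (this is not patch code), and what to do with the counters
between sampling periods is left open:

    #include <stdint.h>

    #define XID_BUCKETS   65536     /* one counter per 64K-XID chunk: 256kB total */
    #define CACHE_PAGES   4         /* hypothetical number of hint-bit cache pages */
    #define SAFETY_MARGIN 1024      /* hypothetical margin to avoid ping-ponging */

    static uint32_t hit_count[XID_BUCKETS];     /* probes seen per XID chunk */
    static int      cached_chunk[CACHE_PAGES];  /* which chunk each cache page holds */

    static int
    chunk_is_cached(int chunk)
    {
        for (int slot = 0; slot < CACHE_PAGES; slot++)
            if (cached_chunk[slot] == chunk)
                return 1;
        return 0;
    }

    /*
     * Called every ~64K probes: find the hottest not-already-cached chunk and
     * replace the coldest cached page only if the difference exceeds the margin.
     */
    static void
    consider_replacement(void)
    {
        int      hottest = -1;
        uint32_t hottest_hits = 0;
        int      coldest_slot = 0;

        for (int chunk = 0; chunk < XID_BUCKETS; chunk++)
        {
            if (!chunk_is_cached(chunk) && hit_count[chunk] > hottest_hits)
            {
                hottest_hits = hit_count[chunk];
                hottest = chunk;
            }
        }

        for (int slot = 1; slot < CACHE_PAGES; slot++)
            if (hit_count[cached_chunk[slot]] < hit_count[cached_chunk[coldest_slot]])
                coldest_slot = slot;

        if (hottest >= 0 &&
            hottest_hits > hit_count[cached_chunk[coldest_slot]] + SAFETY_MARGIN)
            cached_chunk[coldest_slot] = hottest;   /* load that chunk's page here */

        /*
         * Whether to zero or age the counters between sampling periods is left
         * open here; this sketch just keeps accumulating.
         */
    }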
> Your range counting strategy might work -- I think doing it right, so
> that you don't have to map the entire range, is going to require two
> levels of ranges such that you activate a very wide range first before
> looking at 64k subranges. That way you could reduce memory consumption
> significantly without sorting during the replacement.
I don't think you need two levels of ranges. Just keep track of the
minimum and maximum XID covered by the array (these must always be 64M
transactions apart, say, if the array has 1024 entries, and each cache
page covers 64K XIDs). When you see a given XID, and it's between the
minimum and the maximum, then bump the counter in bucket XID/64K. If
you see an XID that is older than the minimum, then ignore it; we
won't consider devoting cache pages to really old XIDs. If you see an
XID that is newer than the maximum, then just increase the minimum and
maximum by enough to put the new XID in range. For every 64K
increment by which you advance the minimum and maximum, you need to
zero one entry in the array (since it's no longer covering the same
portion of the XID space) but all the other entries remain intact.
(There is a small problem here of how to initially set the minimum and
maximum to something sensible, and how to make sure that
they don't get out of whack if a backend sits idle for a few billion
transactions, but that seems like it should be solvable.)
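
In made-up standalone C, ignoring XID wraparound and the initialization
question entirely (names here are invented for illustration), the
bookkeeping I have in mind looks roughly like this:

    #include <stdint.h>

    #define CHUNK_SIZE  65536U              /* XIDs covered by one counter bucket */
    #define NBUCKETS    1024                /* window spans 1024 * 64K = 64M XIDs */

    typedef uint32_t TransactionId;         /* stand-in; wraparound not handled */

    static uint32_t      bucket_hits[NBUCKETS];
    static TransactionId window_min;        /* oldest XID the window covers */
    static TransactionId window_max;        /* window_min + 64M - 1 */

    /*
     * How to pick the initial window, and how to resync it cheaply after a
     * long idle period, is glossed over here -- as noted above.
     */
    static void
    note_xid_probe(TransactionId xid)
    {
        if (xid < window_min)
            return;                         /* too old: never cache that far back */

        /*
         * Slide the window forward just far enough to cover a newer XID.  Each
         * 64K step recycles exactly one bucket (the one that used to cover the
         * oldest chunk now covers the newest, so zero it); the rest stay intact.
         */
        while (xid > window_max)
        {
            bucket_hits[(window_min / CHUNK_SIZE) % NBUCKETS] = 0;
            window_min += CHUNK_SIZE;
            window_max += CHUNK_SIZE;
        }

        bucket_hits[(xid / CHUNK_SIZE) % NBUCKETS]++;
    }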
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company