From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
Subject: Re: Further reduction of bufmgr lock contention
Date: 2006-08-26 03:23:35
Message-ID: 200608260323.k7Q3NZZ02977@momjian.us
Lists: pgsql-hackers
Is this being kept for 8.3?
---------------------------------------------------------------------------
Tom Lane wrote:
> A couple months ago we were discussing partitioning the buffer mapping
> table so as to reduce contention for the BufMappingLock. The discussion
> stalled after I complained about the uncertainty of shared memory usage
> arising from having a separate hashtable for each partition (since
> different numbers of buffers might fall into each partition at different
> times). I've been thinking about it more after seeing an example from
> Tatsuo that seems to be exactly the same problem as Gavin saw.
>
> We could fix the uncertain-usage objection if we stick with a single
> hashtable and ensure that concurrent access to different partitions of
> it is safe. I believe this is easily doable, if we make hashtables
> used in this mode allocate the maximum allowed number of buckets
> immediately during hashtable initialization. (Since shared hashtables
> already have a fixed maximum directory size, they already have an upper
> bound on the number of buckets, so this loses no flexibility.) Then
> there will be no on-the-fly bucket splits, and that means that accesses
> to different hash buckets operate completely independently. Therefore,
> we can regard the hashtable as logically partitioned on the basis of any
> classification of entries that will uniquely assign hash buckets to
> classes --- taking the low order bits of entry hash codes will do fine.
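
(As a standalone illustration of the low-order-bits scheme; NUM_PARTITIONS
and partition_for() here are invented names for this sketch, not anything
in bufmgr:)

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical partition count; a power of two so the mask below
     * is equivalent to hashcode % NUM_PARTITIONS. */
    #define NUM_PARTITIONS 16

    /* Assign an entry to a partition by the low-order bits of its
     * hash code, as described above. */
    static uint32_t
    partition_for(uint32_t hashcode)
    {
        return hashcode & (NUM_PARTITIONS - 1);
    }

    int
    main(void)
    {
        printf("hash 0xdeadbeef -> partition %u\n",
               (unsigned) partition_for(0xdeadbeefU));
        return 0;
    }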
>
> The only changeable state that is shared across all buckets is the entry
> freelist and the "nentries" counter. We could protect these with a
> spinlock (one spinlock is sufficient since changes to nentries go along
> with addition or removal of freelist entries).
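
(A toy model of that shared state, with a pthread mutex standing in for
the spinlock and all names invented; the point is that a single lock
covers both the freelist and nentries, since they change together:)

    #include <pthread.h>
    #include <stddef.h>

    typedef struct HashEntry
    {
        struct HashEntry *next;     /* freelist link or bucket-chain link */
        /* ... key and payload would follow ... */
    } HashEntry;

    /* The only state shared across all buckets.  nentries changes
     * exactly when an entry moves on or off the freelist, so one lock
     * protects both. */
    static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
    static HashEntry *freelist = NULL;
    static long nentries = 0;

    static HashEntry *
    alloc_entry(void)
    {
        HashEntry *e;

        pthread_mutex_lock(&freelist_lock);
        e = freelist;
        if (e != NULL)
        {
            freelist = e->next;
            nentries++;
        }
        pthread_mutex_unlock(&freelist_lock);
        return e;                   /* NULL means no free entries left */
    }

    static void
    free_entry(HashEntry *e)
    {
        pthread_mutex_lock(&freelist_lock);
        e->next = freelist;
        freelist = e;
        nentries--;
        pthread_mutex_unlock(&freelist_lock);
    }

    int
    main(void)
    {
        static HashEntry pool[4];
        int i;

        for (i = 0; i < 4; i++)     /* seed the freelist */
        {
            pool[i].next = freelist;
            freelist = &pool[i];
        }
        free_entry(alloc_entry());  /* one round trip through the lock */
        return 0;
    }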
>
> Usage of a partitioned hash table would then be like
>
> compute hashtable lookup key;
> entryhashcode = calc_hash(lookup key);
> partitionnumber = entryhashcode % NumPartitions;
> LWLockAcquire(PartitionLock[partitionnumber], ...);
> manipulate hashtable;
> LWLockRelease(PartitionLock[partitionnumber]);
>
> We could do this without changing the API of hash_search, but then we'd
> be computing the entry hashcode twice, so I'm inclined to provide an
> additional entry point that takes a precalculated hashcode.
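
(Putting the pieces together as a self-contained toy: pthread rwlocks
stand in for LWLocks, the FNV-1a hash and every identifier are invented
for illustration, and lookup_with_hash() plays the role of the proposed
entry point that takes a precalculated hashcode.  Because NUM_PARTITIONS
divides NUM_BUCKETS, all entries in a given bucket share one partition,
so accesses under different partition locks never touch the same chain:)

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_PARTITIONS 16       /* power of two */
    #define NUM_BUCKETS 1024        /* power of two, allocated up front,
                                     * never split on the fly */

    typedef struct Entry
    {
        struct Entry *next;         /* bucket chain */
        char key[16];
        int value;
    } Entry;

    static Entry *buckets[NUM_BUCKETS];
    static pthread_rwlock_t partition_lock[NUM_PARTITIONS];

    /* Any hash function will do; FNV-1a keeps the toy short. */
    static uint32_t
    calc_hash(const char *key)
    {
        uint32_t h = 2166136261u;

        for (; *key; key++)
        {
            h ^= (unsigned char) *key;
            h *= 16777619u;
        }
        return h;
    }

    /* Lookup taking the caller's precomputed hashcode, so the hash is
     * not computed twice (once for the partition, once in here). */
    static Entry *
    lookup_with_hash(const char *key, uint32_t hashcode)
    {
        Entry *e;

        for (e = buckets[hashcode & (NUM_BUCKETS - 1)]; e; e = e->next)
            if (strcmp(e->key, key) == 0)
                return e;
        return NULL;
    }

    int
    main(void)
    {
        static Entry stock = { NULL, "page:42", 7 };
        uint32_t h, part;
        Entry *e;
        int i;

        for (i = 0; i < NUM_PARTITIONS; i++)
            pthread_rwlock_init(&partition_lock[i], NULL);

        /* insert one entry, following the same locking pattern */
        h = calc_hash(stock.key);
        part = h & (NUM_PARTITIONS - 1);
        pthread_rwlock_wrlock(&partition_lock[part]);
        stock.next = buckets[h & (NUM_BUCKETS - 1)];
        buckets[h & (NUM_BUCKETS - 1)] = &stock;
        pthread_rwlock_unlock(&partition_lock[part]);

        /* the usage pattern from the message */
        h = calc_hash("page:42");               /* hash the lookup key */
        part = h & (NUM_PARTITIONS - 1);        /* low-order bits */
        pthread_rwlock_rdlock(&partition_lock[part]);
        e = lookup_with_hash("page:42", h);     /* manipulate hashtable */
        pthread_rwlock_unlock(&partition_lock[part]);

        printf("found=%d value=%d\n", e != NULL, e ? e->value : -1);
        return 0;
    }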
>
> Potential downsides of applying this idea to the buffer mapping table:
>
> 1. Reassigning a buffer to a new page will (usually) require two cycles
> of LWLockAcquire/Release for the two different partitions involved.
> Since this path also requires at least a read() kernel call (maybe a
> write() too), I don't think there'll be any meaningful slowdown.
>
> 2. The current logic for reassigning a buffer attempts to make a
> hashtable entry for its new page number (to detect collisions) before
> releasing the old hashtable entry. This would only be safe if we held
> both partition LWLocks concurrently; which seems bad for concurrency,
> plus avoiding deadlock would complicate the code significantly. I'm
> inclined to release the old entry and then try to insert the new one,
> holding only one lock at a time. If the insertion fails (because
> someone was concurrently loading the same page), we'd have to throw the
> buffer onto the freelist rather than allowing it to retain its previous
> valid data. Which is slightly annoying, but in practice the case
> probably doesn't happen often enough to be worth worrying about.
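
(A control-flow sketch of that one-lock-at-a-time protocol; the helpers
are stubs and every name is invented, so this shows only the locking
order and the failure path, not real bufmgr code:)

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_PARTITIONS 16

    static pthread_rwlock_t partition_lock[NUM_PARTITIONS];

    /* Stubs standing in for the real hashtable and freelist operations. */
    static void hash_delete(uint32_t oldtag) { (void) oldtag; }
    static bool hash_insert(uint32_t newtag) { (void) newtag; return true; }
    static void put_on_freelist(int buf)     { (void) buf; }

    /* Remap buffer "buf" from oldtag to newtag, never holding more than
     * one partition lock, so no lock-ordering rules are needed. */
    static bool
    reassign_buffer(int buf, uint32_t oldtag, uint32_t oldhash,
                    uint32_t newtag, uint32_t newhash)
    {
        /* 1. Release the old mapping under the old partition's lock. */
        pthread_rwlock_wrlock(&partition_lock[oldhash & (NUM_PARTITIONS - 1)]);
        hash_delete(oldtag);
        pthread_rwlock_unlock(&partition_lock[oldhash & (NUM_PARTITIONS - 1)]);

        /* 2. Insert the new mapping under the new partition's lock. */
        pthread_rwlock_wrlock(&partition_lock[newhash & (NUM_PARTITIONS - 1)]);
        if (!hash_insert(newtag))
        {
            /* Someone concurrently loaded the same page: give the
             * buffer up to the freelist instead of letting it keep its
             * now-unmapped old contents. */
            pthread_rwlock_unlock(&partition_lock[newhash & (NUM_PARTITIONS - 1)]);
            put_on_freelist(buf);
            return false;
        }
        pthread_rwlock_unlock(&partition_lock[newhash & (NUM_PARTITIONS - 1)]);
        return true;
    }

    int
    main(void)
    {
        int i;

        for (i = 0; i < NUM_PARTITIONS; i++)
            pthread_rwlock_init(&partition_lock[i], NULL);
        return reassign_buffer(0, 1, 0xaaaa, 2, 0x5555) ? 0 : 1;
    }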
>
> 3. Taking the freelist spinlock is new computation that wasn't there
> before. But, again, it's only needed in code paths that will also be
> doing a kernel call.
>
> If we do this we should probably also handle the lmgr lock tables the
> same way (partially reverting my partition-the-LockMgrLock patch of a
> couple months ago). However, downside #3 might be a stronger objection
> for lmgr, since it can create or destroy lock objects without necessarily
> doing any I/O.
>
> Thoughts, objections?
>
> regards, tom lane
>
--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +