Re: WIP: dynahash replacement for buffer table

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: dynahash replacement for buffer table
Date: 2014-11-09 20:58:15
Message-ID: 20141109205815.GA28007@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-11-07 11:08:57 -0500, Robert Haas wrote:
> On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > * In benchmarks it becomes apparent that the dynamic element width makes
> > some macros like CHashTableGetNode() and
> > CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
> > can't figure how to build an efficient expression for the
> > target. There's two ways to address this:
> > a) Actually morph chash into something that will be included into the
> > user sites. Since I think there'll only be a limited number of those
> > that's probably acceptable.
>
> Have you benchmarked this? How much does it help?

I've done quick benchmarks, and it helped in the 2-3% range in some
tests and was a wash in others. What I did wasn't actually to include
it in buf_table.c, but to hardcode the element size in chash. That's
obviously not the real solution.
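
To make the cost concrete, here is a toy sketch (not chash's actual
layout; names are illustrative) of why a hardcoded element width helps:
with a run-time stride the compiler must emit a multiply for every node
lookup, while a compile-time constant folds into scaled addressing or
gets strength-reduced out of the loop entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy node: a hash code followed by variable-size payload. */
typedef struct
{
    uint32_t    hashcode;
    char        data[];
} Node;

/* dynamic width: stride only known at run time, so each call costs
 * a general multiply */
static inline Node *
get_node_dynamic(char *arena, size_t stride, size_t i)
{
    return (Node *) (arena + i * stride);
}

/* hardcoded width: the stride is a compile-time constant, so the
 * multiply can be folded into the addressing mode */
#define NODE_STRIDE 64
static inline Node *
get_node_fixed(char *arena, size_t i)
{
    return (Node *) (arena + i * NODE_STRIDE);
}
```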

> > b) Simplify the access rules. I quite seriously doubt that the
> > interleaving of garbage and freelists is actually necessary/helpful.
>
> That wasn't added for nothing. I did a bunch of performance testing
> on this vs. dynahash (I think the test code is included in the patch)
> and the region of memory containing the freelists practically caught
> fire. Spreading them out like this greatly improved the performance
> under heavy concurrency, but even with those changes the freelists
> were extremely hot. Now, since this is the buffer mapping table
> rather than a tight loop around hash table manipulation, the
> difference may not be enough to measure. But on a microbenchmark it
> *definitely* was.

I think it'd be much less visible for the buffer manager, but it might
still be visible...
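
For readers following along, the layout property under discussion can be
sketched like this (hypothetical names and sizes, not the patch's actual
structures): if several freelist heads share one 64-byte cache line,
every atomic update to any of them bounces that line between CPUs, which
is the "caught fire" effect; giving each head its own line, as the
interleaved/spread-out layout does, avoids the false sharing.

```c
#include <assert.h>
#include <stdatomic.h>

#define CACHELINE_SIZE 64

/* One freelist head per cache line: the pad ensures updates to one
 * head never invalidate a line holding a neighboring head. */
typedef struct
{
    _Atomic(void *) head;
    char        pad[CACHELINE_SIZE - sizeof(_Atomic(void *))];
} PaddedFreeList;
```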

> > * There's several cases where it's noticeable that chash creates more
> > cacheline misses - that's imo a good part of why the single threaded
> > performance regresses. There's good reasons for the current design
> > without a inline bucket, namely that it makes the concurrency handling
> > easier. But I think that can be countered by still storing a pointer -
> > just to an element inside the bucket. Afaics that'd allow this to be
> > introduced quite easily?
>
> It's not obvious to me how that would work. If you just throw those
> elements on the garbage lists with everything else, it will soon be
> the case that one bucket header ends up using the cell stored in some
> other bucket, which isn't going to be awesome. So you need some kind
> of more complex scheme to preserve locality.

Treating the element inside the bucket as a kind of one-element freelist
seems quite workable to me. Test whether it's marked deleted, then scan
the hazard pointer list to see whether it can be reused. If yes, great.
If not, fall back to the current code. The fact that the element is
cacheline-local will make the test for its deletedness almost free. Yes,
the hazard pointer scan surely isn't free - but at least for cases like
bufmgr, where reads are never less frequent than writes and very often
*much* more frequent, I'm pretty sure it'd be a noticeable win.
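
A very rough, single-threaded sketch of that scheme (all names are
hypothetical; a real version would need atomics and the patch's
marked-pointer conventions): each bucket embeds one node, and before
falling back to the shared freelists an inserter checks whether the
inline node is deleted and unreferenced by any hazard pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node
{
    unsigned    hashcode;
    bool        deleted;
} Node;

typedef struct Bucket
{
    Node        inline_node;    /* cacheline-local: checking it is cheap */
    Node       *chain;          /* overflow nodes live elsewhere */
} Bucket;

/* Returns the inline node if it can be reused, or NULL meaning the
 * caller must take the current (shared freelist) path. */
static Node *
try_reuse_inline(Bucket *b, Node **hazard, int nhazard)
{
    if (!b->inline_node.deleted)
        return NULL;
    for (int i = 0; i < nhazard; i++)
        if (hazard[i] == &b->inline_node)
            return NULL;        /* a reader may still dereference it */
    b->inline_node.deleted = false;
    return &b->inline_node;
}
```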

> > I'm afraid that I can't see us going forward unless we address the
> > noticeable single threaded penalties. Do you see that differently?
>
> I'm not sure. We're talking about something on the order of half a
> percent on my tests, and lots of changes cause things to bounce around
> by that much. I'm not sure it makes sense to get too worked up about
> that if the alternative is to add a big pile of new complexity.

I saw things in the range of 3-4% on my laptop.

> > * There's some whitespace damage. (space followed by tab, followed by
> > actual contents)
>
> That's the least of our problems at this point.

Sure, but still worth cleaning up ;)

> > * I still think it's a good idea to move the full memory barriers into
> > the cleanup/writing process by doing write memory barriers when
> > acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
> > in the process testing for the hazard pointer.
>
> Can you cite any existing precedent in other system software? Does
> Linux do anything like that, for example?

No, I can't right now. I think it follows from the cache coherency
rules, but I can understand suspicion there.

> > * Shouldn't we relax the CPU in a couple places like CHashAllocate(),
> > CHashBucketScan()'s loops?
>
> You mean like, have it sleep the way SpinLockAcquire() does?

Not actually like in s_lock(), but rather like the SPIN_DELAY() used in
the spinlock code for retries.

> Yeah, possibly, but that adds non-trivial code complexity which may
> not be worth it if - as is hoped for - there's no real contention
> there.

I think just adding a pg_spin_delay() before retrying should be trivial.
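
The shape I have in mind is roughly the following (a sketch, not the
patch's code; pg_spin_delay() here is a stand-in for the
SPIN_DELAY()/PAUSE hint in PostgreSQL's spinlock implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* PAUSE-style CPU relax hint; a no-op on unknown platforms. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define pg_spin_delay() __asm__ __volatile__(" rep; nop\n")
#else
#define pg_spin_delay() ((void) 0)
#endif

/* CAS loop in the style of a CHashAllocate() retry: relax the CPU
 * before each retry instead of hammering the contended cache line. */
static uintptr_t
fetch_add_with_relax(_Atomic uintptr_t *v, uintptr_t delta)
{
    uintptr_t   old = atomic_load_explicit(v, memory_order_relaxed);

    while (!atomic_compare_exchange_weak(v, &old, old + delta))
        pg_spin_delay();        /* cheap pipeline hint, not a sleep */
    return old;
}
```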

> > * I don't understand right now (but then I'm in a Hotel bar) why it's
> > safe for CHashAddToGarbage() to clobber the hash value?
> > CHashBucketScan() relies on the chain being ordered by hashcode. No?
> > Another backend might just have been past the IsMarked() bit and look
> > at the hashcode?
>
> I think the worst case scenario is that we falsely conclude that
> there's a hash match when there really isn't. The ensuing memcmp()
> will clarify matters.

Hm. Let me think about it for a while.

I was thinking that a spurious cmp < 0 could cause us to miss a
match - but that could only happen if the match has just been removed
from the list. Which is fine. Perhaps that case deserves to be
mentioned in the comment?

* Another thing I'm wondering about here is whether it's possible that
somebody is currently walking an "older version" of the bucket list
(i.e. is currently at an already unlinked element) and then visits a
newly inserted element with an 'apparently' out of order hash
value. That seems possible because the inserter might not have seen the
unlinked element. Afaics that's not a problem for searches - but I'm not
sure whether it couldn't cause a problem for concurrent inserts and
deletes.

* One thing I noticed while looking at that part of code is:
+    h = target_node->un.hashcode;
+    if (h == hashcode)
+        cmp = memcmp(CHashNodeGetItem(target_node), key,
+                     table->desc.key_size);
+    else if (h > hashcode)
+        cmp = 1;
+    else
+        cmp = -1;

Compilers are entirely free to reload local variables from memory
(i.e. use target_node->un.hashcode instead of h) in situations like
these. That's why I used the ACCESS_ONCE macros in my freelist
patch. The same issue exists in a couple of other places (like looking
at ->next). I'm *very* wary of relying on the compiler not to do stuff
like that. I unfortunately think you'll need similar things here.
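
For reference, the forced single load looks like this (matching the
Linux kernel's classic ACCESS_ONCE definition; the node struct below is
a stand-in, not chash's real header). Reading through a volatile lvalue
forbids the compiler from discarding h and re-reading
target_node->un.hashcode later:

```c
#include <assert.h>

/* __typeof__ is a GCC/Clang extension, as used by the kernel macro. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *) &(x))

typedef struct
{
    unsigned    hashcode;
} CHashNodeHdr;             /* illustrative stand-in for the node header */

static int
compare_hash(CHashNodeHdr *node, unsigned hashcode)
{
    unsigned    h = ACCESS_ONCE(node->hashcode);    /* exactly one load */

    if (h == hashcode)
        return 0;           /* caller would memcmp() the keys here */
    return (h > hashcode) ? 1 : -1;
}
```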

* I noticed the following comment:
+     * Note: We can't begin this operation until the clearing of the
+     * garbage list has been committed to memory, but since that was
+     * done using an atomic operation no explicit barrier is needed
+     * here.
+     *

Uh, I don't think that holds true. A memory barrier on one CPU doesn't
forbid all kinds of reordering on others.
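
A sketch of the pairing the quoted comment glosses over (illustrative,
single-threaded here; the fences stand in for PostgreSQL's
pg_write_barrier()/pg_read_barrier()): a barrier or atomic RMW in the
clearing process only orders *that* CPU's stores, so the process reading
the flag still needs its own matching read barrier before it may trust
the dependent data.

```c
#include <assert.h>
#include <stdatomic.h>

static int  garbage_list_cleared;   /* the "committed" state */
static int  protected_data;         /* data guarded by that state */

static void
clearer(void)
{
    protected_data = 42;
    atomic_thread_fence(memory_order_release);  /* pg_write_barrier() */
    garbage_list_cleared = 1;
}

static int
scanner(void)
{
    if (!garbage_list_cleared)
        return -1;
    atomic_thread_fence(memory_order_acquire);  /* pg_read_barrier() */
    return protected_data;  /* without the acquire, this load may be stale */
}
```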

* How is the consistency of the element data guaranteed? Afaics there
very well could be two concurrent CHashInsert()s for the same key
interleaving inside their respective memcpy()s. It'll often be fine
for small element sizes, but even there the compiler might resort to
a char-by-char memcpy.

* Should CHashInsert perhaps be renamed to CHashSetKey or CHashUpdate -
since it's not actually just inserting?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
