RFC: Improve CPU cache locality of syscache searches

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: RFC: Improve CPU cache locality of syscache searches
Date: 2021-08-04 16:39:29
Message-ID: CAFBsxsE35VLJ3hHkjJARB3QWqJ0zWeDw-jzqrfzkzMPuD_Ctvw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

CPU caches have multiple levels, so I had an idea to use that concept in
the syscaches.

Imagine if catcache buckets were cacheline-sized. In that case, we would
store the most recently accessed hash values and pointers to catctup, in
addition to the dlist_head:

typedef struct cc_bucket
{
uint32 hashes[4];
catctup *ct[4];
dlist_head;
};

You can think of the bucket here as a 4-way set associative L1 and the list
walk as an inclusive L2. There might be an existing term for this scheme,
but I don't know what it is offhand.

Creating entries:

Instead of calling dlist_push_head(), we call dlist_push_tail() and then
stash the entry in the L1 so it's still quickly available if it's accessed
in the near future. This way, older entries evicted from L1 will tend to
remain toward the front of the list. Walking the list will always go from
oldest to newest, which might have better prefetch behavior (not sure).

Search:

Buckets with <= 4 entries would only need to access a single cacheline to
get the catctup pointer with the matching hash value. If we have to walk
the list to find a match, we stash it in the L1, which is faster than
calling dlist_move_head().

L1 Eviction:

When putting an entry here, we memmove() everything down in each array and
place the pointer and the hash value in the front, evicting the fourth
(possibly NULL) entry.

The buckets would now be 4 times larger on 64-bit machines, but that might
not be a problem if we just divide the initial bucket size by four as well.
If the minimum nbuckets is 2, then the smaller caches would waste a bit of
space, but it doesn't seem too bad. As far as when to rehash the bucket
array, the fill factor would logically jump from 2 to 8. The worst-case
list walk would be longer, but with a better overall memory access pattern,
it might be fine.

I think it would be straightforward to code this up -- the difficulty is
testing and accounting for random effects due to binary layout changes.
Thoughts?

--
John Naylor
EDB: http://www.enterprisedb.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2021-08-04 17:41:41 Re: A varint implementation for PG?
Previous Message Robert Haas 2021-08-04 14:39:01 Re: Lowering the ever-growing heap->pd_lower