From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Experimenting with hash join prefetch
Date: 2020-02-04 03:38:19
Message-ID: CA+hUKGL1HQaf15cU=D1nRPOhzfA4K-s=9ZsyLYPYc5tN+MykuA@mail.gmail.com
Lists: pgsql-hackers
On Tue, Feb 4, 2020 at 2:31 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> How much of the benefit here comes from the prefetching, and how much
> just from writing the code in a manner that allows for more out-of-order
> execution? Because there are no dependencies preventing execution of the
> next queued tuple anymore, I'd assume that this is a good part of what
> helps?
A small part of the speed-up does indeed seem to come from that sort of thing.
> Code like this really should look roughly like:
>
> while (true):
>     have_skew = False
>
>     # get tuples
>     for i in 0..batchsize:
>         tuples[i] = ExecProcNode(outerNode)
>         if tuples[i] == NULL:
>             # have slowpath handle this
>             break
>
>     # compute their hash values
>     for i in 0..batchsize:
>         hashvalues[i] = ExecHashGetHashValue(tuples[i])
>
>     # check whether they go into skew buckets
>     if have_skew_table:
>         for i in 0..batchsize:
>             skewbuckets[i] = ExecHashGetSkewBucket(tuples[i], hashvalues[i])
>             if skewbuckets[i] != INVALID_SKEW_BUCKET_NO:
>                 have_skew = True
>         if have_skew:
>             # handle everything here
>             continue
>
>     # assume there are no skew tuples going forward, all handled above
>
>     # compute bucket/batch for all tuples
>     have_into_batchfile = False
>     for i in 0..batchsize:
>         hashbuckets[i], batchnos[i] = ExecHashGetBucketAndBatch(hashvalues[i])
>         if batchnos[i] != hashtable->curbatch:
>             have_into_batchfile = True
>
>     if have_into_batchfile:
>         # slowpath
>         continue
>
>     # allocate all tuples
>     for i in 0..batchsize:
>         hjtuples[i] = alloc_mintuple(tuples[i])
>
>     # and finally insert them
>     for i in 0..batchsize:
>         hjtuples[i].next = buckets[hashbuckets[i]]
>         buckets[hashbuckets[i]] = hjtuples[i]
Hmm. I see your point: don't use the batch number for a branch
immediately, and so on. I thought about a multi-pass system a bit
like that too, just for prefetching purposes, though I haven't tested
it due to the lack of required infrastructure. I guess you need a way
to get the next N tuples and make them all simultaneously available
without copying them yet.
For this experiment, I speculated that it might be better to be
continually inserting a short distance behind so there are no
batch-boundary stalls anyway. Admittedly, it's pretty hard to choose
the right queue depth if your loop includes ExecProcNode() because you
have no idea what that actually does, but on the other hand, you do
need to put enough cycles between prefetch and fetch to see benefits,
so maybe that's not so crazy. Perhaps to get more speedup I'd need to
consider dependencies along the lines you're describing, but also find
a way to keep prefetch and insertion far enough apart to win. Hmm.
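To make that concrete, here's a cut-down standalone sketch of the
shape of the experiment (illustrative only, not the actual patch: the
queue layout, DISTANCE and hash_key() are made up here, and all the
skew/batch/NULL handling is omitted):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define DISTANCE 8          /* queue depth between prefetch and insert */

    typedef struct Node
    {
        struct Node *next;
        uint32_t     key;
    } Node;

    typedef struct
    {
        uint32_t key;
        uint32_t bucket;
    } Pending;

    static inline uint32_t
    hash_key(uint32_t key)
    {
        return key * 2654435761u;   /* cheap stand-in for the real hash */
    }

    static void
    insert_one(Node **buckets, uint32_t bucket, uint32_t key)
    {
        /* stand-in for alloc_mintuple() + bucket insertion; no error
         * handling, this is only a sketch */
        Node *node = malloc(sizeof(Node));

        node->key = key;
        node->next = buckets[bucket];
        buckets[bucket] = node;
    }

    void
    build_with_prefetch(Node **buckets, uint32_t bucket_mask,
                        const uint32_t *keys, size_t nkeys)
    {
        Pending queue[DISTANCE];
        size_t  npending;

        for (size_t i = 0; i < nkeys; i++)
        {
            uint32_t bucket = hash_key(keys[i]) & bucket_mask;
            size_t   slot = i % DISTANCE;

            /* start pulling the new key's bucket header into cache ... */
            __builtin_prefetch(&buckets[bucket]);

            /* ... and insert the key queued DISTANCE steps ago, whose
             * bucket header should have arrived in cache by now */
            if (i >= DISTANCE)
                insert_one(buckets, queue[slot].bucket, queue[slot].key);

            queue[slot].key = keys[i];
            queue[slot].bucket = bucket;
        }

        /* drain whatever is still queued, oldest first */
        npending = nkeys < DISTANCE ? nkeys : DISTANCE;
        for (size_t j = 0; j < npending; j++)
        {
            Pending *p = &queue[(nkeys - npending + j) % DISTANCE];

            insert_one(buckets, p->bucket, p->key);
        }
    }

The interesting knob is DISTANCE: too small and the prefetch hasn't
completed by the time you come back to insert; too large and the line
may already have been evicted again, which is why choosing it blind
around ExecProcNode() is hard.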
> I would bet this helps significantly even if there's no prefetch
> instruction - but using explicit prefetching might help further. Also
> allows us to increase branch costs, because we can amortize them over a
> few tuples.
Yes, I did observe that performance improves a little even with my
simple insert-queue patch if you comment out the pg_prefetch_mem()
call, and figured that was execution order at work (though I didn't
study the effect up close with perf etc due to lack of PMC access on
this host); still, the prefetch apparently supplies most of the
speed-up I saw. It stands to reason that hash joins should benefit
from explicit prefetching (even though lots of pixels have been
expended explaining that explicit prefetching is often a mistake, cf
the Linux experience), since hash joins are basically cache miss
machines par excellence, at least in the build phase with unique
keys.
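(pg_prefetch_mem() is just a thin portability wrapper over the
compiler builtin, roughly like the sketch below, so commenting it out
removes exactly one prefetch instruction per queued tuple.)

    /* sketch; the real macro in the patch may differ in detail */
    #if defined(__GNUC__) || defined(__INTEL_COMPILER)
    #define pg_prefetch_mem(a) __builtin_prefetch(a)
    #else
    #define pg_prefetch_mem(a) ((void) 0)
    #endif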
> > create unlogged table t as select generate_series(1, 100000000)::int i;
> > select pg_prewarm('t');
> > set work_mem='8GB';
> >
> > select count(*) from t t1 join t t2 using (i);
> >
> >            master   patched/N=1      patched/N=4
> > workers=0  89.808s  80.556s (+11%)   76.864s (+16%)
> > workers=2  27.010s  22.679s (+19%)   23.503s (+15%)
> > workers=4  16.728s  14.896s (+12%)   14.167s (+18%)
> >
> > Just an early experiment, but I thought it looked promising enough to share.
>
> Nice!
It's starting to look like prefetching of build + prefetching of probe
+ reordering-friendly code + icache-friendly tight loops could add up
to some serious gains, but some architectural stuff is needed for much
of that, hence my lower aim :-)
Other things I noticed on that hacking escapade: the patch generates
PREFETCHT0 instructions on my compiler, but there are also "write" and
"locality" flags you can pass to __builtin_prefetch() to get
PREFETCHW, and variants for predictions about how valuable the data is
after the next access; write=1 slowed down my initial tests for
reasons I don't fully understand, but I didn't look further once I
realised you need -march=<broadwell or later> anyway. I didn't look
into the locality/eviction stuff.
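For anyone who wants to experiment: the builtin's full signature is
__builtin_prefetch(addr, rw, locality), so the variants look like
this (a sketch; the pointer argument is arbitrary):

    void
    prefetch_variants(const void *p)
    {
        /* read, keep in all cache levels: PREFETCHT0 on x86 (the default) */
        __builtin_prefetch(p, 0, 3);

        /* write: PREFETCHW, but only when the target has the PRFCHW
         * feature (-march=broadwell or later on Intel); otherwise the
         * compiler falls back to a read prefetch */
        __builtin_prefetch(p, 1, 3);

        /* read, no expected reuse: PREFETCHNTA */
        __builtin_prefetch(p, 0, 0);
    }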