From: | Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: XPRS |
Date: | 2019-08-23 15:19:00 |
Message-ID: | 20190823151900.egnzprlcyejq6w4p@development |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Aug 22, 2019 at 11:41:45AM +1200, Thomas Munro wrote:
>Hello,
>
>After rereading some old papers recently, I wanted to share some
>thoughts about XPRS and modern PostgreSQL. XPRS stood for "eXtended
>Postgres on RAID and Sprite", and was a research project done nearly
>three decades ago at Berkeley by the POSTGRES group working with
>operating system researchers, on a shared memory machine with a lot of
>CPUs for the time (12).
>
>As far as I can tell (and if anyone knows how to find out, I'd love to
>know), the parallel query parts of the XPRS system described in
>various writings by Wei Hong and Michael Stonebraker are actually
>present in the POSTGRES 4.2 tarball and were removed in Postgres95.
>Evidence: 4.2's parallel code is all wrapped in #ifdef sequent, and we
>know that XPRS ran on a Sequent Symmetry; the parallel hash join
>algorithm matches the description given in Hong's doctoral thesis; the
>page and range based parallel scans seen in various places also seem
>to match.
>
>Hong's thesis covers a lot of material and I certainly haven't
>understood all of it, but basically it's about how to share CPU, IO
>bandwidth and memory out fairly and dynamically at execution time so
>you're using the whole system efficiently. Facets of this problem
>obviously keep coming up on this mailing list (see practically any
>discussion of parallel degree, admission control, prefetch or
>work_mem, all of which we punt on by deferring to user supplied GUCs
>and scan size-based heuristics).
>
>Here are three things I wanted to highlight from Hong's 1991 paper[1]
>(later writings elaborate greatly but this one is short and much
>easier to read than the thesis and sets the scene):
>
>1. "The overall performance goal of a multiprocessor database system
>is to obtain increased throughput as well as reduced response time in
>a multiuser environment. The objective function that XPRS uses for
>query optimization is a combination of resource consumption and
>response time as follows: cost = resource_consumption + w *
>response_time, where w is a system-specific weighting factor."
>
>2. The "Buffer-Size-Independent Hypothesis" (here meaning work_mem):
>"The choice of the best sequential plan is insensitive to the amount
>of buffer space available as long as the buffer size is above the hash
>join threshold" (with a caveat about localised special cases that can
>be handled by choosing alternative subplans at runtime).
>
>3. The "Two-Phase Hypothesis": "The best parallel plan is a
>parallelization of the best sequential plan."
>
>I read all of that a while back while working on bits of parallel
>query machinery (though I only realised after the fact that I'd
>implemented parallel hash the same way as Hong did 27 years earlier,
>that is, shared no-partition, which is now apparently back in vogue
>due to the recent ubiquity of high core count shared memory systems,
>so that every server looks a bit like a Sequent Symmetry; for example
>Oracle is rumoured to have a "parallel shared hash join" like ours in
>the pipeline). I didn't understand the context or importance of XPRS,
>though, until I read this bit of Hellerstein's "Looking Back at
>Postgres"[2]:
>
>"In principle, parallelism “blows up” the plan space for a query
>optimizer by making it multiply the traditional choices made during
>query optimization (data access, join algorithms, join orders) against
>all possible ways of parallelizing each choice. The basic idea of what
>Stonebraker called “The Wei Hong Optimizer” was to cut the problem in
>two: run a traditional single-node query optimizer in the style of
>System R, and then “parallelize” the resulting single-node query plan
>by scheduling the degree of parallelism and placement of each operator
>based on data layouts and system configuration. This approach is
>heuristic, but it makes parallelism an additive cost to traditional
>query optimization, rather than a multiplicative cost.
>
I think this relies on a huge assumption that all steps in any sequential
plan can be parallelized. Which certainly is not true for PostgreSQL - as
you point out later on the join example. That means the optimal join order
may be different for parallel plan, and so on.
The other thing is that "parallelizations" of different sequential plans
may have different requirements for resources (say, work_mem), so I'd
expect cases when a parallel version of a "worse" sequential plan may end
up being superior thanks to allowing larger number of workers.
>Although “The Wei Hong Optimizer” was designed in the context of
>Postgres, it became the standard approach for many of the parallel
>query optimizers in industry."
>
I assume this quote is from 30 years ago. I wonder if the claim is still
true, on current hardware (including e.g. distributed databases).
>I don't know what to think about the buffer-size-independent
>hypothesis, but the two-phase hypothesis and the claim that is is the
>standard approach caught my attention. Firstly, I don't think the
>hypothesis holds on our system currently, because (for example) we
>lack parallel merge joins and sorts, so you couldn't parallelise such
>serial plans, and yet we'd already have thrown away a hash join based
>plan that would be vastly better in parallel. That might be just an
>implementation completeness problem. I wonder what fundamental
>problems lurk here. (Perhaps the non-availability of globally unique
>partial paths?) Anyway, AFAICS we do the exact thing Hong wanted to
>avoid: we plan parallel queries as extra paths at planning time. We
>don't really suffer too much of a planning explosion though, because
>we don't consider different parallel degrees. If we did, because our
>cost model doesn't include any penalty for resource usage, I suspect
>we'd always go for the maximum number of workers because they're
>'free', which creates a perverse incentive to burn resource (CPU +
>copies of work_mem). Those are all problems Hong solved with
>execution time resource allocation, as part of a bigger picture.
>
I don't know. I'd guess the hardware changed quite a bit, so maybe some of
the assumptions from the paper are too simplistic nowadays? Consider for
example the memory hierarchy - 30 years ago the amount of on-CPU cache was
miniscule, while now we have L1/L2/L3 caches that are tens of megabytes.
It's usually much faster to do sorts that fit into L3, for example.
FWIW I think we'll have to do something about resource acquisition, sooner
or later. It was always quite annoying that we don't really consider
memory consumption of the query as a whole during planning, and parallel
query made it a bit more painful.
>I have no idea what to do about any of this but thought that was an
>interesting bit of our project's history worth sharing. It's really
>humbling to read these old papers. I wonder if we're missing a chance
>to stand on the shoulders of giants.
>
Thanks, I think it's always useful / interesting to look at papers like
this. I don't know if we can use the stuff described in those papers
directly, but maybe we can build on those ideas and see which of the
assumptions are no longer true.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2019-08-23 15:38:25 | Re: Cleanup isolation specs from unused steps |
Previous Message | Stephen Frost | 2019-08-23 14:35:17 | Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS) |