From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New 9.6 external sort guidance around temp_tablespaces and maintenance_work_mem
Date: 2016-04-30 19:19:02
Message-ID: CAM3SWZQdB=+ZkyQYqsXroKFQTYM4rN4jpys_Ek4J5=bHnycMDQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 30, 2016 at 9:30 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> I think we can expand "21.6. Tablespaces" to describe the implications
>> of these new performance characteristics. I'd like to hear opinions on
>> how to approach that before proposing a patch, though.

> This seems very detailed. I think we need much broader coverage of how
> the existing GUC variables affect performance before we could cover
> this.

While my justification is very detailed, I think that the practical
guidance itself is fairly simple. In fact, making sure that people
don't miss that things have become *simpler* across the board is the
primary point that I'd like to make to users.

As of 9.6, external sorting finally benefits from very large amounts
of memory, certainly in the multiple-gigabyte range, whereas before,
performance steadily declined as work_mem increased past a fairly
small amount (less than 100MB in my tests). So, DBAs faced competing,
even contradictory considerations: keep work_mem high enough to make
most sorts internal, but if a sort must be external, make sure that
the work_mem available to it is quite low (say, less than 100MB). That
set of characteristics is almost impossible to tune for. Other
operations that use temporary memory bound in size by work_mem have
always benefited from increasing work_mem settings in a more or less
predictable, linear way, so the *general* picture about what to do
becomes far clearer (we don't really *need* to talk about sorting at
all).
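To make that concrete (the table, column, and numbers here are all
invented for illustration), EXPLAIN ANALYZE reports whether a given
sort stayed in memory or spilled to disk in its "Sort Method" line:

  SET work_mem = '4MB';
  EXPLAIN (ANALYZE) SELECT * FROM orders ORDER BY created_at;
    ...
    Sort Method: external merge  Disk: 102400kB

  SET work_mem = '1GB';
  EXPLAIN (ANALYZE) SELECT * FROM orders ORDER BY created_at;
    ...
    Sort Method: quicksort  Memory: 524288kB

The point is that in 9.6 the first case is no longer something to
avoid at almost any cost; raising work_mem now helps external sorts in
roughly the same predictable way it helps everything else.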

Presumably due to the old issues with tuplesort, the closest the docs
get to recommending higher work_mem or maintenance_work_mem settings
is: "Larger [maintenance_work_mem] settings might improve performance
for vacuuming and for restoring database dumps". That's it! Since the
performance characteristics of external sorting are now roughly in
line with everything else, why continue to make such a weak statement
in 9.6? It's not hard to understand why we originally equivocated
here, but things have changed.
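For example (index and table names invented), utility commands that
sort, such as CREATE INDEX, are bounded by maintenance_work_mem rather
than work_mem, and in 9.6 giving them several gigabytes is a perfectly
reasonable starting point:

  SET maintenance_work_mem = '2GB';
  CREATE INDEX orders_created_at_idx ON orders (created_at);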

I had hardly realized that the docs are only lukewarm on the idea that
increasing work_mem will ever help *anything* at all. It's easy to
fail to notice that when you're an expert. We provide *zero* guidance
on how to tune work_mem or maintenance_work_mem. Surely we can do
better than that.
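As a starting point, we could at least document something like
log_temp_files (the threshold below is arbitrary), which shows which
queries are actually spilling to disk, and by how much:

  ALTER SYSTEM SET log_temp_files = '10MB';
  SELECT pg_reload_conf();
  -- subsequently, in the server log:
  -- LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp12345.0", size 210796544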

A secondary point I'd like to make is that when no further benefit
can be observed from increasing work_mem, it's well worth considering
whether more I/O bandwidth for temp files would help. Obviously, this
assumes the DBA avoids swapping when using all that memory, for
example by setting vm.swappiness appropriately on Linux, while also
avoiding OOMs. Once increasing work_mem stops helping, sequential I/O
should be strongly considered as the bottleneck to target. That's a
nice problem to have, because fairly inexpensive HDDs dedicated to
temp files can push that point out considerably.
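A sketch of what that looks like in practice (the mount point and
tablespace name here are invented for illustration):

  -- /mnt/pgtemp must already exist, be empty, and be owned by the
  -- postgres OS user
  CREATE TABLESPACE temp_disks LOCATION '/mnt/pgtemp';
  ALTER SYSTEM SET temp_tablespaces = 'temp_disks';
  SELECT pg_reload_conf();

With that, sort temp files land on the dedicated disks rather than
competing with the main data directory for I/O bandwidth.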

So, we can talk about this stuff without necessarily even mentioning
external sorting.

I didn't mention it before now, but as it happens the 2009 ACM Queue
article I already linked to (https://queue.acm.org/detail.cfm?id=1563874)
mentions that the author found performance fell sharply past a certain
point when running a large aggregate SQL query on Postgres:

"""
Invoking the DBMS’s built-in EXPLAIN facility revealed the problem:
while the query planner chose a reasonable hash table-based
aggregation strategy for small tables, on larger tables it switched to
sorting by grouping columns—a viable, if suboptimal strategy given a
few million rows, but a very poor one when facing a billion.
"""

I think that this must be describing tuplesort's previous use of
replacement selection to sort runs -- it might have been the case that
an internal sort for the GroupAggregate was "viable, if suboptimal"
then, but an external sort was just unacceptable. 9.6 really blurs the
distinction between internal and external sorts a lot of the time;
while testing the 9.6 sorting work, it was common to see a big
external sort with ~5 runs take no longer than 110% of the time of an
equivalent internal sort. I would like to help DBAs retain those
benefits when they have hundreds of gigabytes or even terabytes of
data to aggregate.

If you look at the graph in that article, it isn't that hard to
imagine that a well-tuned implementation could avoid most of that
*superlinear* growth in query runtime; growth that steep isn't
predicted by any theoretical model (a sort-based plan should scale as
roughly N log N, which is nearly linear at this scale). The article is
incorrect to state "There is no pathology here". That said, this
doesn't undermine the author's argument too much; I find it rather
convincing overall. The fact that a single node will eventually fall
short doesn't mean it isn't important to push your single node
further, to keep an unsharded Postgres instance the right choice for
longer (by getting the full benefit of the hardware). Getting this
right could allow unsharded Postgres to go a lot further (parallelism
is the final part of this), which could make all the difference for
some users.

--
Peter Geoghegan
