Re: things I learned from working on memory allocation

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: things I learned from working on memory allocation
Date: 2014-07-07 21:37:32
Message-ID: CAM3SWZQkyurkShnsL_vQ6D2stCx-f6XxPvTc776Yidn_9DDhAw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> 5. It's tempting to look at other ways of solving the parallel sort
> problem that don't need an allocator - perhaps by simply packing all
> the tuples into a DSM one after the next. But this is not easy to do,
> or at least it's not easy to do efficiently. Tuples enter tuplesort.c
> through one of the tuplesort_put*() functions, most of which end up
> calling one of the copytup_*() functions. But you can't easily just
> change those functions to stick the tuple someplace else, because they
> don't necessarily get the address of the space to be used from palloc
> directly. In particular, copytup_heap() calls
> ExecCopySlotMinimalTuple(), and copytup_cluster() calls
> heap_copytuple(). heap_copytuple() is simple enough that you might be
> able to finagle a new API that would make it work, but I'm not sure
> what we could really do about ExecCopySlotMinimalTuple() except call
> it and then copy the result. Perhaps that'd be OK for a first
> version.

Perhaps. If my experience with sort support for text is anything to go
on, the cost of copying over the tuples isn't really that high
relative to the cost of performing the sort, particularly when there
are many tuples to sort. OTOH, I'd be concerned about the cost of main
memory accesses for pass-by-reference types during parallel tuple
sorts in particular. The same concern applies to cases where the tuple
proper must be accessed, although presumably that occurs less
frequently.
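
To put that in concrete terms, the "copy the result" fallback Robert
describes might look something like the sketch below for the heap
case. The SharedTupleArena struct and the bump-style allocation within
the DSM segment are invented here purely for illustration; the only
real point is that ExecCopySlotMinimalTuple() still does its palloc
copy as it does today, and we'd pay one extra memcpy into shared
memory on top of it:

#include "postgres.h"

#include "access/htup_details.h"   /* MinimalTuple, heap_free_minimal_tuple() */
#include "executor/tuptable.h"     /* TupleTableSlot, ExecCopySlotMinimalTuple() */

/* Illustration only: a bump allocator over a DSM segment's address range. */
typedef struct SharedTupleArena
{
    char       *base;              /* dsm_segment_address() of the segment */
    Size        used;              /* bytes consumed so far */
    Size        size;              /* total usable bytes */
} SharedTupleArena;

/*
 * Accept one tuple into the shared sort area: copy it locally the way
 * copytup_heap() effectively does today, then copy it again into the
 * DSM segment and throw the local copy away.
 */
static MinimalTuple
puttuple_into_dsm(SharedTupleArena *arena, TupleTableSlot *slot)
{
    MinimalTuple    tuple = ExecCopySlotMinimalTuple(slot); /* palloc'd copy */
    Size            len = tuple->t_len;
    MinimalTuple    shared;

    if (arena->used + MAXALIGN(len) > arena->size)
        elog(ERROR, "out of shared sort memory");   /* or switch to tape sort */

    shared = (MinimalTuple) (arena->base + arena->used);
    memcpy(shared, tuple, len);
    arena->used += MAXALIGN(len);

    heap_free_minimal_tuple(tuple);                 /* discard the local copy */
    return shared;
}

The second memcpy is the cost we'd be accepting for a first version;
as I said, I suspect it's in the noise next to the comparisons
themselves.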

The LLVM project's new libc++ library remarks on a similar issue in
its rationale for yet another C++ standard library: "For example, it
is generally accepted that building std::string using the "short
string optimization" instead of using Copy On Write (COW) is a
superior approach for multicore machines...". It seems likely that
they're mostly referring to the apparent stagnation of memory
bandwidth per core [1], [2]. To get the real benefit of a parallel
sort, we must do everything we can to avoid having cores/workers sit
idle during the sort while waiting on memory access. I believe that
that's the bigger problem.
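
To illustrate the access pattern I'm worried about (nothing below is
proposed code; the struct and function names are made up), consider
what each comparison has to touch when the sort key travels inline
with the array element being sorted versus when it has to be fetched
through a pointer:

#include <string.h>

/* Key stored inline in the array element, as with SSO or an abbreviated key. */
typedef struct
{
    char        key[16];        /* fixed-width, padded key bytes travel with the element */
    void       *tuple;          /* pointer to the tuple proper */
} InlineKeyEntry;

/* Key stored elsewhere, as with COW strings or pass-by-reference datums. */
typedef struct
{
    char       *key;            /* key bytes live somewhere else in memory */
    void       *tuple;
} PointerKeyEntry;

static int
cmp_inline(const void *a, const void *b)
{
    /* both keys sit in cache lines the sort is already streaming through */
    return memcmp(((const InlineKeyEntry *) a)->key,
                  ((const InlineKeyEntry *) b)->key, 16);
}

static int
cmp_pointer(const void *a, const void *b)
{
    /* each comparison chases two pointers, likely stalling on main memory */
    return strcmp(((const PointerKeyEntry *) a)->key,
                  ((const PointerKeyEntry *) b)->key);
}

Either array can be handed to qsort() the same way; the difference is
purely where the key bytes live while several workers are all
hammering memory at once.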

[1] http://www.admin-magazine.com/HPC/Articles/Finding-Memory-Bottlenecks-with-Stream
[2] https://software.intel.com/en-us/articles/detecting-memory-bandwidth-saturation-in-threaded-applications

--
Peter Geoghegan
