Re: WIP: [[Parallel] Shared] Hash

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(at)enterprisedb(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: WIP: [[Parallel] Shared] Hash
Date: 2017-03-27 01:50:22
Message-ID: CAEepm=0hUD+JfGLeFrdLU+80zQEqHA1SC7bu84CMbLERVLTCag@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 27, 2017 at 12:12 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Sun, Mar 26, 2017 at 3:41 PM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>>> 1. Segments are what buffile.c already calls the individual
>>> capped-at-1GB files that it manages. They are an implementation
>>> detail that is not part of buffile.c's user interface. There seems to
>>> be no reason to change that.
>>
>> After reading your next email I realised this is not quite true:
>> BufFileTell and BufFileSeek expose the existence of segments.
>
> Yeah, that's something that tuplestore.c itself relies on.
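
For anyone following along, the seam shows up as the (fileno, offset)
pair in the interface. Here's a minimal sketch of a caller saving and
restoring a position -- illustrative only, not a verbatim excerpt from
tuplestore.c, though BufFileTell, BufFileSeek and elog are the real
interfaces; assume "file" is an already-opened BufFile *:

    int     fileno;
    off_t   offset;

    /* Remember the current position as a (segment, offset) pair. */
    BufFileTell(file, &fileno, &offset);

    /* ... do other reads or writes ... */

    /* Return to the remembered position; 0 means success. */
    if (BufFileSeek(file, fileno, offset, SEEK_SET) != 0)
        elog(ERROR, "could not seek in temporary file");

Since fileno counts the underlying 1GB segments, any caller that
stores those pairs is coupled to the segment size, which is how it
leaks out of buffile.c's interface.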
>
> I always thought that the main practical reason why we have BufFile
> multiplex 1GB segments concerns use of temp_tablespaces, rather than
> considerations that matter only when using obsolete file systems:
>
> /*
> * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
> * The reason is that we'd like large temporary BufFiles to be spread across
> * multiple tablespaces when available.
> */
>
> Now, I tend to think that most installations that care about
> performance would be better off using RAID to stripe their one temp
> tablespace file system. But, I suppose this still makes sense when you
> have a number of file systems that happen to be available, and disk
> capacity is the main concern. PHJ uses one temp tablespace per worker,
> which I further suppose might not be as effective in balancing disk
> space usage.

I was thinking about IO bandwidth balance rather than size. If you
rotate through tablespaces segment-by-segment, won't you be exposed to
phasing effects that could leave disk arrays idle for periods of time?
Whereas if you assign tablespaces to participants, you can only get
idle arrays if you have fewer participants than tablespaces.

This seems like a fairly complex subtopic and I don't have a strong
view on it. Clearly you could rotate through tablespaces on the basis
of participant, partition, segment, some combination, or something
else. Doing it by participant seemed to me to be the least prone to
IO imbalance caused by phasing effects (= segment based) or data
distribution (= partition based), of the options I considered when I
wrote it that way.
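
To make the options concrete, here's a hypothetical sketch of the
three rotation policies (these names are made up for illustration and
don't appear in the actual patch):

    /*
     * Hypothetical: pick a temp tablespace under each policy discussed
     * above.  None of these identifiers exist in the PHJ patch.
     */
    static int
    choose_temp_tablespace(int participant_number,
                           int partition_number,
                           int segment_number,
                           int num_tablespaces)
    {
        /*
         * Per-segment rotation: participants advancing in lockstep all
         * land on segment_number % num_tablespaces at the same time,
         * so the other arrays can sit idle (the phasing effect).
         */
        /* return segment_number % num_tablespaces; */

        /*
         * Per-partition rotation: balance depends on how the data is
         * distributed across partitions.
         */
        /* return partition_number % num_tablespaces; */

        /*
         * Per-participant assignment: an array can only go idle when
         * there are fewer participants than tablespaces.
         */
        return participant_number % num_tablespaces;
    }

For example, with two tablespaces and four participants, per-segment
rotation can send all four to the same array whenever they happen to
advance in phase, while per-participant assignment keeps two
participants on each array throughout.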

Like you, I also tend to suspect that people would be more likely to
use RAID-type technologies to stripe things like this for both
bandwidth and space reasons these days. Tablespaces seem to make more
sense as a way of separating different classes of storage
(fast/expensive, slow/cheap, etc.), not as an IO or space striping
technique. I may be way off base there though...

--
Thomas Munro
http://www.enterprisedb.com
