Re: Google Summer of code 2013

From: "Karel K(dot) Rozhoň" <karel(dot)rozhon(at)gmail(dot)com>
To: pgsql-students(at)postgresql(dot)org
Subject: Re: Google Summer of code 2013
Date: 2013-04-15 15:51:22
Message-ID: 20130415175122.ca9c29b83ccd4d5f4cd490f6@gmail.com
Lists: pgsql-students

First of all, I would like to thank you for your responses and opinions.

Of course I don't see all aspects of this problem, so I cannot tell what would be good for the future. But I have profiled a GROUP BY select, and I believe parallel execution of some of the hashing code could help.

Here is an example:

postgres=# \d charlie
      Table "public.charlie"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer |
 c      | integer |
Indexes:
    "charlie_pkey" PRIMARY KEY, btree (a)

insert into charlie (a,b,c) values (generate_series(1,15000000), 5, generate_series(1,100000));

postgres=# explain select count(*) from (select min(a), max(b) from charlie group by c offset 0) x;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Aggregate  (cost=3196549.18..3196549.19 rows=1 width=0)
   ->  Limit  (cost=3044301.02..3195298.60 rows=100047 width=12)
         ->  GroupAggregate  (cost=3044301.02..3195298.60 rows=100047 width=12)
               ->  Sort  (cost=3044301.02..3081800.29 rows=14999711 width=12)
                     Sort Key: charlie.c
                     ->  Seq Scan on charlie  (cost=0.00..231079.11 rows=14999711 width=12)
(6 rows)

Oprofile output:
CPU: Intel Westmere microarchitecture, speed 2.267e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
85519 38.0101 hash_search_with_hash_value
14009 6.2265 slot_deform_tuple
11808 5.2482 ExecProject
7675 3.4113 slot_getattr
7539 3.3508 advance_aggregates
6932 3.0810 ExecAgg

Of course I know this simple case is only theoretical and the data in real tables are much more complicated, but as far as I can see, almost 40% of the CPU time was spent in a single hashing function: hash_search_with_hash_value.
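To illustrate the idea I have in mind (this is only a toy sketch in Python, not PostgreSQL code, and the function names are my own), the input could be partitioned by a hash of the grouping key so that each worker builds its own private hash table with no locking, and the per-partition results are simply concatenated, since the key sets are disjoint:

```python
# Toy sketch (NOT PostgreSQL code): partition-wise parallel hash
# aggregation for "select min(a), max(b) from charlie group by c".
# Rows are partitioned by hash(c), each worker aggregates its own
# partition into a private hash table, and the disjoint results are
# merged at the end.
from multiprocessing import Pool

def aggregate_partition(rows):
    # rows: list of (a, b, c) tuples belonging to one partition
    groups = {}
    for a, b, c in rows:
        mn, mx = groups.get(c, (a, b))
        groups[c] = (min(mn, a), max(mx, b))
    return groups

def parallel_group_by(rows, nworkers=4):
    # Partition by a hash of the grouping key, so every value of c
    # lands in exactly one worker's hash table.
    partitions = [[] for _ in range(nworkers)]
    for row in rows:
        partitions[hash(row[2]) % nworkers].append(row)
    with Pool(nworkers) as pool:
        results = pool.map(aggregate_partition, partitions)
    merged = {}
    for part in results:
        merged.update(part)  # key sets are disjoint, no conflicts
    return merged

if __name__ == "__main__":
    # Small analogue of the charlie table: a = 1..150000, b = 5,
    # c = a mod 100000, giving exactly 100000 groups.
    data = [(a, 5, a % 100000) for a in range(1, 150001)]
    print(len(parallel_group_by(data)))  # -> 100000
```

The point of the sketch is only that the expensive per-row hash table work parallelizes naturally once the input is partitioned on the grouping key; how such partitioning and worker processes would actually fit into the executor is of course the hard part being discussed.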

On Sun, 14 Apr 2013 21:54:00 -0400
Stephen Frost <sfrost(at)snowman(dot)net> wrote:

> Tomas,
>
> * Tomas Vondra (tv(at)fuzzy(dot)cz) wrote:
> > There's certainly a lot of groundwork to do, and I do share the concern
> > that the project will have to deal with a lot of dirty work (e.g. when
> > transfering data between the processes). But couldn't it be a useful
> > part of the discussion?
>
> A one-off implementation of parallelized hash table building and/or
> usage..? No, I don't see that as particularly relevant to the
> discussion around how to do parallelize queries. There are a ton of
> examples already of parallel hash table building and various other
> independent pieces (parallel sort, parallel aggregation, etc). What is
> needed for parallel query processing in PG is to figure out what we mean
> by it and how to actually implement it. Following that would be making
> the query planner and optimizer aware of it, last would be picking a
> specific parallelized implementation of each node and writing it (and
> that, really, is the 'easy' part in all of this...).
>
> > I don't expect a commitable patch at the end, but rather something that
> > "works" and may be used as a basis for improvements and to build the
> > actual groundwork.
>
> I don't think it'd really help us get any farther with parallel query
> execution. To be honest, I'd be a bit surprised if this hasn't been
> done already and patches posted to the list in the past..
>
> > I think that depends on the workload type. For example for databases
> > handling DWH-like queries, parallel hash aggregate is going to be a
> > major improvement.
>
> DWH is what I deal with day-in and day-out and I certainly agree that
> parallelizing hash builds would be wonderful- but that doesn't mean that
> a patch which implements it without any consideration for the rest of
> the challenges around parallel query execution will actually move us, as
> a project, any closer to getting it. In fact, I'd expect most DWH
> implementations to do what we've done already- massive partitioning and
> parallelizing through multiple client connections.
>
> > Karel mentioned he's currently working on his bachelor thesis, which is
> > about hash tables too. That's another reason why he proposed this topic.
>
> That's wonderful, I'd love to hear about some ways to improve our
> hashing system (I've even proposed one modification a few days ago that
> I'd like to see tested more). I believe that costing around hashing
> needs to be improved too.
>
> Parallel-anything is a 'sexy' project, but unless it's focused on how we
> answer the hard questions around how do we do parallel work efficiently
> while maintaining scalability and portability then it's not moving us
> forward.
>
> Thanks,
>
> Stephen

--
Karel K. Rozhoň <karel(dot)rozhon(at)gmail(dot)com>
