Re: [GSoC] Clustering in MADlib - status update

From: Hai Qian <hqian(at)gopivotal(dot)com>
To: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andreas Scherbaum <ascherbaum(at)gopivotal(dot)com>, Caleb Welton <cwelton(at)gopivotal(dot)com>, Sujit Philip <sphilip(at)gopivotal(dot)com>, Marc Pantel <Marc(dot)Pantel(at)enseeiht(dot)fr>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GSoC] Clustering in MADlib - status update
Date: 2014-06-02 17:16:30
Message-ID: CACGxcfQ_KurZ1gtaa8HzqKkdQw52Qi2RtBQR01_P=HuMF-byXw@mail.gmail.com
Lists: pgsql-hackers

I like the second option for refactoring the code. I think it is doable.

And where is your code on GitHub?

Hai

--
*Pivotal <http://www.gopivotal.com/>*
A new platform for a new era

On Sun, Jun 1, 2014 at 1:06 PM, Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com> wrote:

> Hi all!
>
> I've pushed my report for this week on my repo [0]. Here is a copy!
> Attached is the patch containing my work for this week.
> Week 2 - 2014/06/01
>
> This week, I worked on the beginning of the kmedoids module.
> Unfortunately, I was supposed to have something working by last Wednesday,
> and it is still not ready, mostly because I lost time this week to being
> sick and to packing all my stuff in preparation for relocation.
>
> The good news now: this week is my last school (exam) week, and that means
> full-time GSoC starting next Monday! Also, I've studied the kmeans module
> quite thoroughly, and I finally understand how it all works, with the
> exception of one part: the enormous SQL query used to update the
> IterationController.
>
> For kmedoids, I've abandoned the idea of writing the loop myself and have
> decided instead to copy kmeans as much as possible, as that seems easier
> than doing it all from scratch. The only part that remains to be adapted
> is that big SQL query I haven't fully understood yet. I've asked Atri for
> help, but surely the help of an experienced MADlib hacker would speed
> things up :) Atri and I would also like to discuss this in a VoIP
> meeting, to ease communication. If anyone wants to join, you're welcome!
>
> As for the technology we'll use, I have a Mumble server running somewhere,
> if that suits everyone. Otherwise, suggest something!
>
> I am available from Monday 2 at 3 p.m. (UTC) to Wednesday 4 at 10 a.m.
> (exam weeks are quite light).
>
> This week, I also faced the first design decisions I have to make.
> For kmedoids, the centroids are points of the dataset. So, if I wanted to
> identify them precisely, I'd need to use their ids, but that would mean
> having a prototype different from the kmeans one. So, for now, I've decided
> to use only the points' coordinates, hoping I will not run into trouble. If
> I ever do, switching to ids shouldn't be too hard. Also, if users want to
> input initial medoids, they can input whatever points they want, be they
> part of the dataset or not. After the first iteration, the centroids will
> be points of the dataset anyway (maybe I could just select the points
> nearest to the coordinates given as initial centroids).
>
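The idea in the last parenthesis, snapping user-supplied initial medoids to the nearest dataset points, can be sketched in a few lines. This is only an in-memory Python illustration, not MADlib code: the function name `snap_to_dataset` and the sample points are hypothetical, and a real implementation would presumably do this in SQL against the points table.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def snap_to_dataset(points, initial_medoids):
    """Replace each user-supplied medoid with the closest dataset point.

    Hypothetical helper for illustration only; points are identified
    by their coordinates, as proposed in the message above.
    """
    snapped = []
    for medoid in initial_medoids:
        # Pick the dataset point at minimal Euclidean distance.
        nearest = min(points, key=lambda p: dist(p, medoid))
        snapped.append(nearest)
    return snapped


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
print(snap_to_dataset(points, [(0.4, 0.1), (5.4, 5.2)]))
# prints [(0.0, 0.0), (5.0, 5.0)]
```

One thing this makes visible: if the dataset contains duplicate points, coordinates alone cannot tell them apart, which is the case where switching to ids would matter.
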
> Second, I'll need to refactor the code in kmeans and kmedoids, as these
> two modules are very similar. There are several options for this:
>
> 1. One big "clustering" module containing everything
>    clustering-related (ugly but easy option);
> 2. A "clustering" module and "kmeans", "kmedoids", "optics", "utils"
>    submodules (the best imo, but I'm not sure it's doable);
> 3. A "clustering_utils" module at the same level as the others (less
>    ugly than the first one, but easy too).
>
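To make the similarity between the two modules concrete, here is a rough sketch of what a shared "utils" piece could hold under option 2: the assignment step and the iteration loop are common, and only the per-cluster update differs. All names (`assign`, `mean_update`, `medoid_update`, `cluster_points`) are hypothetical, and this is plain in-memory Python rather than the SQL-driven loop MADlib actually uses.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def assign(points, centers):
    """Shared step: group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        clusters[idx].append(p)
    return clusters


def mean_update(cluster):
    """kmeans-style update: coordinate-wise mean of the cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def medoid_update(cluster):
    """kmedoids-style update: the member with the smallest total
    distance to the other members of the cluster."""
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))


def cluster_points(points, centers, update, max_iter=20):
    """Shared loop: alternate assignment and update until stable."""
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # Keep the old center when a cluster ends up empty.
        new_centers = [update(c) if c else old
                       for c, old in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


data = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(cluster_points(data, [(0.0, 0.0)], mean_update))    # [(2.0, 0.0)]
print(cluster_points(data, [(0.0, 0.0)], medoid_update))  # [(1.0, 0.0)]
```

With a single cluster, the mean lands at (2.0, 0.0), which is not a dataset point, while the medoid is the existing point (1.0, 0.0).
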
> Any opinions?
>
> Next week, I'll get a working kmedoids module, do some refactoring, and
> then add the extra methods, similar to what's done in kmeans, for the
> different seedings. Once that's done, I'll make it compatible with all
> three ports (I'm currently producing Postgres-only code, as it's the
> easiest for me to test), and write the tests and doc. The deadline for this
> last step is in two weeks; I don't know yet whether I'll be on time. It
> will depend on how fast I can get kmedoids working, and how fast I'll go
> once I'm working on GSoC full time.
>
> Finally, don't hesitate to tell me if you think my decisions are wrong;
> I'm glad to learn :)
> [0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst
>
>
> --
> Maxence Ahlouche
> 06 06 66 97 00
>
