Quick Links

Re: GSoC project: K-medoids clustering in Madlib

From:	Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To:	Maxence AHLOUCHE <maxence(dot)ahlouche(at)gmail(dot)com>
Cc:	"Iyer, Rahul" <Rahul(dot)Iyer(at)emc(dot)com>, "pgsql-students(at)postgresql(dot)org" <pgsql-students(at)postgresql(dot)org>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>, "Philip, Sujit" <Sujit(dot)Philip(at)emc(dot)com>
Subject:	Re: GSoC project: K-medoids clustering in Madlib
Date:	2013-04-21 18:03:07
Message-ID:	46950856-D053-4ED5-8E4C-9F620E15C981@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-students

Sent from my iPad

On 21-Apr-2013, at 23:16, Maxence AHLOUCHE <maxence(dot)ahlouche(at)gmail(dot)com> wrote:

> Hello!
>
> I've pretty much improved my visualizer:
> It is now able to color the points according to their assigned cluster. It has occured to me last night that it was the main inconvenient of the k-means algorithm: a point is assigned to the nearest centroid's cluster, which is easy to compute.
> It's now able to load data from a file, if it has been previously generated, so it is no longer mandatory to specify the number of clusters wished when using "old" data.
> It displays the original clustering and the k-means++ clustering for comparison purposes, and it is very easy to add new clusterings.
> It is available on GitHub! https://github.com/viodlen/clustering_visualizer
> Still, there is room for improvements:
>
> It only tests with 2-dimensional spaces. This won't evolve, as it would get difficult to visualize.
> It only uses the euclidean distance. This can be fixed, but would be heavy to implement, and probably ugly (a hashmap to match the python's distance function with the MADlib's one).
> For now, the points are reparted in the clusters folowing a gaussian law (not sure of my vocabulary here). This can be easily fixed, and will probably be in a future version.
> In attachment is an output example. It shows the poor results of the k-means algorithm on contiguous clusters.
>
> It also shows the interest of the k-means++ algorithm: the small light-blue cluster (in the original clustering) is correctly identified as a complete cluster by the k-means++ algorithm, when it was only a part of a bigger cluster with the simple k-means algorithm.
>
>
> The characteristics of the k-means algorithm make it easy to calculate the clusters only from the points and the centroids, but this won't be true for the k-medoids algorithm. So, sadly, if I implement the latter, it won't be possible to keep the same function signature: some more data will have to be returned, along with the centroids list.
>
>
>
> I've also wondered if it would be useful to implement the clustering algorithms for non-float vectors (for example, strings), provided the user gives a distance function for this type?
>
>

Interesting! Good work!

Could you draw up a summary, giving your findings about the performance of different algorithms,and which one should be implemented,or both(k means++ vs k medoids).

Regards,

Atri

In response to

Re: GSoC project: K-medoids clustering in Madlib at 2013-04-21 17:46:46 from Maxence AHLOUCHE

Responses

Re: GSoC project: K-medoids clustering in Madlib at 2013-04-21 21:21:14 from Maxence AHLOUCHE

Browse pgsql-students by date

	From	Date	Subject
Next Message	Maxence AHLOUCHE	2013-04-21 21:21:14	Re: GSoC project: K-medoids clustering in Madlib
Previous Message	Maxence AHLOUCHE	2013-04-21 17:46:46	Re: GSoC project: K-medoids clustering in Madlib