Re: Cursor + upsert (astronomical data)

From: Craig James <cjames(at)emolecules(dot)com>
To: Jiří Nádvorník <nadvornik(dot)ji(at)gmail(dot)com>
Cc: Vitalii Tymchyshyn <vit(at)tym(dot)im>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Cursor + upsert (astronomical data)
Date: 2014-07-29 14:46:59
Message-ID: CAFwQ8reKBGki40i9BkB=PDj5F79NQ6rB1P=P9yvQqM_DsLU7gg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi Jiri,

I’m really interested in those [clustering] algorithms and study them. But
> I would need somebody to point me directly at a specific algorithm to look
> at. The main problem with choosing the right one (which couldn’t get over
> even my university teacher) is that you don’t know the number of clusters
> (classification problem) and you don’t know the number of objects in one
> cluster (that isn’t such a big deal), but each cluster has different count
> of objects – which is a big issue because it actually disqualifies all of
> k-means based algorithms (correct me if I’m wrong).
>

I'm not an expert in clustering, I just worked with several people who
developed clustering packages.

One algorithm that's widely used in chemistry and genetics is the
Jarvis-Patrick algorithm. Google for "jarvis patrick clustering" and you'll
find several good descriptions. It has the advantages that it's
deterministic, it works with any distance metric, and it chooses the number
of clusters (you don't have to specify ahead of time). The drawback to
Jarvis-Patrick clustering is that you have to start with a
nearest-neighbors list (i.e. for each item in your data set, you must find
the nearest N items, where N is something like 10-20.) In a brute-force
approach, finding nearest neighbors is O(N^2) which is bad, but if you have
any method of partitioning observations to narrow down the range of
neighbors that have to be examined, the running time can be dramatically
reduced.

One advantage of JP clustering is that once you have the nearest-neighbors
list (which is the time-consuming part), the actual JP clustering is fast.
You can spend some time up front computing the nearest-neighbors list, and
then run JP clustering a number of time with different parameters until you
get the clusters you want.

A problem with J-P clustering in your situation is that it will tend to
merge clusters. If you have two astronomical objects that have overlapping
observations, they'll probable merge into one. You might be able to address
this with a post-processing step that identified "bimodal" or "multimodal"
clusters -- say by identifying a cluster with a distinctly asymmetrical or
eliptical outline and splitting it. But I think any good clustering package
would have trouble with these cases.

There are many other clustering algorithms, but many of them suffer from
being non-deterministic, or from requiring a number-of-clusters parameter
at the start.

That pretty much exhausts my knowledge of clustering. I hope this gives you
a little insight.

Craig

>
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Jeff Janes 2014-07-29 17:27:11 Re: Very slow planning performance on partition table
Previous Message Jiří Nádvorník 2014-07-29 10:11:23 Re: Cursor + upsert (astronomical data)