Re: Cursor + upsert (astronomical data)

From: Craig James <cjames(at)emolecules(dot)com>
To: Jiří Nádvorník <nadvornik(dot)ji(at)gmail(dot)com>
Cc: Vitalii Tymchyshyn <vit(at)tym(dot)im>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Cursor + upsert (astronomical data)
Date: 2014-07-27 15:35:25
Message-ID: CAFwQ8rfW3B9JHqQnbQHYjdez=qOKF6a4Lc3pw=p3_ZQ6WKpitw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Jiri,

If you haven't looked at clustering algorithms yet, you might want to do
so. Your problem is a special case of clustering, where you have a large
number of small clusters. A good place to start is the overview on
Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis

A lot of people have worked extensively on this problem, and you might find
a good solution, or at least some ideas to guide your own algorithm. In my
field (chemistry), researchers often need to cluster 10^6 to 10^7 chemical
compounds, and a great deal of research has gone into efficient ways to do
so.

Craig

On Sun, Jul 27, 2014 at 7:35 AM, Jiří Nádvorník <nadvornik(dot)ji(at)gmail(dot)com>
wrote:

> Hi Vitalii, thank you for your reply.
>
>
>
> The problem you suggested can in the most pathological way be, that these
> observations are on one line. As you suggested it, the B would be in the
> middle. So A and C are not in 1 arcsec range of each other, but they must
> be within 1 arcsec of their common average coordinates. If the distances
> between A,B,C are 1 arcsec for each, the right solution is to pick B as
> reference identifier and assign A and C to it.
>
>
>
> We already tried the approach you suggest with applying a grid based on
> the Q3C indexing of the database. We were not just rounding the results,
> but using the center of the Q3C “square” in which the observation took
> place. The result was poor however – 22% of the identifiers were closer to
> each other than 1 arcsec. That means that when you crossmatch the original
> observations to them, you don’t know which one to use and you have
> duplicates. The reason for this is that nearly all of the observations are
> from SMC (high density of observations), which causes that you have more
> than 2 “rounded” positions in a row and don’t know which ones to join
> together (compute average coordinates from it). If it is not clear enough I
> can draw it on an image for you.
>
> Maybe the simple round up would have better results because the squares
> are not each the same size and you can scale them only by 2 (2-times
> smaller, or larger square). We used a squre with the side cca 0.76 arcsec
> which approximately covers the 1 arcsec radius circle.
>
>
>
> Oh and one more important thing. The difficulty of our data is not that it
> is 3e8 rows. But in the highest density, there are cca 1000 images
> overlapping. Which kills you when you try to self-join the observations to
> find neighbours for each of them – the quadratic complexity is based on the
> overlappingon the image (e.g. 10000 observations on one image with another
> 999 images overlapping it means 10000 *1000^2).
>
>
>
> Best regards,
>
>
>
> Jiri Nadvornik
>
>
>
> *From:* tivv00(at)gmail(dot)com [mailto:tivv00(at)gmail(dot)com] *On Behalf Of *Vitalii
> Tymchyshyn
> *Sent:* Sunday, July 27, 2014 8:06 AM
> *To:* Jiří Nádvorník
> *Cc:* pgsql-performance(at)postgresql(dot)org
> *Subject:* Re: [PERFORM] Cursor + upsert (astronomical data)
>
>
>
> I am not sure I understand the problem fully, e.g. what to do if there are
> observations A,B and C with A to B and B to C less then treshold and A to C
> over treshold, but anyway.
>
> Could you first apply a kind of grid to your observations? What I mean is
> to round your coords to, say, 1/2 arcsec on each axe and group the results.
> I think you will have most observations grouped this way and then use your
> regular algorithm to combine the results.
>
> Best regards, Vitalii Tymchyshyn
>

--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Rural Hunter 2014-07-27 15:44:09 Re: Very slow planning performance on partition table
Previous Message Jiří Nádvorník 2014-07-27 14:35:21 Re: Cursor + upsert (astronomical data)