Re: Cursor + upsert (astronomical data)

From: Vitalii Tymchyshyn <vit(at)tym(dot)im>
To: Jiří Nádvorník <nadvornik(dot)ji(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Cursor + upsert (astronomical data)
Date: 2014-07-27 20:09:41
Message-ID: CABWW-d2Y4GVKXLfap2h+bDn3Y0YA_Dj=+OD8xN=NQCem+hQxDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Well, that's why I said to apply regular algorithm to deduplicate after
this step. Basically, what I expect is to have first pass with group by
that do not require any joins and produces "dirty" set of identifiers.

It should do next things:
1) Provide working set of dirty identifiers that has a huge factor less
cardinality than the original observations set.
2) Most of the identifiers can be used as is, only for small fraction you
need to perform additional merge. 22% is actually very good number, it
means only 1/5 of identifiers should be analyzed for merging.

Best regards, Vitalii Tymchyshyn
27 лип. 2014 10:35, користувач "Jiří Nádvorník" <nadvornik(dot)ji(at)gmail(dot)com>
написав:

> Hi Vitalii, thank you for your reply.
>
>
>
> The problem you suggested can in the most pathological way be, that these
> observations are on one line. As you suggested it, the B would be in the
> middle. So A and C are not in 1 arcsec range of each other, but they must
> be within 1 arcsec of their common average coordinates. If the distances
> between A,B,C are 1 arcsec for each, the right solution is to pick B as
> reference identifier and assign A and C to it.
>
>
>
> We already tried the approach you suggest with applying a grid based on
> the Q3C indexing of the database. We were not just rounding the results,
> but using the center of the Q3C “square” in which the observation took
> place. The result was poor however – 22% of the identifiers were closer to
> each other than 1 arcsec. That means that when you crossmatch the original
> observations to them, you don’t know which one to use and you have
> duplicates. The reason for this is that nearly all of the observations are
> from SMC (high density of observations), which causes that you have more
> than 2 “rounded” positions in a row and don’t know which ones to join
> together (compute average coordinates from it). If it is not clear enough I
> can draw it on an image for you.
>
> Maybe the simple round up would have better results because the squares
> are not each the same size and you can scale them only by 2 (2-times
> smaller, or larger square). We used a squre with the side cca 0.76 arcsec
> which approximately covers the 1 arcsec radius circle.
>
>
>
> Oh and one more important thing. The difficulty of our data is not that it
> is 3e8 rows. But in the highest density, there are cca 1000 images
> overlapping. Which kills you when you try to self-join the observations to
> find neighbours for each of them – the quadratic complexity is based on the
> overlappingon the image (e.g. 10000 observations on one image with another
> 999 images overlapping it means 10000 *1000^2).
>
>
>
> Best regards,
>
>
>
> Jiri Nadvornik
>
>
>
> *From:* tivv00(at)gmail(dot)com [mailto:tivv00(at)gmail(dot)com] *On Behalf Of *Vitalii
> Tymchyshyn
> *Sent:* Sunday, July 27, 2014 8:06 AM
> *To:* Jiří Nádvorník
> *Cc:* pgsql-performance(at)postgresql(dot)org
> *Subject:* Re: [PERFORM] Cursor + upsert (astronomical data)
>
>
>
> I am not sure I understand the problem fully, e.g. what to do if there are
> observations A,B and C with A to B and B to C less then treshold and A to C
> over treshold, but anyway.
>
> Could you first apply a kind of grid to your observations? What I mean is
> to round your coords to, say, 1/2 arcsec on each axe and group the results.
> I think you will have most observations grouped this way and then use your
> regular algorithm to combine the results.
>
> Best regards, Vitalii Tymchyshyn
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Rural Hunter 2014-07-28 01:18:59 Re: Very slow planning performance on partition table
Previous Message Marc Mamin 2014-07-27 19:05:22 Re: Cursor + upsert (astronomical data)