From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-11 18:18:41
Message-ID: CAJvUf_vE4N0aJwMAN-rhLbgdwJz+t0NyjEHa_zMdTp7A4LbtzA@mail.gmail.com
Lists: pgsql-general
Wow, many thanks!
I had thought about the ORDER BY and LIMIT because it is the natural way to
express the problem,
but I had discarded it for fear of such bad complexity
(theoretically, for each row of B, compute the distance to every other row
of A!).
And it's okay if 2 rows from B share the same joined row from A, because
when interpolating the result will be different.
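As an aside, here is a sketch (my own, not part of the tested solution) of how the 1st-order interpolation weights from the original question could be computed with the same kind of per-row lookups; the column names prev_t, next_t and w_next are made up for illustration:

```sql
--Sketch: previous/next neighbour in a for each row of b, plus the
--linear-interpolation weight of the next neighbour (weight of the
--previous one is 1 - w_next).
SELECT t, prev_t, next_t,
       (t - prev_t) / NULLIF(next_t - prev_t, 0) AS w_next
FROM (
  SELECT b.t,
         (SELECT a.t FROM a WHERE a.t <= b.t ORDER BY a.t DESC LIMIT 1) AS prev_t,
         (SELECT a.t FROM a WHERE a.t >  b.t ORDER BY a.t      LIMIT 1) AS next_t
  FROM b
) AS nb;
```

For t = 5.8 with a = {1, 5, 6} this gives prev_t = 5, next_t = 6 and w_next = 0.8, matching the weights in the original question; NULLIF guards against division by zero when both neighbours coincide.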
Here is the test env with realistic numbers; your solution is very fast, I
have to raise my hat (4 sec!)
-------------------------------------------------------
--useful function to fill a column with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER)
RETURNS text AS $$
SELECT array_to_string(
  ARRAY(
    SELECT substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
                     --floor() keeps the index in 1..36; a bare ::int cast
                     --rounds and can yield 37, producing an empty character
                     FROM floor(random()*36)::int + 1 FOR 1)
    FROM generate_series(1, $1)
  )
, '')
$$ LANGUAGE sql;
--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;
create table a(t float, data text);
create table b(t float, data text);
CREATE INDEX ON a (t);
CREATE INDEX ON b (t);
--filling tables
WITH the_serie AS (
  SELECT s + random()/2 AS s, rc_random_string(100) AS data
  FROM generate_series(1,100000) AS s
)
INSERT INTO a SELECT s, data
FROM the_serie;
WITH the_serie AS (
  SELECT s + (random()-0.5)*2 AS s, rc_random_string(100) AS data
  FROM generate_series(1,100000) AS s
)
INSERT INTO b SELECT s, data
FROM the_serie;
ANALYZE a;
ANALYZE b;
--computing result
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT a.t AS a_t, b.t AS b_t
FROM (
  SELECT t,
         --pick whichever neighbour is temporally closest; a CASE is needed
         --here, since least(t, mint, maxt) would always pick the lower
         --neighbour (maxt), never the true nearest row
         CASE WHEN mint IS NULL THEN maxt
              WHEN maxt IS NULL THEN mint
              WHEN mint - t <= t - maxt THEN mint
              ELSE maxt
         END AS t2
  FROM (
    SELECT t,
           (SELECT t FROM a WHERE a.t >= b.t ORDER BY a.t LIMIT 1) AS mint,
           (SELECT t FROM a WHERE a.t < b.t ORDER BY a.t DESC LIMIT 1) AS maxt
    FROM b
  ) AS tmp
) AS tmp2
INNER JOIN a ON (tmp2.t2 = a.t)
INNER JOIN b ON (tmp2.t = b.t);
--------------------------------------------------------
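For what it's worth, on PostgreSQL 9.3+ the same nearest-row lookup could also be sketched with a LATERAL subquery (this is an alternative formulation, not the solution timed above):

```sql
--Sketch: nearest a-row per b-row via LATERAL (PostgreSQL 9.3+).
SELECT b.t AS b_t, nearest.t AS a_t
FROM b
CROSS JOIN LATERAL (
  SELECT a.t
  FROM a
  ORDER BY abs(a.t - b.t)  --true temporal distance
  LIMIT 1
) AS nearest;
```

Note that ORDER BY abs(...) cannot use the index on a.t directly, so for large tables the two index-friendly lookups (one >=, one <, each with LIMIT 1) as in the query above should stay faster.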
2014-04-11 19:16 GMT+02:00 Andy Colson <andy(at)squeakycode(dot)net>:
> On 4/11/2014 7:50 AM, Rémi Cura wrote:
>
>> Hey dear List,
>>
>> I'm looking for some advice about the best way to perform a "fuzzy"
>> join, that is joining two table based on approximate matching.
>>
>> It is about temporal matching
>> given a table A with rows containing data and a control_time (for
>> instance 1; 5; 6; .. sec, not necessarily rounded or evenly-spaced)
>>
>> given another table B with lines on no precise timing (eg control_time =
>> 2.3 ; 5.8 ; 6.2 for example)
>>
>> How to join every row of B to A based on
>> min(@(A.control_time-B.control_time))
>> (that is, for every row of B, get the row of A that is temporaly the
>> closest),
>> in an efficient way?
>> (to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)
>>
>> Optionally, how to get interpolation efficiently (meaning one has to get
>> the previous time and next time for 1 st order interpolation, 2 before
>> and 2 after for 2nd order interpolation, and so on)?
>> (to be explicit 5.8 would match to 5 and 6, the weight being 0.2 and 0.8
>> respectively)
>>
>>
>> Currently my data is spatial so I use PostGIS functions to interpolate a
>> point on a line, but it is far from efficient or general, and I don't
>> have control on the interpolation (only the spatial values are interpolated).
>>
>>
>> Cheers,
>> Rémi-C
>>
>
>
> Ok, here is a just sql way. No ranges. No idea if its right. A first
> pass, so to speak.
>
>
>
> create table a(t float, data text);
> create table b(t float, data text);
>
> insert into a values (1), (5), (6);
> insert into b values (2.3), (5.8), (6.2);
>
>
> select a.t, b.t
> from (
> select t, least( least(t, mint), least(t, maxt)) as t2 from (
> select t,
> (select t from a where a.t >= b.t order by a.t limit 1) as mint,
> (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
> from b
> ) as tmp
> ) as tmp2
> inner join a on (tmp2.t2 = a.t)
> inner join b on (tmp2.t = b.t)
>
>
>
>
> The middle part is the magic:
>
> select t,
> (select t from a where a.t >= b.t order by a.t limit 1) as mint,
> (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
> from b
>
> The rest is just to make it usable. If t is indexed, it'll probably be
> fast too.
>
> -Andy
>
>
>