Re: daitch_mokotoff module

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Dag Lem <dag(at)nimrod(dot)no>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: daitch_mokotoff module
Date: 2022-12-23 13:25:59
Message-ID: 20221223132559.mauqerlf75d7jnuq@alvherre.pgsql
Lists: pgsql-hackers

On 2022-Dec-23, Alvaro Herrera wrote:

> I wonder why do you have it return the multiple alternative codes as a
> space-separated string. Maybe an array would be more appropriate. Even
> on your documented example use, the first thing you do is split it on
> spaces.

I downloaded a list of surnames from
https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
pasted it into a text file, and \copy'ed that into a table. Then I ran
this query:

select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
from apellidos
group by daitch_mokotoff(a)
order by count(*) desc;

so I have a first entry like this

string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
daitch_mokotoff │ 784000
count │ 18

but then there are other entries that carry the same code 784000 as one
of their alternative codes:

string_agg │ Velazco
daitch_mokotoff │ 784500 784000
count │ 1

string_agg │ Palacio
daitch_mokotoff │ 785000 784000
count │ 1

I suppose I need to group these together somehow, and it would make more
sense to do that if the values were arrays.
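
With the current text return value, that grouping could be sketched
roughly as below. This is untested and assumes daitch_mokotoff(text)
returns the alternatives as one space-separated string, as in the patch
version under discussion; the "code" and "surnames" aliases are only
illustrative names.

```sql
-- Split each result into its individual codes and group by single code,
-- so a surname with two alternative codes is counted in both groups.
select code,
       string_agg(a, ' ' order by a) as surnames,
       count(*)
from apellidos
     cross join lateral
       unnest(string_to_array(daitch_mokotoff(a), ' ')) as code
group by code
order by count(*) desc;
```

If the function returned an array, the string_to_array() step would
simply go away.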

If I scroll a bit further down and choose, say, 794000 (a relatively
popular one), then I have this

string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras Peraza Peres Perez Porras Varas Veras
daitch_mokotoff │ 794000
count │ 14

and looking for that code in the result I also get these three

string_agg │ Barca Barco Parco
daitch_mokotoff │ 795000 794000
count │ 3

string_agg │ Borja
daitch_mokotoff │ 790000 794000
count │ 1

string_agg │ Borjas
daitch_mokotoff │ 794000 794400
count │ 1

and then I see that I should also search for possible matches in codes
795000, 790000 and 794400, so that gives me

string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra Perea Vera
daitch_mokotoff │ 790000
count │ 11

string_agg │ Barriga Borge Borrego Burgo Fraga
daitch_mokotoff │ 795000
count │ 5

string_agg │ Borjas
daitch_mokotoff │ 794000 794400
count │ 1

which look closely related (compare "Veras" in the first set to "Vera"
in the later one). If you ignore that pseudo-match, you're likely to
miss possible family relationships.
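
The manual chasing of related codes above could be sketched as a single
lookup. Again this is untested and assumes the space-separated text
return value; 'Borjas' is just the example name from the results above.

```sql
-- Find every surname that shares at least one code with 'Borjas'.
-- && is the array "overlaps" operator: true if the two arrays have
-- any element in common.
select a, daitch_mokotoff(a)
from apellidos
where string_to_array(daitch_mokotoff(a), ' ')
      && string_to_array(daitch_mokotoff('Borjas'), ' ')
order by a;
```

Note this only follows one hop: it finds names sharing a code with
'Borjas', not names that share a code with those matches in turn.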

I suppose if I were a genealogy researcher, I would be helped by having
each of these codes behave as a separate unit, rather than me having to
split the string into the several possible contained values.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees." (E. Dijkstra)
