From: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
---|---|
To: | Dag Lem <dag(at)nimrod(dot)no> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: daitch_mokotoff module |
Date: | 2022-12-23 13:25:59 |
Message-ID: | 20221223132559.mauqerlf75d7jnuq@alvherre.pgsql |
Lists: | pgsql-hackers |
On 2022-Dec-23, Alvaro Herrera wrote:
> I wonder why you have it return the multiple alternative codes as a
> space-separated string. Maybe an array would be more appropriate. Even
> on your documented example use, the first thing you do is split it on
> spaces.
I tried downloading a list of surnames from
https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/,
pasted it into a text file and \copy'ed it into a table. Then I ran
this query:
select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
from apellidos
group by daitch_mokotoff(a)
order by count(*) desc;
so I have a first entry like this
string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
daitch_mokotoff │ 784000
count │ 18
but then I have a bunch of other entries that carry the same code 784000
as one of their alternative codes:
string_agg │ Velazco
daitch_mokotoff │ 784500 784000
count │ 1
string_agg │ Palacio
daitch_mokotoff │ 785000 784000
count │ 1
I suppose I need to group these together somehow, and it would make more
sense to do that if the values were arrays.
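As a rough sketch of that grouping with the function as it stands today
(returning a space-separated string), one could split the string and
unnest it so that each code becomes its own row. This assumes the same
"apellidos" table with a single text column "a" used above; the aliases
"t" and "code" are only illustrative.

```sql
-- Sketch: one group per individual code, not per code string.
-- string_to_array() splits the space-separated result;
-- unnest() turns each element into a separate row.
select t.code,
       string_agg(a, ' ' order by a) as names,
       count(*)
  from apellidos,
       lateral unnest(string_to_array(daitch_mokotoff(a), ' ')) as t(code)
 group by t.code
 order by count(*) desc;
```

With this, a name such as Velazco (784500 784000) is counted under both
of its codes rather than forming a group of its own. If the function
returned an array, the string_to_array() step would not be needed.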
If I scroll a bit further down and choose, say, 794000 (a relatively
popular one), then I have this
string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras Peraza Peres Perez Porras Varas Veras
daitch_mokotoff │ 794000
count │ 14
and looking for that code in the result I also get these three
string_agg │ Barca Barco Parco
daitch_mokotoff │ 795000 794000
count │ 3
string_agg │ Borja
daitch_mokotoff │ 790000 794000
count │ 1
string_agg │ Borjas
daitch_mokotoff │ 794000 794400
count │ 1
and then I see that I should also search for possible matches in codes
795000, 790000 and 794400, so that gives me
string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra Perea Vera
daitch_mokotoff │ 790000
count │ 11
string_agg │ Barriga Borge Borrego Burgo Fraga
daitch_mokotoff │ 795000
count │ 5
string_agg │ Borjas
daitch_mokotoff │ 794000 794400
count │ 1
which look closely related (compare "Veras" in the first set to "Vera"
in the later one). If you ignore that pseudo-match, you're likely to
miss possible family relationships.
I suppose if I were a genealogy researcher, I would be helped by having
each of these codes behave as a separate unit, rather than me having to
split the string into the several possible contained values.
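For illustration only, here is how such a lookup might be written against
the current string-returning function: find every surname that shares at
least one code with a given name. It again assumes the "apellidos" table
with column "a"; the search name 'Veras' is just an example.

```sql
-- Sketch: match on any shared code using the array overlap operator (&&).
select a, daitch_mokotoff(a)
  from apellidos
 where string_to_array(daitch_mokotoff(a), ' ')
       && string_to_array(daitch_mokotoff('Veras'), ' ')
 order by a;
```

The two string_to_array() calls are exactly the splitting step that an
array-returning function would make unnecessary.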
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees." (E. Dijkstra)