From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Dag Lem <dag(at)nimrod(dot)no> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Julien Rouhaud <rjuju123(at)gmail(dot)com> |
Subject: | Re: daitch_mokotoff module |
Date: | 2022-12-07 18:56:56 |
Message-ID: | 20221207185656.nlaj2o5kcbgbadcp@awork3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2022-02-03 15:27:32 +0100, Dag Lem wrote:
> Just some minor adjustments to the patch:
>
> * Removed call to locale-dependent toupper()
> * Cleaned up input normalization
This patch currently fails in cfbot, likely because meson.build needs to be
adjusted (this didn't exist at the time you submitted this version of the
patch):
[23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File fuzzystrmatch--1.1.sql does not exist.
> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
> PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
that. In recent years our approach has been to just keep the "base version" of
the upgrade script, with extension creation running through the upgrade
scripts.
>
> +
> +#include "daitch_mokotoff.h"
> +
> +#include "postgres.h"
Postgres policy is that the include of "postgres.h" has to be the first
include in every .c file.
> +#include "utils/builtins.h"
> +#include "mb/pg_wchar.h"
> +
> +#include <string.h>
> +
> +/* Internal C implementation */
> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
> +
> +
> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
> +Datum
> +daitch_mokotoff(PG_FUNCTION_ARGS)
> +{
> + text *arg = PG_GETARG_TEXT_PP(0);
> + char *string,
> + *tmp_soundex;
> + text *soundex;
> +
> + /*
> + * The maximum theoretical soundex size is several KB, however in practice
> + * anything but contrived synthetic inputs will yield a soundex size of
> + * less than 100 bytes. We thus allocate and free a temporary work buffer,
> + * and return only the actual soundex result.
> + */
> + string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), PG_UTF8);
> + tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
Seems that just using StringInfo to hold the soundex output would work better
than a static allocation?
> + if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
We imo shouldn't introduce new functions starting with _.
> +/* Mark soundex code tree node as leaf. */
> +static void
> +set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
> +{
> + if (!node->is_leaf)
> + {
> + node->is_leaf = 1;
> + leaves_next[(*num_leaves_next)++] = node;
> + }
> +}
> +
> +
> +/* Find next node corresponding to code digit, or create a new node. */
> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
> + dm_node * node, char code_digit)
PG code style is to have a line break between a function defintion's return
type and the function name - like you actually do above.
> +/* Mapping from ISO8859-1 to upper-case ASCII */
> +static const char tr_iso8859_1_to_ascii_upper[] =
> +/*
> +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
> +*/
> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ ! ?AAAAAAECEEEEIIIIDNOOOOO*OUUUUYDSAAAAAAECEEEEIIIIDNOOOOO/OUUUUYDY";
> +
> +static char
> +iso8859_1_to_ascii_upper(unsigned char c)
> +{
> + return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
> +}
> +
> +
> +/* Convert an UTF-8 character to ISO-8859-1.
> + * Unconvertable characters are returned as '?'.
> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
> + */
> +static char
> +utf8_to_iso8859_1(char *str, int *ix)
It seems decidedly not great to have custom encoding conversion routines in a
contrib module. Is there any way we can avoid this?
> +/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. */
> +static char *
> +_daitch_mokotoff(char *word, char *soundex, size_t n)
> +{
> + int i = 0,
> + j;
> + int letter_no = 0;
> + int ix_leaves = 0;
> + int num_nodes = 0,
> + num_leaves = 0;
> + dm_codes *codes,
> + *next_codes;
> + dm_node *nodes;
> + dm_leaves *leaves;
> +
> + /* First letter. */
> + if (!(codes = read_letter(word, &i)))
> + {
> + /* No encodable character in input. */
> + return NULL;
> + }
> +
> + /* Allocate memory for node tree. */
> + nodes = palloc(sizeof(dm_nodes));
> + leaves = palloc(2 * sizeof(dm_leaves));
So this allocates the worst case memory usage, is that right? That's quite a
bit of memory. Shouldn't nodes be allocated dynamically?
Instead of carefully freeing individual memory allocations, I think it be
better to create a temporary memory context, allocate the necessary nodes etc
on demand, and destroy the temporary memory context at the end.
> +/* Codes for letter sequence at start of name, before a vowel, and any other. */
> +static dm_codes codes_0_1_X[2] =
Any reason these aren't all const?
It's not clear to me where the intended line between the .h and .c file is.
> +print <<EOF;
> +/*
> + * Types and lookup tables for Daitch-Mokotoff Soundex
> + *
If we generate the code, why is the generated header included in the commit?
> +/* Letter in input sequence */
> +struct dm_letter
> +{
> + char letter; /* Present letter in sequence */
> + struct dm_letter *letters; /* List of possible successive letters */
> + dm_codes *codes; /* Code sequence(s) for complete sequence */
> +};
> +
> +/* Node in soundex code tree */
> +struct dm_node
> +{
> + int soundex_length; /* Length of generated soundex code */
> + char soundex[DM_MAX_CODE_DIGITS + 1]; /* Soundex code */
> + int is_leaf; /* Candidate for complete soundex code */
> + int last_update; /* Letter number for last update of node */
> + char code_digit; /* Last code digit, 0 - 9 */
> +
> + /*
> + * One or two alternate code digits leading to this node. If there are two
> + * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
> + * back to the same node.
> + */
> + char prev_code_digits[2];
> + /* One or two alternate code digits moving forward. */
> + char next_code_digits[2];
> + /* ORed together code index(es) used to reach current node. */
> + int prev_code_index;
> + int next_code_index;
> + /* Nodes branching out from this node. */
> + struct dm_node *next_nodes[DM_MAX_ALTERNATE_CODES + 1];
> +};
> +
> +typedef struct dm_letter dm_letter;
> +typedef struct dm_node dm_node;
Why is all this in the generated header? It needs DM_MAX_ALTERNATE_CODES etc,
but it seems that the structs could just be defined in the .c file.
> +# Table adapted from https://www.jewishgen.org/InfoFiles/Soundex.html
What does "adapted" mean here? And what's the path to updating the data?
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2022-12-07 18:59:43 | Re: Partial aggregates pushdown |
Previous Message | Andres Freund | 2022-12-07 18:23:37 | Re: Transaction timeout |