From: | Charlie Hornberger <charlie(at)pressflex(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | full-text indexing, locales, triggers, SPI & more fun |
Date: | 2000-06-01 03:34:36 |
Message-ID: | 200006010334.WAA11409@SLUTMONKEY.K4AZL.NET |
Lists: | pgsql-hackers |
I've been doing some poking at the full-text indexing code in
/contrib/fulltextindex to try to get it to work with non-ASCII locales
(among other things), but I'm having a bit of trouble trying to figure
out how to properly parse non-ASCII strings from inside the fti()
trigger function (which is written in C).
My problem is this:
I want to aggregate text in multiple languages in a single full-text index,
much like the structure used by the current fti() function. In order
to correctly parse the strings, however, I've got to know what locale
they're written in/for (otherwise, isalpha() thinks that characters such as
the Hungarian letter u" -- that's a 'u' with a double acute accent -- aren't
very alphabetic.)
My initial thinking (which could certainly be very wrong) is that the
easiest way to get around this would be to allow client apps to set their
LC_ALL environment variables, and then to have the new fti() function use
that locale while doing string manipulation.
But the way I'm doing things, it doesn't appear that the LC_ALL environment
variable is available. (Maybe it was never meant to be ... but I'm not a
very skilled C programmer, and I don't know the first thing about the SPI
interface, so please forgive me if I'm asking why the sun doesn't rise in
the west more often ;-)).
Here's what's happening:
bash# LC_ALL=hu_HU
bash# export LC_ALL
bash# psql test
Welcome to psql, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help on internal slash commands
\g or terminate with semicolon to execute query
\q to quit
test=# INSERT INTO ttxt (t1) values ('FELELŐSSÉGŰ');
INSERT 513377 1
test=# select * from ttxt_fti;
string | id
--------+--------
felel | 513377
ss | 513377
(2 rows)
Which isn't quite what I'm looking for ;-).
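The split seen above is what a byte-wise isalpha() tokenizer produces in the default "C" locale, where only the ASCII letters count as alphabetic. The following is a minimal sketch of such a tokenizer, not the actual fti() code: split_words(), MAX_WORD_LEN and the two-character minimum are hypothetical, chosen only so that the result matches the output shown, and the input is assumed to be in a single-byte encoding such as ISO-8859-2.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD_LEN 63   /* hypothetical limit; longer words are truncated */
#define MIN_WORD_LEN 2    /* assumption: shorter runs are not indexed */

/*
 * Split text into lower-cased words, classifying each byte with isalpha()
 * under whatever locale is currently active. Stores at most max_words
 * words in out and returns how many were stored.
 */
static size_t
split_words(const char *text, char out[][MAX_WORD_LEN + 1], size_t max_words)
{
    /* unsigned char keeps bytes >= 0x80 in the range isalpha() accepts */
    const unsigned char *p = (const unsigned char *) text;
    size_t nwords = 0;
    size_t len = 0;

    while (nwords < max_words)
    {
        if (*p != '\0' && isalpha(*p))
        {
            if (len < MAX_WORD_LEN)
                out[nwords][len++] = (char) tolower(*p);
        }
        else
        {
            /* a non-alphabetic byte (or the terminator) ends the run */
            if (len >= MIN_WORD_LEN)
            {
                out[nwords][len] = '\0';
                nwords++;
            }
            len = 0;
            if (*p == '\0')
                break;
        }
        p++;
    }
    return nwords;
}
```

With the "C" locale active, the test string from the INSERT above, written as ISO-8859-2 bytes ("FELEL\xD5SS\xC9G\xDB"), comes out as the two words felel and ss; the lone g between the two accented bytes is dropped by the length check.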
Inside the C source of fti(), I added a call to getenv("LC_ALL") to make
sure that LC_ALL really isn't set:
locale = getenv("LC_ALL");
elog(NOTICE,"Locale is '%s'\n",locale);
And sure enough, it outputs:
NOTICE: Locale is '(null)'
If, on the other hand, I do:
setlocale(LC_ALL, "hu_HU");
inside fti(), everything works out perfectly:
test=# INSERT INTO ttxt (t1) values ('FELELŐSSÉGŰ');
INSERT 513410 1
test=# select * from ttxt_fti;
string | id
-------------+--------
felelősségű | 513410
(1 row)
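A setlocale() call made inside fti() stays in effect for the rest of that process, not just for the trigger. A hedged sketch of one way to limit that: remember the current setting, switch, do the string work, then switch back. push_ctype_locale() and pop_ctype_locale() are hypothetical helper names, plain malloc() stands in for whatever allocator the backend code would use, and switching only LC_CTYPE (character classification) rather than LC_ALL is an assumption, not something taken from the original function.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Switch LC_CTYPE to name and return a malloc'd copy of the previous
 * setting, or NULL if the switch failed (in which case nothing changed).
 */
static char *
push_ctype_locale(const char *name)
{
    const char *current = setlocale(LC_CTYPE, NULL);   /* query only */
    char *saved;

    if (current == NULL)
        return NULL;

    /* copy it: a later setlocale() call may overwrite the returned string */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return NULL;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, name) == NULL)
    {
        /* locale unknown or not installed on this system */
        free(saved);
        return NULL;
    }
    return saved;
}

/* Restore the setting returned by push_ctype_locale() and free the copy. */
static void
pop_ctype_locale(char *saved)
{
    if (saved != NULL)
    {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
}
```

The trigger would call push_ctype_locale("hu_HU") before parsing the string and pop_ctype_locale() afterwards, skipping the locale-aware parsing when the push returns NULL.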
Any ideas?
Cheers,
Charlie
P.S. I only subscribe to the hackers digest, so please CC me with your
replies... Thanks!