| From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> | 
|---|---|
| To: | Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> | 
| Cc: | PgSQL General ML <pgsql-general(at)postgresql(dot)org> | 
| Subject: | Re: Initial ugly reverse-translator | 
| Date: | 2008-04-19 16:38:13 | 
| Message-ID: | 10234.1208623093@sss.pgh.pa.us | 
| Lists: | pgsql-general | 
Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> Tom Lane wrote:
>> I don't really see the problem.  I assume from your reference to pg_trgm
>> that you're using trigram similarity as the prefilter for potential
>> matches
> It turns out that's no good anyway, as it appears to ignore characters 
> outside the ASCII range. Rather less than useful for searching a 
> database of translated strings ;-)
A quick look at the pg_trgm code suggests that it is only prepared to
deal with single-byte encodings; if you're working in UTF8, which I
suppose you'd have to be, it's dead in the water :-(.  Perhaps fixing
that should be on the TODO list.
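As a rough illustration of the symptom (a sketch, not pg_trgm's actual code): if trigram extraction keeps only single-byte ASCII characters, a UTF-8 string written entirely in a non-Latin script produces no trigrams at all, so it can never be similar to anything. The helper names `windows`, `byte_trigrams`, `char_trigrams`, and `similarity` are made up for this example, and the overlap ratio used for scoring is an assumption.

```python
def windows(seq, n=3):
    """Return the set of all length-n windows over a sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def byte_trigrams(text):
    """Trigrams over UTF-8 bytes, keeping only ASCII bytes.

    Assumption: this mimics code that is only prepared for
    single-byte encodings and skips everything else.
    """
    data = text.encode("utf-8")
    ascii_only = bytes(b for b in data if b < 0x80)
    return windows(ascii_only)


def char_trigrams(text):
    """Trigrams over Unicode code points (multi-byte aware)."""
    return windows(text)


def similarity(a, b, gram):
    """Shared trigrams divided by all trigrams; 0.0 if there are none."""
    ga, gb = gram(a), gram(b)
    union = ga | gb
    if not union:
        return 0.0
    return len(ga & gb) / len(union)


word = "ошибка"  # Russian for "error", six code points, twelve UTF-8 bytes
print(byte_trigrams(word))                      # nothing survives the ASCII filter
print(len(char_trigrams(word)))                 # code-point trigrams are available
print(similarity(word, word, byte_trigrams))    # identical strings, yet no match
print(similarity(word, word, char_trigrams))
```

With the byte-oriented version, even a string compared against itself scores zero, which matches the "ignores characters outside the ASCII range" behaviour reported above.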
But in any case maybe the full-text-search stuff would be more useful
as a prefilter?  Although honestly, for the speed we need here, I'm
not sure a prefilter is needed at all.  Full text might be useful
if a LIKE-based match fails, though.
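A minimal sketch of that ordering, under several assumptions: SQLite (via Python's `sqlite3`) stands in for the real database, a simple word-overlap ranking stands in for full text search, and the `catalog` table, its columns, the sample rows, and the helpers `like_pattern`, `words`, and `lookup` are all hypothetical. It also assumes templates contain only `%s` and `%d` placeholders.

```python
import re
import sqlite3


def like_pattern(template):
    """Turn a printf-style template into a LIKE pattern.

    Placeholders become the LIKE wildcard; literal %, _ and \\ are escaped.
    """
    out = []
    for part in re.split(r"(%[sd])", template):
        if part in ("%s", "%d"):
            out.append("%")
        else:
            part = part.replace("\\", "\\\\")
            out.append(part.replace("%", "\\%").replace("_", "\\_"))
    return "".join(out)


def words(text):
    """Lower-cased word set, used only by the fallback ranking."""
    return set(re.findall(r"\w+", text.lower()))


def lookup(conn, message):
    """Try the LIKE match first; rank by word overlap only if it fails."""
    rows = conn.execute(
        "SELECT msgid FROM catalog WHERE ? LIKE pattern ESCAPE '\\'",
        (message,),
    ).fetchall()
    if rows:
        return [row[0] for row in rows]

    wanted = words(message)
    scored = []
    for msgid, msgstr in conn.execute("SELECT msgid, msgstr FROM catalog"):
        overlap = len(wanted & words(msgstr))
        if overlap:
            scored.append((overlap, msgid))
    scored.sort(reverse=True)
    return [msgid for _, msgid in scored]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (msgid TEXT, msgstr TEXT, pattern TEXT)")
# Illustrative sample rows only, not taken from a real message catalog.
samples = [
    ('relation "%s" does not exist', "Relation »%s« existiert nicht"),
    ('column "%s" does not exist', "Spalte »%s« existiert nicht"),
]
for msgid, msgstr in samples:
    conn.execute(
        "INSERT INTO catalog VALUES (?, ?, ?)",
        (msgid, msgstr, like_pattern(msgstr)),
    )

exact = lookup(conn, "Relation »foo« existiert nicht")
mangled = lookup(conn, 'Relation "foo" existiert nicht')  # quotes changed
print(exact)
print(mangled)
```

The second lookup has its quote characters altered, so the LIKE pattern no longer matches and the result comes from the fallback ranking instead, with the closest template listed first.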
>> (And besides, speed doesn't seem like the be-all and end-all here.)
> True. It's not so much the speed as the fragility when faced with small 
> changes to formatting. In addition to whitespace, some clients mangle 
> punctuation with features like automatic "curly"-quoting.
Yeah.  I was wondering whether encoding differences wouldn't be a huge
problem in practice, as well.
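One way to reduce that fragility is to normalize both the stored strings and the incoming text before comparing them. The sketch below is only an assumption about what such a step could look like: the `normalize` helper and the particular set of quote characters it folds are invented for illustration. It also presumes the input has already been decoded to text correctly, so it covers differences in Unicode composition but not text that arrived in the wrong encoding.

```python
import unicodedata

# Map common "smart" quotes to their plain ASCII equivalents.
_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize(text):
    """Fold composition forms, curly quotes, and runs of whitespace."""
    text = unicodedata.normalize("NFC", text)   # one form for accented letters
    text = text.translate(_QUOTES)              # undo automatic curly-quoting
    return " ".join(text.split())               # collapse and trim whitespace


pasted = "could not  open\tfile \u201cfoo\u201d\n"
print(normalize(pasted))
```

Applying the same function to both sides of the comparison keeps whitespace and quoting differences introduced by a mail client or word processor from defeating an otherwise exact match.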
regards, tom lane