From: | Anthony Gentile <asgentile(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | N-grams |
Date: | 2011-01-13 02:37:42 |
Message-ID: | AANLkTi=Gs8obcr_suRmEOUYUXpYRVNGzO9s2TWWMqn2m@mail.gmail.com |
Lists: | pgsql-hackers |
Hello,
Today I was reading a blog post from a coworker
http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/ and
started to experiment with the trigram contrib package for Postgres, trying
out different word dictionaries for English and German. I wanted to see how
well particular queries would perform if SIGLENINT in trgm.h were set to the
average character length for a particular word dictionary:
http://packages.ubuntu.com/dapper/wamerican
compling=# SELECT AVG(LENGTH(CAST(word AS bytea), 'UTF8')) FROM
english_words;
avg
--------------------
8.4498980409662267
vs
http://packages.ubuntu.com/dapper/wngerman
compling=# SELECT AVG(LENGTH(CAST(word AS bytea), 'UTF8')) FROM words; -- German
avg
---------------------
11.9518056504365566
(Unsurprisingly, German words are on average longer than English ones.)
Effectively, I want the trigram package to act more like an n-gram
implementation, where I explicitly set N at build time. I am, however, not
very proficient in C, and I doubt that SIGLENINT is the only change needed to
convert the trigram contrib module to n-grams: after changing SIGLENINT to 12
in trgm.h I still get trigram results from show_trgm(). I was hoping someone
familiar with the code could outline the changes needed to make the trigram
implementation behave as an n-gram one. Thanks for your time; I appreciate
any advice anyone can give me.
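
To make the request concrete, here is a minimal standalone sketch of the behaviour being asked for, with N as a parameter. It is not taken from pg_trgm: `extract_ngrams` and its fixed buffer sizes are hypothetical names and limits chosen for illustration. It pads each word with N - 1 leading spaces and one trailing space, which is an assumed generalization of pg_trgm's documented rule of two leading spaces and one trailing space for trigrams. It works on bytes, so it is only correct for single-byte text.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of pg_trgm: copies every n-gram of `word`
 * into `out` (each slot holds up to 15 bytes plus a terminating NUL) and
 * returns how many were produced, or -1 if the arguments do not fit.
 *
 * Assumption: the word is padded with (n - 1) leading spaces and one
 * trailing space, which matches pg_trgm's padding when n = 3.
 */
static int
extract_ngrams(const char *word, int n, char out[][16], int max_out)
{
    char   padded[256];
    size_t wlen = strlen(word);
    size_t plen;
    int    count = 0;

    /* Reject sizes that would overflow the fixed illustration buffers. */
    if (n < 1 || n > 15 || wlen + (size_t) n >= sizeof(padded))
        return -1;

    /* Build the padded word: leading spaces, the word, one trailing space. */
    memset(padded, ' ', (size_t) (n - 1));
    memcpy(padded + (n - 1), word, wlen);
    padded[(size_t) (n - 1) + wlen] = ' ';
    plen = (size_t) (n - 1) + wlen + 1;
    padded[plen] = '\0';

    /* Slide a window of n bytes over the padded word. */
    for (size_t i = 0; i + (size_t) n <= plen && count < max_out; i++)
    {
        memcpy(out[count], padded + i, (size_t) n);
        out[count][n] = '\0';
        count++;
    }
    return count;
}
```

Calling `extract_ngrams("cat", 3, grams, 8)` with `char grams[8][16]` yields the four grams `"  c"`, `" ca"`, `"cat"` and `"at "`, while the same call with N set to 4 yields `"   c"`, `"  ca"`, `" cat"` and `"cat "`. Unlike show_trgm(), this sketch neither sorts nor de-duplicates its output.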
Anthony Gentile