Re: Full text: Ispell dictionary

From: Oleg Bartunov <obartunov(at)gmail(dot)com>
To: Tim van der Linden <tim(at)shisaa(dot)jp>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Full text: Ispell dictionary
Date: 2014-05-07 20:00:00
Message-ID: CAF4Au4wytyVOvOwHH_Aft+HRXutcBShHoKFkJmOVaJdAsruJ9A@mail.gmail.com
Lists: pgsql-general

btw, take a look at contrib/dict_xsyn, it's more powerful than
the synonym dictionary.
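
A minimal sketch of trying dict_xsyn, assuming the contrib module is available on the server and that a rules file named my_rules.rules (a hypothetical name, with lines of the form "word syn1 syn2 syn3") has been placed in $SHAREDIR/tsearch_data:

```sql
-- Install the contrib module; it provides the xsyn_template template
-- and a ready-made dictionary named xsyn.
CREATE EXTENSION dict_xsyn;

-- my_rules is a hypothetical rules file name (assumption, see above).
ALTER TEXT SEARCH DICTIONARY xsyn (
    RULES = 'my_rules',
    KEEPORIG = false,      -- do not emit the original word itself
    MATCHSYNONYMS = false  -- only the first word of each rule line is matched
);

-- Inspect what the dictionary emits for a single token.
SELECT ts_lexize('xsyn', 'word');
```

The KEEPORIG, KEEPSYNONYMS, MATCHORIG and MATCHSYNONYMS options control which words are matched and which are emitted, so the number of lexemes per token can be tuned, unlike with ispell.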

On Sat, May 3, 2014 at 2:26 AM, Tim van der Linden <tim(at)shisaa(dot)jp> wrote:
> Hi Oleg
>
> Haha, understood!
>
> Thanks for helping me on this one.
>
> Cheers
> Tim
>
>
> On May 3, 2014 7:24:08 AM GMT+09:00, Oleg Bartunov <obartunov(at)gmail(dot)com>
> wrote:
>>
>> Tim,
>>
>> you answered your own question: don't use ispell :)
>>
>> On Sat, May 3, 2014 at 1:45 AM, Tim van der Linden <tim(at)shisaa(dot)jp> wrote:
>>>
>>> On Fri, 2 May 2014 21:12:56 +0400
>>> Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:
>>>
>>> Hi Oleg
>>>
>>> Thanks for the response!
>>>
>>>> Yes, that's normal for an ispell dictionary; think of it as a
>>>> morphological dictionary.
>>>
>>>
>>> Hmm, I see, that makes sense. I thought the morphological aspect of
>>> Ispell only dealt with splitting up compound words, but it also
>>> reduces a word to a more "stem"-like form, correct?
>>>
>>> As a last question on this, is there a way to stop this dictionary from
>>> emitting multiple lexemes?
>>>
>>>
>>> The reason I am asking is that, in my (fairly new) understanding of
>>> PostgreSQL's full text search, it is always best to store as few lexemes
>>> as possible in the vector, to get smaller indexes and faster matching
>>> afterwards. Also, when you run a tsquery afterwards, you can still employ
>>> the power of these multiple lexemes to find a match.
>>>
>>> Or...probably answering my own question...if I do not desire this
>>> behavior I should maybe not use Ispell and simply use another dictionary :)
>>>
>>> Thanks again.
>>>
>>> Cheers,
>>> Tim
>>>
>>>> On Fri, May 2, 2014 at 11:54 AM, Tim van der Linden <tim(at)shisaa(dot)jp>
>>>> wrote:
>>>>>
>>>>> Good morning/afternoon all
>>>>>
>>>>> I am currently writing a few articles about PostgreSQL's full text
>>>>> capabilities and have a question about the Ispell dictionary which I
>>>>> cannot seem to find an answer to. It is probably a very simple issue, so
>>>>> forgive my ignorance.
>>>>>
>>>>> In one article I am explaining dictionaries and I have set up a
>>>>> sample configuration which maps most token categories to only an Ispell
>>>>> dictionary (timusan_ispell) which has a default configuration:
>>>>>
>>>>> CREATE TEXT SEARCH DICTIONARY timusan_ispell (
>>>>> TEMPLATE = ispell,
>>>>> DictFile = en_us,
>>>>> AffFile = en_us,
>>>>> StopWords = english
>>>>> );
>>>>>
>>>>> When I run a simple query like "SELECT
>>>>> to_tsvector('timusan-ispell','smiling')" I get back the following tsvector:
>>>>>
>>>>> 'smile':1 'smiling':1
>>>>>
>>>>> As you can see I get two lexemes with the same pointer.
>>>>> The question here is: why does this happen?
>>>>>
>>>>> Is it normal behavior for the Ispell dictionary to emit multiple
>>>>> lexemes for a single token? And if so, is this efficient? I
>>>>> mean, why could it not simply save one lexeme 'smile' which (same as
>>>>> the snowball dictionary) would match 'smiling' as well if later matched with
>>>>> the accompanying tsquery?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Cheers,
>>>>> Tim
>>>>>
>>>>>
>>>
>>>
>>>
>>> --
>>> Tim van der Linden <tim(at)shisaa(dot)jp>
