Re: Full text: Ispell dictionary

From: Tim van der Linden <tim(at)shisaa(dot)jp>
To: obartunov(at)gmail(dot)com,Oleg Bartunov <obartunov(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Full text: Ispell dictionary
Date: 2014-05-02 22:26:34
Message-ID: 736def44-35ea-4f05-897c-609b06ab3db7@email.android.com
Lists: pgsql-general

Hi Oleg

Haha, understood!

Thanks for helping me on this one.

Cheers
Tim

On May 3, 2014 7:24:08 AM GMT+09:00, Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:
>Tim,
>
>you answered it yourself - don't use ispell :)
>
>On Sat, May 3, 2014 at 1:45 AM, Tim van der Linden <tim(at)shisaa(dot)jp>
>wrote:
>> On Fri, 2 May 2014 21:12:56 +0400
>> Oleg Bartunov <obartunov(at)gmail(dot)com> wrote:
>>
>> Hi Oleg
>>
>> Thanks for the response!
>>
>>> Yes, it's normal for an ispell dictionary; think of it as a
>morphological dictionary.
>>
>> Hmm, I see, that makes sense. I thought the morphological aspect of
>the Ispell dictionary only dealt with splitting up compound words, but
>it also reduces the word to a more stem-like form, correct?
>>
>> As a last question on this, is there a way to stop this dictionary
>from emitting multiple lexemes?
>>
>> The reason I am asking is that, in my (fairly new) understanding of
>PostgreSQL's full text search, it is always best to have as few lexemes
>as possible saved in the vector. This gives smaller indexes and faster
>matching afterwards. Also, if you run the search terms through a
>tsquery afterwards, you can still employ the power of these multiple
>lexemes to find a match.
>>
>> Or...probably answering my own question...if I do not desire this
>behavior I should maybe not use Ispell and simply use another
>dictionary :)
>>
>> Thanks again.
>>
>> Cheers,
>> Tim
>>
>>> On Fri, May 2, 2014 at 11:54 AM, Tim van der Linden <tim(at)shisaa(dot)jp>
>wrote:
>>> > Good morning/afternoon all
>>> >
>>> > I am currently writing a few articles about PostgreSQL's full text
>capabilities and have a question about the Ispell dictionary which I
>cannot seem to find an answer to. It is probably a very simple issue,
>so forgive my ignorance.
>>> >
>>> > In one article I am explaining dictionaries and I have set up
>a sample configuration which maps most token categories to use only an
>Ispell dictionary (timusan_ispell) which has a default configuration:
>>> >
>>> > CREATE TEXT SEARCH DICTIONARY timusan_ispell (
>>> > TEMPLATE = ispell,
>>> > DictFile = en_us,
>>> > AffFile = en_us,
>>> > StopWords = english
>>> > );
>>> >
>>> > When I run a simple query like "SELECT
>to_tsvector('timusan-ispell','smiling')" I get back the following
>tsvector:
>>> >
>>> > 'smile':1 'smiling':1
>>> >
>>> > As you can see, I get two lexemes with the same position.
>>> > The question here is: why does this happen?
>>> >
>>> > Is it normal behavior for the Ispell dictionary to emit multiple
>lexemes for a single token? And if so, is this efficient? I mean, why
>could it not simply save one lexeme 'smile' which (same as the snowball
>dictionary) would match 'smiling' as well if later matched with the
>accompanying tsquery?
>>> >
>>> > Thanks!
>>> >
>>> > Cheers,
>>> > Tim
>>
>>
>> --
>> Tim van der Linden <tim(at)shisaa(dot)jp>
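
For reference, the two options discussed in this thread can be compared with a minimal sketch like the one below. It assumes the timusan_ispell dictionary from the quoted message already exists and that the en_us dictionary and affix files are installed in the server's tsearch_data directory. The configuration name timusan_snowball is hypothetical and only used for illustration; english_stem is the built-in Snowball stemmer.

```sql
-- Inspect what each dictionary returns for a single token.
-- The thread reports that the Ispell dictionary yields both
-- 'smiling' and 'smile' for this input.
SELECT ts_lexize('timusan_ispell', 'smiling');  -- Ispell dictionary from the thread
SELECT ts_lexize('english_stem', 'smiling');    -- built-in Snowball stemmer

-- Hypothetical configuration (name is an assumption) that routes word
-- tokens to the Snowball stemmer only, so each token emits one lexeme.
CREATE TEXT SEARCH CONFIGURATION timusan_snowball (COPY = pg_catalog.simple);

ALTER TEXT SEARCH CONFIGURATION timusan_snowball
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_stem;

-- Build a vector with the Snowball-only configuration.
SELECT to_tsvector('timusan_snowball', 'smiling');
```

Running the same search terms through to_tsquery with the same configuration normalizes them the same way, which is why a single stemmed lexeme is enough to match.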
