From: | Kenneth Marshall <ktm(at)rice(dot)edu> |
---|---|
To: | Sushant Sinha <sushant354(at)gmail(dot)com> |
Cc: | Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-hackers(at)postgresql(dot)org, shamnad(at)gmail(dot)com |
Subject: | Re: dot to be considered as a word delimiter? |
Date: | 2009-06-02 20:57:49 |
Message-ID: | 20090602205749.GJ18879@it.is.rice.edu |
Lists: | pgsql-hackers |
On Tue, Jun 02, 2009 at 04:40:51PM -0400, Sushant Sinha wrote:
> Fair enough. I agree that there is a valid need for returning such tokens as
> a host. But I think there is definitely a need to break it down into
> individual words. This will help in cases when a document is missing a space
> in between the words.
>
>
> So what we can do is: return the entire compound word as Host and also break
> it down into individual words. I can put up a patch for this if you guys
> agree.
>
> Returning multiple tokens for the same word is a feature of the text search
> parser as explained in the documentation here:
> http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html
>
> Thanks,
> Sushant.
>
+1
Ken
> On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall <ktm(at)rice(dot)edu> wrote:
>
> > On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> > > Sushant Sinha <sushant354(at)gmail(dot)com> wrote:
> > >
> > > > I think that dot should be considered as a word delimiter because
> > > > when a dot is not followed by a space, most of the time it is a
> > > > typing error. Besides, there are not many valid English words that
> > > > have a dot in between.
> > >
> > > It's not treating it as an English word, but as a host name.
> > >
> > > select ts_debug('english', 'Mr.J.Sai Deepak');
> > > ts_debug
> > > ---------------------------------------------------------------------------
> > > (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
> > > (blank,"Space symbols"," ",{},,)
> > > (asciiword,"Word, all ASCII",Deepak,{english_stem},english_stem,{deepak})
> > > (3 rows)
> > >
> > > You could run it through a dictionary which would deal with host
> > > tokens differently. Just be aware of what you'll be doing to
> > > www.google.com if you run into it.
> > >
> > > I hope this helps.
> > >
> > > -Kevin
> > >
> >
> > In our uses for full text indexing, it is much more important to
> > be able to find host names and URLs than to find mistyped names.
> > My two cents.
> >
> > Cheers,
> > Ken
> >
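
For illustration only, the behaviour proposed in this thread (return the
dotted compound as a host token and also emit its component words) might be
sketched roughly as below. This is a plain Python stand-in, not the
PostgreSQL text search parser; the names HOST_LIKE and tokenize are
hypothetical, and the host pattern is a deliberately crude assumption.

```python
import re

# Hypothetical, simplified notion of a "host-like" token: two or more
# alphanumeric parts joined by dots. The real parser's rules differ.
HOST_LIKE = re.compile(r"^[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$")


def tokenize(text):
    """Return (type, token) pairs for whitespace-separated chunks.

    A dotted compound yields one "host" token for the whole chunk,
    followed by one "word" token per dot-separated part, so both
    www.google.com and a mistyped Mr.J.Sai stay searchable.
    """
    tokens = []
    for chunk in text.split():
        if HOST_LIKE.match(chunk):
            tokens.append(("host", chunk.lower()))
            tokens.extend(("word", part.lower()) for part in chunk.split("."))
        else:
            tokens.append(("word", chunk.lower()))
    return tokens


if __name__ == "__main__":
    for token in tokenize("Mr.J.Sai Deepak"):
        print(token)
```

With this sketch, a search for either the full host name or any one of its
parts would find the document, which is the trade-off both sides of the
thread were asking for.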