From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: sushant354(at)gmail(dot)com, Robert Haas <robertmhaas(at)gmail(dot)com>, Markus Wanner <markus(at)bluegap(dot)ch>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: english parser in text search: support for multiple words in the same position
Date: 2010-09-29 05:29:47
Message-ID: AANLkTimOL8U68putCUUyhTnABbXu4pGpx740T4ENb-Wf@mail.gmail.com
Lists: pgsql-hackers
Any updates on this?
On Tue, Sep 21, 2010 at 10:47 PM, Sushant Sinha <sushant354(at)gmail(dot)com> wrote:
> > I looked at this patch a bit. I'm fairly unhappy that it seems to be
> > inventing a brand new mechanism to do something the ts parser can
> > already do. Why didn't you code the url-part mechanism using the
> > existing support for compound words?
>
> I am not familiar with the compound word implementation, so I am not sure
> how to split a URL with compound word support. The documentation for
> compound words does not say much about how the components of a token are
> identified. Does compound word splitting work by matching against a list
> of words? If so, we will not be able to use it, since we do not know all
> the words that can appear in a url/host/email/file.
>
> I think another approach could be to use the dict_regex dictionary
> support. However, we would have to make the regex match whatever the
> parser is recognizing.
>
> The current patch does not invent any new mechanism. It uses the special
> handler mechanism already present in the parser. For example, when the
> current parser finds a URL, it runs a special handler called SpecialFURL,
> which resets the parser position to the start of the token in order to
> find the hostname. After finding the host it moves on to finding the
> path. So you first get the URL, then the host, and finally the path.
>
> Similarly, we reset the parser to the start of the token on finding a
> URL in order to output the URL parts. Then, before entering a state that
> can lead to a URL, we output the URL part. The state machine
> modification is similar for the other token types (file/email/host).
>
>
> > The changes made to parsetext()
> > seem particularly scary: it's not clear at all that that's not breaking
> > unrelated behaviors. In fact, the changes in the regression test
> > results suggest strongly to me that it *is* breaking things. Why are
> > there so many diffs in examples that include no URLs at all?
> >
>
> I think some of the difference comes from the fact that pos now starts
> at 0, where it used to start at 1. That is easily fixable, though.
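The off-by-one can be shown with a toy example (hypothetical Python, not the patch code): tsvector lexeme positions are 1-based, so a counter that starts at 0 shifts every position in the regression output, even for documents with no URLs.

```python
# Hypothetical illustration of the off-by-one: tsvector lexeme positions
# are 1-based, so an internal counter starting at 0 shifts every position.

words = ["fat", "cats", "ate", "rats"]

# 0-based positions, as the patched parser currently emits them:
patched = [(w, i) for i, w in enumerate(words)]
# 1-based positions, as the regression tests expect:
expected = [(w, i + 1) for i, w in enumerate(words)]

print(patched[0], expected[0])  # ('fat', 0) ('fat', 1)
```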
>
> > An issue that's nearly as bad is the 100% lack of documentation,
> > which makes the patch difficult to review because it's hard to tell
> > what it intends to accomplish or whether it's met the intent.
> > The patch is not committable without documentation anyway, but right
> > now I'm not sure it's even usefully reviewable.
>
> I did not provide any explanation because I could not find a place in
> the code to put the documentation (the change is just a modification of
> the state machine). Should I do a separate write-up to explain the
> desired output and the changes made to achieve it?
>
> >
> > In line with the lack of documentation, I would say that the choice of
> > the name "parttoken" for the new token type is not helpful. Part of
> > what? And none of the other token type names include the word "token",
> > so that's not a good decision either. Possibly "url_part" would be a
> > suitable name.
> >
>
> I can modify it to output url-part/host-part/email-part/file-part if
> there is agreement on the rest of the issues. Let me know if I should
> go ahead with this.
>
> -Sushant.
>
>