From: | Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
Subject: | Re: PATCH: Update snowball stemmers |
Date: | 2018-09-25 11:45:08 |
Message-ID: | 20180925114506.GA14666@zakirov.localdomain |
Lists: | pgsql-hackers |
On Mon, Sep 24, 2018 at 05:36:39PM -0400, Tom Lane wrote:
> I reviewed and pushed this.
Great! Thank you.
> As a cross-check on the patch, I cloned the Snowball github repo
> and built the derived files in it. I noticed that they'd incorporated
> several new stemmers since 2007 --- not only your Nepali one, but
> half a dozen more besides. Since the point here is (IMO) mostly to
> follow their lead on what's interesting, I went ahead and added those
> as well.
Agreed. It is a good decision, and it may attract more users.
> Although I added nepali.stop from the other patch, I've not done
> anything about updating our other stopword lists. Presumably those
> are a bit obsolete by now as well. I wonder if we can prevail on
> the Snowball people to make those available in some less painful way
> than scraping them off assorted web pages. Ideally they'd stick them
> into their git repo ...
They have a repository, snowball-website [1], which is the source
repository for the snowballstem.org website. It also stores stopwords for
various languages (for example, for English [2]). I checked a couple of
languages. Their Russian and Danish stopword lists seem to match
PostgreSQL's stopword lists, but their English stopword list is different.
Stopword lists are missing there for the following languages:
- arabic
- irish
- lithuanian
- nepali - I can create a pull request to add it to snowball-website
- tamil
There is also another project, called Stopwords ISO [3], but I'm not
sure about it. It collects stopword lists from various sources, and it
has stopwords for the languages mentioned above, except for Nepali and
Tamil.
I think I could write a script that generates our stopword files from
the snowball-website repository. It could be run periodically. Would that
be suitable? It would also be good to move the missing stopword lists from
Stopwords ISO to snowball-website...
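
To make the idea more concrete, here is a rough sketch of what such a script could do for one language. It is not an existing tool. It assumes that the stop.txt files in snowball-website keep stop words at the start of a line and use a vertical bar to begin a comment, and that the output should have one word per line like PostgreSQL's *.stop files. The function names and file paths are hypothetical, and fetching the files from the repository is left out.

```python
from typing import Iterable, List


def extract_stopwords(lines: Iterable[str]) -> List[str]:
    """Collect stop words from Snowball-style lines, keeping first-seen order."""
    seen = set()
    words = []
    for line in lines:
        # Assumption: everything after a vertical bar is a comment.
        content = line.split("|", 1)[0]
        # Blank lines and comment-only lines yield no tokens here.
        for word in content.split():
            if word not in seen:
                seen.add(word)
                words.append(word)
    return words


def to_stop_file(words: Iterable[str]) -> str:
    """Format words one per line, as in PostgreSQL's *.stop files."""
    return "".join(f"{word}\n" for word in words)


def convert_file(src_path: str, dst_path: str) -> int:
    """Convert one stop.txt (hypothetical path) and return the word count."""
    with open(src_path, encoding="utf-8") as src:
        words = extract_stopwords(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(to_stop_file(words))
    return len(words)
```

Running convert_file("algorithms/english/stop.txt", "english.stop") on a local checkout would then produce a candidate file that can be diffed against the list we ship now.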
1 - https://github.com/snowballstem/snowball-website/tree/master/algorithms
2 - https://github.com/snowballstem/snowball-website/blob/master/algorithms/english/stop.txt
3 - https://github.com/stopwords-iso
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company