Improving FTS for Greek

From: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Improving FTS for Greek
Date: 2023-06-03 17:47:12
Message-ID: 9E76CD3A-646A-460D-A4D2-8A56E99DD4D8@gmail.com
Lists: pgsql-hackers

I posted earlier on pgsql-general that I had realised there’s no greek.stop under $(pg_config --sharedir)/tsearch_data.

And indeed, it looks like stop words are kept in the output of to_tsvector('greek', ...).
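
One way to see this per token is ts_debug, which reports the dictionary that handled each token and the lexemes it produced. This is only a sketch; the sample sentence is taken from the comparison further down. A token that a dictionary treats as a stop word shows an empty lexemes array ({}), so with the stock 'greek' configuration short words such as 'ο' would be expected to come back with a lexeme instead.

```sql
-- Inspect how the built-in 'greek' configuration handles each token.
-- An empty array in "lexemes" means the token was dropped as a stop word.
SELECT alias, token, dictionary, lexemes
FROM ts_debug('greek', 'ο γιώργος είναι πονηρός');
```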

I wrote an extension, https://github.com/Florents-Tselai/pg_fts_greek, that adds another regconfig, 'greek_ext'.
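
For readers unfamiliar with how a configuration like this can be built, here is a minimal sketch. It is not necessarily how pg_fts_greek does it. It assumes a stop-word file named greek.stop has already been installed under $(pg_config --sharedir)/tsearch_data, and the dictionary name greek_ext_stem is made up for illustration.

```sql
-- Assumption: greek.stop exists in $(pg_config --sharedir)/tsearch_data.
-- greek_ext_stem is a hypothetical name, not taken from the extension.
CREATE TEXT SEARCH DICTIONARY greek_ext_stem (
    TEMPLATE  = snowball,
    Language  = greek,
    StopWords = greek      -- resolves to tsearch_data/greek.stop
);

-- Start from the stock configuration so parser and mappings stay the same.
CREATE TEXT SEARCH CONFIGURATION greek_ext (COPY = greek);

-- Swap the stock stemmer for the stop-word-aware one wherever it is mapped.
ALTER TEXT SEARCH CONFIGURATION greek_ext
    ALTER MAPPING REPLACE greek_stem WITH greek_ext_stem;
```

Copying the existing configuration keeps everything else identical, so any difference in output comes from the stop-word list alone.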

Here’s how the results compare

t:                            'το τετράγωνο της υποτείνουσας ενός ορθογωνίου τριγώνου'
to_tsvector('greek', t):      'εν':5 'ορθογων':6 'τ':3 'τετραγων':2 'το':1 'τριγων':7 'υποτεινουσ':4
to_tsvector('greek_ext', t):  'εν':5 'ορθογων':6 'τετραγων':2 'τριγων':7 'υποτεινουσ':4

t:                            'ο γιώργος είναι πονηρός'
to_tsvector('greek', t):      'γιωργ':2 'εινα':3 'ο':1 'πονηρ':4
to_tsvector('greek_ext', t):  'γιωργ':2 'πονηρ':4

t:                            'ο ήλιος ο πράσινος o ήλιος που ανατέλλει'
to_tsvector('greek', t):      'o':5 'ανατελλ':8 'ηλι':2,6 'ο':1,3 'π':7 'πρασιν':4
to_tsvector('greek_ext', t):  'ανατελλ':8 'ηλι':2,6 'πρασιν':4
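
A comparison like this can be reproduced with a query along the following lines. It is a sketch that assumes the 'greek_ext' configuration from the extension is installed in the current database; the column aliases are arbitrary.

```sql
-- Run both configurations over the same sample sentences, side by side.
SELECT t,
       to_tsvector('greek', t)     AS greek,
       to_tsvector('greek_ext', t) AS greek_ext
FROM (VALUES
    ('το τετράγωνο της υποτείνουσας ενός ορθογωνίου τριγώνου'),
    ('ο γιώργος είναι πονηρός'),
    ('ο ήλιος ο πράσινος o ήλιος που ανατέλλει')
) AS s(t);
```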

There’s an earlier related patch [0], but it was never merged. I’ve included its stop words and added some more (details in README.md).

For my personal projects, it looks like it yields much better results.

I’d like some feedback on the extension, particularly on the installation infrastructure (I’m not sure I’ve handled the permissions in the .sql files properly).

I’ll then try to make a .patch for this.

[0] https://www.postgresql.org/message-id/flat/e1c79330-48a5-abef-c309-8d4499e3180b%402ndquadrant.com#7431fdb9ae24b694155aef3f040b7b60
