Re: Out of the box, full text search feature suggestion for postgresql

From: Artur Zakirov <zaartur(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: aa <ghevge(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: Out of the box, full text search feature suggestion for postgresql
Date: 2024-01-02 17:20:51
Message-ID: CAKNkYnzheAEsB9MM6b9jEBn+W7j1T5Qh6OyogH3f8ZX8M+9gkw@mail.gmail.com
Lists: pgsql-bugs

On Thu, 28 Dec 2023 at 17:46, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
> On Thu, Dec 28, 2023 at 10:15:07AM -0500, aa wrote:
> > Hello Postgres Team!
> >
> > First of all, a big THANK YOU for the great work you folks are doing!
> >
> > The reason I am writing to you is to suggest a feature in future Postgres
> > versions, a feature that is partially there but is not quite where it should be
> > in my opinion: the full text search functionality. This functionality in my
> > opinion, should be available out of the box, for any possible language
> > available, including east Asia character based languages. You would probably
> > say that this will require a huge amount of work, and I would say, a postgres
> > extension which does exactly this, already exists, and it is called : pgroonga
> > (https://pgroonga.github.io/)
>
> Please explain how this is different from what we already have:
>
> https://www.postgresql.org/docs/current/textsearch.html

I'm not familiar with pgroonga, but the main issue with the built-in text
search is that it cannot properly tokenize Asian languages and many
others.

Here the default parser fails to split Japanese text into words and
returns the whole sentence as a single token:

=# select * from ts_parse('default', 'これはペンです');
 tokid |     token
-------+----------------
     2 | これはペンです

Unlike Latin:

=# select * from ts_parse('default', 'this is a pen');
 tokid | token
-------+-------
     1 | this
    12 |
     1 | is
    12 |
     1 | a
    12 |
     1 | pen
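
A rough sketch of why the two results differ, written in Python rather
than taken from the parser's C code. It assumes, as a simplification,
that a tokenizer only cuts text where it sees a non-word character such
as a space. The function name split_on_separators is made up for this
illustration and is not part of Postgres:

```python
import re


def split_on_separators(text: str) -> list[str]:
    """Return runs of word characters, cutting only at non-word characters.

    This is a simplified stand-in for a separator-driven tokenizer,
    not the actual logic of the Postgres default parser.
    """
    return re.findall(r"\w+", text)


# English words are separated by spaces, so every word becomes a token.
latin = split_on_separators("this is a pen")

# The Japanese sentence has no spaces, so it stays one token and the
# word "ペン" (pen) never appears as a token of its own.
japanese = split_on_separators("これはペンです")

print(latin)
print(japanese)
```

A parser for Japanese would instead need word boundaries from a
dictionary or from n-grams, which this kind of separator-based
splitting cannot provide.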

To add support for Japanese (and other languages) it is necessary to
write a new parser or fix the existing default parser.

On the other hand, pgroonga's source code looks complex, and I doubt
that there are pgsql-hackers who know both it and the target languages
well enough to port it to Postgres core.

--
Artur
