Re: pg full text search very slow for Chinese characters

From: Cory Nemelka <cnemelka(at)gmail(dot)com>
To: Jimmy Huang <jimmy_huang(at)live(dot)com>
Cc: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: pg full text search very slow for Chinese characters
Date: 2019-09-10 16:00:58
Message-ID: CAMe5Gn2etfyY3579KtZCS2NyD9uvYPMxPe=Ns40RC9RVtAUjwQ@mail.gmail.com
Lists: pgsql-general

Well, there is a lot of information we would need in order to diagnose
this: how much tuning have you done, and so on.

My advice is pretty simple: don't expect performance on a notebook and,
unless you are planning on hosting it on a notebook, use the notebook for
development only. Test performance on a properly configured and tuned
server.

--cnemelka

On Tue, Sep 10, 2019 at 9:53 AM Jimmy Huang <jimmy_huang(at)live(dot)com> wrote:

> Hi Team,
>
> Can anyone shed some light on why postgres 11 is extremely slow in my case?
>
> I am making a mirror of zh.wikisource.org and I have downloaded 303049
> pages and stored them in a postgres 11 database.
>
> My postgres instance is based on docker image postgres:11 and runs on my
> MacBook Pro i7 16GB.
>
> The database schema is as follows:
>
> Table pages(id, url, html, downloaded, inserted_at, updated_at) and
> books(id, name, info, preface, text, html, url, parent_id, inserted_at,
> updated_at, info_html, preface_html)
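>
> (A rough DDL sketch of that schema; the column types are my guesses:)
>
>     CREATE TABLE pages (
>         id          bigserial PRIMARY KEY,
>         url         text,
>         html        text,
>         downloaded  boolean,
>         inserted_at timestamptz,
>         updated_at  timestamptz
>     );
>
>     CREATE TABLE books (
>         id           bigserial PRIMARY KEY,
>         name         text,
>         info         text,
>         preface      text,
>         text         text,
>         html         text,
>         url          text,
>         parent_id    bigint REFERENCES books(id),
>         inserted_at  timestamptz,
>         updated_at   timestamptz,
>         info_html    text,
>         preface_html text
>     );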
>
> A wikisource web page is downloaded and its html text is inserted into
> table “pages”, column “html”.
>
> Later, books.{name, info, preface, text, html, info_html, preface_html}
> are extracted from pages.html. The text column of books is a plain-text
> version of the content of the html column of table pages.
>
> On average the text column of table books contains 7635 characters (each
> character is 3 bytes long because of utf-8 encoding), and I want to add
> full text search to books(text).
>
> I tried pg_trgm and my own custom token parser:
> https://github.com/huangjimmy/pg_cjk_parser
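>
> (The pg_trgm attempt was along these lines; a sketch, assuming the
> extension is available in the postgres:11 image:)
>
>     CREATE EXTENSION IF NOT EXISTS pg_trgm;
>     CREATE INDEX books_text_trgm_idx
>         ON books USING gin (text gin_trgm_ops);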
>
> To my surprise, postgres 11 is extremely slow when creating a full text
> index.
>
> I added a column of tsvector type and tried to create an index on that
> column. Pg had not finished creating the GIN index after a long time, and
> I had to cancel the execution.
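>
> (Schematically, with text_tsv and the text search configuration name as
> placeholders:)
>
>     ALTER TABLE books ADD COLUMN text_tsv tsvector;
>     UPDATE books SET text_tsv = to_tsvector('my_zh_config', text);
>     CREATE INDEX books_text_tsv_idx ON books USING gin (text_tsv);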
>
> I then tried to create a partial full text index covering 500 rows, and
> it took postgres 2 to 3 minutes to create it. Extrapolating from that, pg
> would need at least one day to create a full GIN full-text-search index
> for all 303049 rows of data. I think this is ridiculously slow.
>
> When I tried to create an fts index on books(name) or books(info), it
> took just 3 minutes. However, name and info are extremely short compared
> to books(text).
>
> I switched to Elasticsearch and it turned out that Elasticsearch is
> extremely efficient for my case. It took Elasticsearch 3 hours to index all
> 303049 rows.
>
> Jimmy Huang
>
> jimmy_huang(at)live(dot)com
>
