Re: ADD COLUMN ts tsvector GENERATED too slow

From: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
To: Florents Tselai <florents(dot)tselai(at)gmail(dot)com>
Cc: Francisco Olarte <folarte(at)peoplecall(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: ADD COLUMN ts tsvector GENERATED too slow
Date: 2022-07-06 10:43:47
Message-ID: D8E9BE00-14C4-4964-8DB5-D91A3FDA9456@gmail.com
Lists: pgsql-general

> On 6 Jul 2022, at 1:35 PM, Florents Tselai <florents(dot)tselai(at)gmail(dot)com> wrote:
>
>
>
>> On 6 Jul 2022, at 1:11 PM, Francisco Olarte <folarte(at)peoplecall(dot)com> wrote:
>>
>> On Wed, 6 Jul 2022 at 11:55, Florents Tselai <florents(dot)tselai(at)gmail(dot)com> wrote:
>>> Also, fwiw looking at top the CPU% and MEM% activity, looks like it does data crunching work.
>> ...
>>>>> On 06.07.22 10:42, Florents Tselai wrote:
>>>>>> I have a beefy server (40+ worker processes , 40GB+ shared buffers) and a table holding (key text, text text,) of around 50M rows.
>>>>>> These are text fields extracted from 4-5 page pdfs each.
>>
>> How big is your table? From your query it seems you expect more than
>> 1M-1 ( left... ), but if you have very big text columns it may be
>> spending a lot of time fully decompressing / reading them ( I'm not
>> sure if left(..) on toasted values is optimized to stop after
>> reading enough ). Also, it has to rewrite a lot of data to insert the
>> columns. If it takes some milliseconds per row, which I would not
>> discard, then 50M rows * 1 ms/row = 50k secs ~= 13.9 hours for each
>> ms per row, so at 2 ms ( which may be right for reading a big row,
>> calculating the vector and writing an even bigger row ) it would take
>> more than a day to finish, which I would not discard given you are
>> asking for a heavy thing.
>
> 50M+ rows and iirc pg_relation_size was north of 80GB or so.
>
>>
>> If you have stopped it I would try doing a 1000 row sample in a copied
>
> Haven’t stopped it, as I’m not convinced there’s an alternative to just
> waiting for it to complete :/
>
>> table to get a speed idea. Otherwise, with this query, I would
>> normally monitor growth of the disk files as an indication of
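Francisco's sample-table suggestion can be sketched roughly like this. The table and column names (`docs`, `key`, `text`) are taken from the thread's description; the exact generated-column expression from the original ALTER isn't shown in this excerpt, so the `to_tsvector` call below is purely illustrative:

```sql
-- Hypothetical sketch: copy a small sample, then time the same
-- generated-column ALTER on it to extrapolate total runtime.
CREATE TABLE sample_docs AS
SELECT key, text
FROM docs          -- assumed name of the 50M-row table
LIMIT 1000;

\timing on
ALTER TABLE sample_docs
    ADD COLUMN ts tsvector
    GENERATED ALWAYS AS (to_tsvector('english', text)) STORED;
-- Multiply the elapsed time by ~50,000 (50M / 1000) for a rough estimate.
```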

Actually, I monitored my disk usage and it was **definitely** working,
as it had already eaten up an additional 30% of my disk capacity.
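For reference, that kind of growth can also be watched from another session with a one-liner. Since the table rewrite goes into new relation files that only show up in the catalog at commit, a database-wide size check is more reliable than sizing the old relation:

```sql
-- Reports the on-disk size of the current database's directory,
-- which includes the new files being written by the rewrite.
SELECT pg_size_pretty(pg_database_size(current_database()));
```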

Thus, I’ll have to fall back on my initial solution and use GIN indexes
to compute the tsvectors on the fly.
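For that fallback, the usual pattern is a GIN expression index, so no tsvector column is stored at all. A minimal sketch, assuming the table/column names from the thread and an 'english' text search configuration:

```sql
-- Expression index: the index itself holds the vectors; nothing is
-- materialized in the table.
CREATE INDEX docs_text_fts_idx
    ON docs
    USING gin (to_tsvector('english', text));

-- Queries must repeat the exact same expression to use the index:
SELECT key
FROM docs
WHERE to_tsvector('english', text) @@ to_tsquery('english', 'example');
```

The trade-off versus the generated column is index-build time once up front and re-computation of `to_tsvector` for rows not answered from the index, in exchange for avoiding the full-table rewrite.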

>> progress, I'm not sure there is another thing you could look at
>> without disturbing it.
>>
>> FWIW, I would consider high mem usage normal in these kind of query,
>> hi cpu would depend on what you call it, but it wouldn't surprise me
>> if it has at least one cpu running at full detoasting and doing
>> vectors, I do not know if alter table can go paralell..
>>
>
> You’re probably right, a lot of the CPU usage could be detoasting.
>
>> Francisco Olarte.

Thanks everyone for your comments.
You can consider this solved.
