From: | "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: plan_create_index_workers doesn't account for TOAST |
Date: | 2023-06-29 19:51:10 |
Message-ID: | 950a26ba-ed90-6b24-3335-05e401739873@postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 6/29/23 10:12 AM, Jonathan S. Katz wrote:
> Hi,
>
> plan_create_index_workers[1] does not consider the amount of tuples
> existing in TOAST pages when determining the number of parallel workers
> to use for a build. The estimation comes from estimate_rel_size[2],
> which in this case, will just take the value from rel->rd_rel->relpages.
>
> We probably don't notice this much with B-trees, given a B-tree is
> typically used for data that does not require toasting. However, this
> becomes more visible when working on custom index access methods that
> implement their own parallel build strategy.
>
> For example, pgvector[3] provides its own data types and index access
> method for indexing vector data. Vectors can get quite large fairly
> quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes
> on disk, which quickly clears the default TOAST tuple threshold.
>
> In a recent patch proposal to allow for building indexes in parallel[4],
> I performed a few experiments on how many parallel workers would be
> spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with
> EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader
> participation, but the leader is not considered in
> plan_create_index_workers.
>
> With EXTENDED, plan_create_index_workers recommended 2 workers. The
> build time was ~2x faster than the serial build.
>
> With PLAIN, plan_create_index_workers recommended 4 workers. The build
> time was **~3X faster** than the serial build.
>
> (I've been doing more detailed, less hand-waivy performance testing, but
> I wanted to provide directional numbers here)
>
> It seems like we're leaving some performance for columns with TOASTed
> data that require indexing, so I wanted to propose allowing the pages in
> TOASTed tables to be considered when we're trying to index a column with
> TOASTed attributes.
Just to add to this: there is a lever to get more parallel workers by
setting "min_parallel_table_scan_size" to a lower value, which does help
in this case. However, it does mask the fact that a large chunk of the
data required to build the index exists in the TOAST table, which is not
intuitive to a user who rarely has to use tuning parameters.
Thanks,
Jonathan
From | Date | Subject | |
---|---|---|---|
Next Message | Nathan Bossart | 2023-06-29 20:29:40 | Re: pgsql: Fix search_path to a safe value during maintenance operations. |
Previous Message | Tristan Partin | 2023-06-29 19:17:30 | Re: Meson build updates |