Re: plan_create_index_workers doesn't account for TOAST

From: "Jonathan S(dot) Katz" <jkatz(at)postgresql(dot)org>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: plan_create_index_workers doesn't account for TOAST
Date: 2023-06-29 19:51:10
Message-ID: 950a26ba-ed90-6b24-3335-05e401739873@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 6/29/23 10:12 AM, Jonathan S. Katz wrote:
> Hi,
>
> plan_create_index_workers[1] does not consider the amount of tuples
> existing in TOAST pages when determining the number of parallel workers
> to use for a build. The estimation comes from estimate_rel_size[2],
> which in this case, will just take the value from rel->rd_rel->relpages.
>
> We probably don't notice this much with B-trees, given a B-tree is
> typically used for data that does not require toasting. However, this
> becomes more visible when working on custom index access methods that
> implement their own parallel build strategy.
>
> For example, pgvector[3] provides its own data types and index access
> method for indexing vector data. Vectors can get quite large fairly
> quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes
> on disk, which quickly clears the default TOAST tuple threshold.
>
> In a recent patch proposal to allow for building indexes in parallel[4],
> I performed a few experiments on how many parallel workers would be
> spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with
> EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader
> participation, but the leader is not considered in
> plan_create_index_workers.
>
> With EXTENDED, plan_create_index_workers recommended 2 workers. The
> build time was ~2x faster than the serial build.
>
> With PLAIN, plan_create_index_workers recommended 4 workers. The build
> time was **~3X faster** than the serial build.
>
> (I've been doing more detailed, less hand-waivy performance testing, but
> I wanted to provide directional numbers here)
>
> It seems like we're leaving some performance for columns with TOASTed
> data that require indexing, so I wanted to propose allowing the pages in
> TOASTed tables to be considered when we're trying to index a column with
> TOASTed attributes.

Just to add to this: there is a lever to get more parallel workers by
setting "min_parallel_table_scan_size" to a lower value, which does help
in this case. However, it does mask the fact that a large chunk of the
data required to build the index exists in the TOAST table, which is not
intuitive to a user who rarely has to use tuning parameters.

Thanks,

Jonathan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2023-06-29 20:29:40 Re: pgsql: Fix search_path to a safe value during maintenance operations.
Previous Message Tristan Partin 2023-06-29 19:17:30 Re: Meson build updates