Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)

From: Ron <ronljohnsonjr(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)
Date: 2022-09-06 20:52:43
Message-ID: 6a0551f5-3930-5cff-2886-80396f39f803@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

By "SEALED", do you mean "READ ONLY"?

On 9/6/22 14:39, Levi Aul wrote:
> My company runs some large OLAP data warehouses with append-only,
> time-partitioned datasets. Our workloads involve aggregations and joins,
> and query the data in ways not amenable to constraint-exclusion; and we
> serve a high concurrent number of these queries at once from a single DB.
>
> In other words, our  workload is inherently one that acquires "way too
> many locks." Our largest performance bottleneck, according to
> pg_wait_sampling, is the LockManager itself. Despite most of our queries
> spending only milliseconds actually executing, they often spend seconds
> during planning waiting to acquire hundreds of access-shared locks.
>
> Given that our datasets are append-only, all our partitions for each table
> save for the one "active" one (the one for the current time period) are
> effectively immutable. No DML-triggered writes will occur to these. I
> think this is pretty common in data-warehouse use-cases of PG.
>
> If PG could avoid the need to acquire the locks for these
> effectively-immutable partitions, then the remaining number of tables
> would be low enough to fit into the per-backend LWLock slots set, and so
> avoid LockManager contention. I believe this could be a large optimization
> not just for our use-case, but in a number of other high-concurrency OLAP
> use-cases.
>
> My proposal for how this "lock elision under large numbers of immutable
> partitions" could be accomplished:
>
> 1. Add some DDL statement to mark partitions as sealed/unsealed. (ALTER
> TABLE ... SEAL PARTITION foo)
> 2. When query-planning DML against a partition or a partitioned table,
> treat a sealed partition as if it had an always-false check constraint.
> 2. Define a "locking group" abstraction, where many entities can register
> themselves under the same lock, such that access to all members of the
> locking group requires only acquiring the single locking-group lock. All
> sealed partitions of the same table would share a locking group.
>
> Under such a setup, querying a time-based partitioned table with one
> active (unsealed) partition would only ever require acquiring, at most,
> two locks — the one for the active partition, and the one for the
> sealed-partitions locking group.
>
> The trade-off for this is that acquiring an exclusive-access lock on the
> sealed-partitions locking-group for a table becomes much more expensive
> than it would have been to acquire for a single partition. But this isn't
> a problem in practice, because hot-path operations that take an
> exclusive-access lock (DML writes) are disallowed against sealed
> partitions. The only time the lock-group would need to be exclusive-access
> acquired, would be to change its membership — an administrative DDL operation.
>
> Besides being useful operationally, such a mechanism would also be helpful
> on the business-logic level, as you can rely on partition sealing to turn
> accidental insertions of new data into any but the active partition(s)
> into a constraint violation. (Currently, to achieve this, separate
> triggers need to be maintained on each sealed partition.)
>
> And, with knowledge of the administrative intent for a table to be
> immutable, further operational optimizations could be performed. A few off
> the top of my head:
>
> 1. Running CLUSTER or VACUUM (FULL, FREEZE) after the partition is marked
> as immutable, could rewrite the table using an implicit "heap_immutable"
> access method (still reported as "heap"), which would drop the min_xid
> column (as everything in a sealed table is guaranteed to be
> always-visible), and thus remove the table for consideration for
> xid-wraparound-protection rewriting. Such partitions would then require a
> rewrite back to "heap" if unsealed.
>
> 2. Alternatively, such storage-rewriting DDL statements could switch the
> table — and its indices — over to using an entirely different
> access-methods, which would store the data+indices in "perfect" packed
> forms, to maximize read performance while also minimizing disk usage.
>
> 3. ANALYZE could have an (otherwise-impractical) "EXACT" argument, to
> populate statistics with exact aggregate values, requiring reading all
> rows rather than sampling rows. This could pre-bake table-level aggregates
> for most columns, like having a single, table-sized BRIN block-range.
>
> If this concept of "marking as sealed" were extended to tables rather than
> only partitions, then further work could be done related to optimization
> of bulk loads — e.g. having CREATE TABLE AS ... SEALED not generate WAL
> segments for the table as it is populated, but rather treat the table as
> UNLOGGED during population, and then, after creation, take the entire
> finalized/sealed table's backing files and either pass them directly to
> archive_command / send them directly to WAL receivers; or split+stream
> them into retrospective WAL segments (each segment containing a single
> "put this 16MB of data into this file at this position" op), and send those.

--
Angular momentum makes the world go 'round.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Dirschel, Steve 2022-09-06 21:02:59 RE: [EXT] Re: log_min_messages = warning
Previous Message Tom Lane 2022-09-06 20:44:29 Re: [EXT] Re: log_min_messages = warning