[WIP]Vertical Clustered Index (columnar store extension) - take2

From: "Aya Iwata (Fujitsu)" <iwata(dot)aya(at)fujitsu(dot)com>
To: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: [WIP]Vertical Clustered Index (columnar store extension) - take2
Date: 2024-10-07 14:53:19
Message-ID: OS7PR01MB119648CA4E8502FE89056E56EEA7D2@OS7PR01MB11964.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi All,

Suggestions
==========

When analyzing real-time data collected by PostgreSQL,
it can be difficult to tune the current PostgreSQL server for satisfactory performance.
Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory column store function that holds data in a state suitable for business analysis and is also expected to improve analysis performance.
With VCI, you can also expect to run analysis 7.8 times faster. This is achieved by the analytics engine, which optimizes parallel processing of column-oriented data, in addition to the fact that VCI stores data in a columnar format, enabling efficient retrieval of the columns needed for analysis.

Similar Features
============

One column store feature available with postgres is Citus Columnar Table.
If you introduces the citus extension, which allows columnar tables to be used using the columnar access method.
This function is intended to analyze the accumulated data. Therefore, you cannot update or delete data.
VCI supports data updates and deletions. This enables you to analyze not only the accumulated data but also the data that occurs in real time.

Implementing VCI
============

To introduce an updatable column store, we explain how updates to row-oriented data are propagated to column-oriented data.

VCI has two storage areas.

- Write Optimized Storage (WOS)
- Read Optimized Storage (ROS)

Describes WOS.
The WOS stores data for all columns in the VCI in a row-oriented format.
All newly added data is stored in the WOS relation along with the transaction information.
Using WOS to delete and update newly added data has no significant performance impact compared to deleting from columnar storage.

ROS is the storage area where all column data is stored.

When inserting/updating/deleting, data is written synchronously to WOS. It does not compress or index the data.
This avoids the overhead of converting to a columnar while updating the data.
After a certain amount of data accumulates in the WOS, the ROS control daemon converts it to column data asynchronously with updates.
Column data transformation compresses and indexes the data and writes it to ROS.

Describes searching for data.
Since there are two storage formats, the SELECT process needs to convert the WOS data to local ROS to determine whether it is visible or invisible. This conversion cost depends on the number of tuples present in the WOS file. This may introduce some performance overhead.
Obtain search results by referencing the local ROS and referencing the ROS in parallel.

These implementation ideas are also posted on Fujitsu's blog for your reference. [1]

Past discussions
===========

We've proposed features before. [2]
This thread also explains the details, so please check it.

In a previous thread, we suggested implementing a modification to the PostgreSQL backend code.
Based on the FB we received at that time, we think we need to re-implement this feature in pluggable storage using the table access method API.
I also got a FB of the features I needed from a PostgreSQLPro member. We believe it is necessary to deal with these issues in stages.
- Need to provide vector processing for nodes (filter, grand aggregate, aggregation with group by...) to speed up computation
- Requires parallel processing support such as scanning

It is assumed that the re-implementation will also require additional functionality to the current Table Access Method API.
It is useful not only for VCI but also for other access methods.
Therefore, we decided to propose the VCI feature to the community and proceed with development.

Request matter
===========

Are members of the PostgreSQL hackers interested in VCI features?
We welcome your comments and suggestions on this feature.
In particular, if you have any questions, required features, or implementations, please let me know.

[1] https://www.postgresql.fastware.com/blog/improve-data-analysis-performance-without-impacting-business-transactions-with-vertical-clustered-index

[2]https://www.postgresql.org/message-id/CAJrrPGfaC7WC9NK6PTTy6YN-NN+hCy8xOLAh2doYhVg5d6HsAA@mail.gmail.com

Regards,
Aya Iwata

FUJITSU LIMITED

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-10-07 15:08:49 Re: On disable_cost
Previous Message Tomas Vondra 2024-10-07 14:46:25 Re: Changing the state of data checksums in a running cluster