Proposal: Adding compression of temporary files

From: Filip Janus <fjanus(at)redhat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Proposal: Adding compression of temporary files
Date: 2024-11-14 22:13:16
Message-ID: CAFjYY+LAxqRpaJpukBx9KjeyhZvMu6O4xcCUU9gw1SFGsUTcDA@mail.gmail.com
Lists: pgsql-hackers

Hi all,
PostgreSQL supports data compression nowadays, but compression of temporary files has not been implemented yet. Huge queries can produce a significant amount of temporary data that has to be stored on disk, causing many expensive I/O operations.
I am attaching a proposed patch that enables temporary file compression, for hash joins only for now. Initially, I chose the LZ4 compression algorithm. It would probably make more sense to start with pglz, but I realized that too late.
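To make the mechanism concrete, here is a minimal sketch of block-mode LZ4 compression of a dump buffer (illustrative only, not the patch itself; the function name and the length-prefix framing are my assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <lz4.h>

/* Compress one dump buffer and write it with a length prefix so the
 * reader knows how many compressed bytes to consume before
 * decompressing. */
static void
dump_compressed_block(FILE *file, const char *src, int srclen)
{
    int   bound = LZ4_compressBound(srclen);
    char *dst = malloc(bound);
    int   complen;

    if (dst == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }

    complen = LZ4_compress_default(src, dst, srclen, bound);
    if (complen <= 0)
    {
        fprintf(stderr, "LZ4 compression failed\n");
        exit(1);
    }

    fwrite(&complen, sizeof(complen), 1, file);
    fwrite(dst, 1, complen, file);
    free(dst);
}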

# Future possible improvements
Reducing the number of memory allocations within the dumping and loading of the buffer. I have two ideas for solving this problem: either add a buffer into struct BufFile, or have the caller provide the buffer as an argument. For sequential execution, I would prefer the second option.
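For illustration, the two options could look roughly like this (the names are made up for the sketch, not taken from the patch):

/* Option 1: keep a compression scratch buffer inside BufFile,
 * allocated once when the temporary file is created. */
typedef struct BufFileSketch
{
    char   *cbuffer;        /* scratch space for compressed data */
    int     cbuffer_size;
    /* ... the existing BufFile fields ... */
} BufFileSketch;

/* Option 2: the caller owns the scratch buffer and passes it in,
 * so a single buffer can be reused across many dump/load calls
 * during sequential execution. */
extern void BufFileDumpCompressed(BufFileSketch *file,
                                  const char *data, int len,
                                  char *scratch, int scratch_size);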

# Future plan/open questions
In the future, I would like to add support for pglz and zstd. Further, I plan to extend temporary file compression to sorting, GiST index creation, etc.

Experimenting with the stream mode of compression algorithms: the compression ratio of LZ4 in block mode seems satisfying, and stream mode could produce a better ratio, but it would consume more memory because the context for LZ4 stream compression has to be kept around.
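As a rough idea of what stream mode would involve, here is a sketch using the LZ4 streaming API with a double buffer (illustrative only; the struct and function names are mine, and a real implementation would also need matching decompression state on the read side):

#include <string.h>
#include <lz4.h>

#define BLOCK_SIZE 8192

typedef struct StreamCompressor
{
    LZ4_stream_t *stream;                 /* per-file streaming context */
    char          history[2][BLOCK_SIZE]; /* previous block must stay valid for LZ4 */
    int           current;
} StreamCompressor;

static void
stream_compressor_init(StreamCompressor *sc)
{
    sc->stream = LZ4_createStream();
    sc->current = 0;
}

/* Compress the next block (srclen is assumed <= BLOCK_SIZE).  Later
 * blocks may reference earlier ones through the stream context, which
 * is what can improve the compression ratio. */
static int
compress_next_block(StreamCompressor *sc, const char *src, int srclen,
                    char *dst, int dstcap)
{
    char *slot = sc->history[sc->current];

    memcpy(slot, src, srclen);
    sc->current ^= 1;                     /* flip buffers for the next call */

    return LZ4_compress_fast_continue(sc->stream, slot, dst,
                                      srclen, dstcap, 1);
}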

# Benchmark
I prepared three different databases to check my expectations. Each dataset is described below. My testing demonstrates that the patch improves the execution time of huge hash joins, and that the implementation should not negatively affect the performance of smaller queries. The disk space needed for temporary files was reduced in every execution, without a significant impact on execution time.

## Dataset A
Tables:
table_a(bigint id, text data_text, integer data_number) - 10000000 rows
table_b(bigint id, integer ref_id, numeric data_value, bytea data_blob) - 10000000 rows
Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;

The tables contain highly compressible data.
The query demonstrated a reduction in temporary file usage from ~20 GB to ~3 GB; this reduction also cut the execution time of the query by about 10 s.

## Dataset B
Tables:
table_a(integer id, text data_blob) - 1110000 rows
table_b(integer id, text data_blob) - 10000000 rows
Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;

The tables contain less compressible data; data_blob was generated by a pseudo-random generator.
In this case, the data reduction was only ~50%, and the execution time was reduced only slightly with compression enabled.

The second scenario demonstrates that there is no overhead when compression is enabled and work_mem is extended enough to avoid temporary file usage.

## Dataset C
Tables:
customers (integer,text,text,text,text)
order_items(integer,integer,integer,integer,numeric(10,2))
orders(integer,integer,timestamp,numeric(10,2))
products(integer,text,text,numeric(10,2),integer)

Query: SELECT p.product_id, p.name, p.price,
       SUM(oi.quantity) AS total_quantity, AVG(oi.price) AS avg_item_price
FROM eshop.products p
JOIN eshop.order_items oi ON p.product_id = oi.product_id
JOIN eshop.orders o ON oi.order_id = o.order_id
WHERE o.order_date > '2020-01-01' AND p.price > 50
GROUP BY p.product_id, p.name, p.price
HAVING SUM(oi.quantity) > 1000
ORDER BY total_quantity DESC
LIMIT 100;

This scenario should demonstrate a more realistic usage of the database.
Enabled compression slightly reduced temporary file usage, but the execution time wasn't affected by compression.

+---------+-------------+------------+----------------+----------+
| Dataset | Compression | temp_bytes | Execution Time | work_mem |
+---------+-------------+------------+----------------+----------+
| A       | Yes         | 3.09 GiB   | 22s586ms       | 4MB      |
|         | No          | 21.89 GiB  | 35s            | 4MB      |
+---------+-------------+------------+----------------+----------+
| B       | Yes         | 333 MB     | 1815.545 ms    | 4MB      |
|         | No          | 146 MB     | 1500.460 ms    | 4MB      |
|         | Yes         | 0 MB       | 3262.305 ms    | 80MB     |
|         | No          | 0 MB       | 3174.725 ms    | 80MB     |
+---------+-------------+------------+----------------+----------+
| C       | Yes         | 40 MB      | 1011.020 ms    | 1MB      |
|         | No          | 53 MB      | 1034.142 ms    | 1MB      |
+---------+-------------+------------+----------------+----------+

Regards,

-Filip-

Attachment Content-Type Size
0001-This-commit-adds-support-for-temporary-files-compres.patch application/octet-stream 13.4 KB
