Re: Proposal: Adding compression of temporary files

From: Filip Janus <fjanus(at)redhat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal: Adding compression of temporary files
Date: 2024-11-18 21:58:27
Message-ID: CAFjYY+JJ3x-QUBpSYr5eTdapERhS9Nw3SEAH+QnBB=kypoXUJw@mail.gmail.com
Lists: pgsql-hackers

The attached v2 fixes a compiler warning caused by an uninitialized local variable.

-Filip-

On Thu, Nov 14, 2024 at 11:13 PM Filip Janus <fjanus(at)redhat(dot)com> wrote:

> Hi all,
> PostgreSQL supports data compression nowadays, but compression of
> temporary files has not been implemented yet. Large queries can produce a
> significant amount of temporary data that needs to be stored on disk and
> causes many expensive I/O operations.
> I am attaching a proposed patch that enables temporary file compression,
> for hash joins only for now. Initially, I chose the LZ4 compression
> algorithm. It would probably make more sense to start with pglz, but I
> realized that too late.
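>
> A minimal sketch of the block-mode approach (standalone C, not the actual
> patch code; the function names and the 8kB block size are illustrative):
>
> #include <lz4.h>
>
> #define RAW_BLOCK_SIZE 8192    /* one temp-file buffer */
>
> /*
>  * Compress one buffer before it is written to the temporary file.
>  * dst must be at least LZ4_compressBound(RAW_BLOCK_SIZE) bytes.
>  * Returns the compressed size, or 0 if compression failed.
>  */
> static int
> compress_block(const char *raw, char *dst, int dst_capacity)
> {
>     return LZ4_compress_default(raw, dst, RAW_BLOCK_SIZE, dst_capacity);
> }
>
> /*
>  * Decompress a block read back from the temporary file.
>  * Returns the decompressed size, or a negative value on error.
>  */
> static int
> decompress_block(const char *compressed, int compressed_len, char *raw)
> {
>     return LZ4_decompress_safe(compressed, raw, compressed_len,
>                                RAW_BLOCK_SIZE);
> }
>
> The compressed length has to be stored alongside each block in the file,
> since blocks become variable-sized once they are compressed.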
>
> # Future possible improvements
> Reduce the number of memory allocations during dumping and loading of the
> buffer. I have two ideas for solving this problem: either add a scratch
> buffer into struct BufFile, or have the caller provide the buffer as an
> argument. For sequential execution, I would prefer the second option; a
> rough illustration of both follows.
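>
> As an illustration of the two options (the field and function names below
> are hypothetical, not the actual BufFile definitions):
>
> /* Option 1: keep a scratch buffer inside the BufFile itself, allocated
>  * once when the file is created and reused for every dump/load. */
> struct BufFile
> {
>     /* ... existing fields ... */
>     char   *cbuffer;        /* scratch space for compressed data */
>     size_t  cbuffer_size;
> };
>
> /* Option 2: the caller owns the scratch buffer and passes it in, so a
>  * single allocation can be shared across many BufFiles of one executor
>  * node. */
> extern void BufFileDumpCompressed(BufFile *file,
>                                   char *scratch, size_t scratch_size);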
>
> # Future plan/open questions
> In the future, I would like to add support for pglz and zstd. Further, I
> plan to extend temporary file compression to sorting, GiST index
> creation, etc.
>
> Another open question is experimenting with the stream mode of the
> compression algorithms. The compression ratio of LZ4 in block mode seems
> satisfactory, and stream mode could produce a better ratio; on the other
> hand, it would consume more memory because the context for LZ4 stream
> compression has to be kept around.
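>
> A minimal sketch of what stream mode could look like (the LZ4 streaming
> calls are the library's real API; the wrapper names are illustrative):
>
> #include <lz4.h>
>
> #define RAW_BLOCK_SIZE 8192
>
> /* The context that must stay allocated for the whole temp file; this is
>  * where the extra memory cost of stream mode comes from. */
> static LZ4_stream_t *stream;
>
> static void
> stream_init(void)
> {
>     stream = LZ4_createStream();
> }
>
> static int
> stream_compress_block(const char *raw, char *dst, int dst_capacity)
> {
>     /*
>      * Unlike block mode, previously compressed blocks act as a dictionary
>      * here, which can improve the ratio.  The streaming API also expects
>      * the previous input to stay addressable (or LZ4_saveDict() to be
>      * used); that bookkeeping is omitted in this sketch.
>      */
>     return LZ4_compress_fast_continue(stream, raw, dst, RAW_BLOCK_SIZE,
>                                       dst_capacity, 1 /* acceleration */);
> }
>
> static void
> stream_done(void)
> {
>     LZ4_freeStream(stream);
> }
>
> The reading side would need a matching LZ4_streamDecode_t context, so the
> memory overhead applies to decompression as well.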
>
> # Benchmark
> I prepared three different databases to check expectations. Each dataset
> is described below. My testing demonstrates that the patch improves the
> execution time of huge hash joins, and the implementation should not
> negatively affect the performance of smaller queries. The amount of
> temporary file data written was reduced in every run without a
> significant impact on execution time.
>
> *## Dataset A:*
> Tables:
> table_a(bigint id, text data_text, integer data_number) - 10000000 rows
> table_b(bigint id, integer ref_id, numeric data_value, bytea data_blob) -
> 10000000 rows
> Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;
>
> The tables contain highly compressible data.
> The query showed a reduction in temporary file usage from ~20 GB to ~3 GB,
> which in turn reduced the query execution time by about 10 s.
>
>
> *## Dataset B:*
> Tables:
> table_a(integer id, text data_blob) - 1110000 rows
> table_b(integer id, text data_blob) - 10000000 rows
> Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;
>
> The tables contain less compressible data; data_blob was generated by a
> pseudo-random generator.
> In this case, the data reduction was only ~50%, and the execution time
> changed only slightly with compression enabled.
>
> The second scenario demonstrates that there is no overhead when
> compression is enabled and work_mem is extended enough to avoid temporary
> file usage.
>
> *## Dataset C:*
> Tables:
> customers(integer, text, text, text, text)
> order_items(integer, integer, integer, integer, numeric(10,2))
> orders(integer, integer, timestamp, numeric(10,2))
> products(integer, text, text, numeric(10,2), integer)
>
> Query:
> SELECT p.product_id, p.name, p.price,
>        SUM(oi.quantity) AS total_quantity,
>        AVG(oi.price) AS avg_item_price
> FROM eshop.products p
> JOIN eshop.order_items oi ON p.product_id = oi.product_id
> JOIN eshop.orders o ON oi.order_id = o.order_id
> WHERE o.order_date > '2020-01-01' AND p.price > 50
> GROUP BY p.product_id, p.name, p.price
> HAVING SUM(oi.quantity) > 1000
> ORDER BY total_quantity DESC
> LIMIT 100;
>
> This scenario should demonstrate a more realistic usage of the database.
> Enabled compression slightly reduced temporary file usage, and the
> execution time was not affected by compression.
>
>
>
> +---------+-------------+------------+---------------------+----------+
> | Dataset | Compression | temp_bytes | Execution time      | work_mem |
> +---------+-------------+------------+---------------------+----------+
> | A       | Yes         | 3.09 GiB   | 22s586ms            | 4MB      |
> | A       | No          | 21.89 GiB  | 35s                 | 4MB      |
> +---------+-------------+------------+---------------------+----------+
> | B       | Yes         | 333 MB     | 1815.545 ms         | 4MB      |
> | B       | No          | 146 MB     | 1500.460 ms         | 4MB      |
> | B       | Yes         | 0 MB       | 3262.305 ms         | 80MB     |
> | B       | No          | 0 MB       | 3174.725 ms         | 80MB     |
> +---------+-------------+------------+---------------------+----------+
> | C       | Yes         | 40 MB      | 1011.020 ms         | 1MB      |
> | C       | No          | 53 MB      | 1034.142 ms         | 1MB      |
> +---------+-------------+------------+---------------------+----------+
>
>
> Regards,
>
> -Filip-
>

Attachment Content-Type Size
0001-This-commit-adds-support-for-temporary-files-compres-v2.patch application/octet-stream 13.4 KB
