Re: Compressing temporary files

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Chainani, Naresh" <nareshkc(at)amazon(dot)com>
Subject: Re: Compressing temporary files
Date: 2021-10-08 13:47:22
Message-ID: CALj2ACVKEB7dbHfWXLvAjCk8yaQJxnZcyj1yAkJjUe_6Vj5-3Q@mail.gmail.com
Lists: pgsql-hackers

On Sat, Sep 11, 2021, 6:01 PM Andrey Borodin <x4mmm(at)yandex-team(dot)ru> wrote:
>
> Hi hackers!
>
> There's a lot of compression discussions nowadays. And that's cool!
> Recently, in a private discussion, Naresh Chainani shared with me the idea of compressing temporary files on disk.
> And I was thrilled to find no evidence of an existing implementation of this interesting idea.
>
> I've prototyped a Random Access Compressed File for fun [0]. The code is a very dirty proof of concept.
> I compress a BufFile one block at a time. There are directory pages that store the size of each compressed block. If any byte of a block is changed, the whole block is recompressed. Wasted space is never reused. If a compressed block ends up larger than BLCKSZ, unknown bad things will happen :)
>
> Here are some of my observations.
>
> 0. The idea seems feasible. The fd.c API used by buffile.c can easily be abstracted for compressed temporary files. Seeks are necessary, but they are not very frequent. It's easy to make temp file compression GUC-controlled.
>
> 1. The temp file footprint can easily be reduced. For example, the query
> create unlogged table y as select random()::text t from generate_series(0,9999999) g;
> uses 140000000 bytes of temp files for the TOAST index build. With the patch, this is reduced to 40841704 bytes (3.42x smaller).
>
> 2. I have not found any evidence of a performance improvement. I've only benchmarked the patch on my laptop, and RAM (the page cache) diminished any difference between writing compressed and uncompressed blocks.
>
> What do you think: is the idea worth pursuing? OLTP systems rarely rely on data spilled to disk.
> Are there any known good random-access compressed file libraries, so we could avoid reinventing the wheel?
> Has anyone tried this approach before?
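
Just to make sure I follow the on-disk layout you describe above, here
is a minimal sketch of what I have in mind; the type and GUC names
below are mine, not taken from your patch, and it assumes the usual
PostgreSQL headers:

/* Hypothetical sketch, not code from the actual patch. */
#include "postgres.h"

/* One directory entry per BLCKSZ-sized logical block of the BufFile. */
typedef struct CompressedBlockEntry
{
    off_t   offset;     /* where the compressed data starts in the temp file */
    uint32  length;     /* compressed size; assumed to stay <= BLCKSZ */
} CompressedBlockEntry;

/*
 * Directory page: an array of such entries, so that a seek to logical
 * block N needs only one directory lookup before reading and
 * decompressing that single block.
 */
typedef struct CompressedBlockDirectory
{
    uint32  nblocks;
    CompressedBlockEntry entries[FLEXIBLE_ARRAY_MEMBER];
} CompressedBlockDirectory;

/*
 * On modification, the whole logical block would be recompressed and
 * appended at the end of the file (the old space is wasted, never
 * reused), and its directory entry updated. A boolean GUC (say,
 * temp_file_compression) could turn the behavior on and off.
 */

Is that roughly the shape of it?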

Are you proposing to compress the temporary files created by postgres
processes under $PGDATA/base/pgsql_tmp? Are there any other
directories that postgres processes would write temporary files to?

Are you proposing to compress the temporary files that get generated
during the execution of queries? IIUC, the temp files under the
pgsql_tmp directory get cleaned up at the end of each transaction,
right? In what situations would the temporary files under the
pgsql_tmp directory remain even after the transactions that created
them have committed or aborted? Here's one scenario: if a backend
crashes while executing a huge analytic query, I can understand that
the temp files would remain in pgsql_tmp, and we have commit [1]
cleaning them up on restart. Are there any other scenarios that fill
up the pgsql_tmp directory?

[1] commit cd91de0d17952b5763466cfa663e98318f26d357
Author: Tomas Vondra <tomas(dot)vondra(at)postgresql(dot)org>
Date: Thu Mar 18 16:05:03 2021 +0100

Remove temporary files after backend crash

Regards,
Bharath Rupireddy.
