From: Nikhil Kumar Veldanda <veldanda(dot)nikhilkumar17(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: ZStandard (with dictionaries) compression support for TOAST compression
Date: 2025-03-06 20:59:01
Message-ID: CAFAfj_GACKVftwuRjy3Ls-1Xc3ojUUbVh=Rm7KpRuYbaS=uLPg@mail.gmail.com
Lists: pgsql-hackers
Hi Robert,
> I think that solving the problems around using a dictionary is going
> to be really hard. Can we see some evidence that the results will be
> worth it?
With the latest patch I've shared, using a Kaggle dataset of
Nintendo-related tweets [1], we leveraged PostgreSQL's
acquire_sample_rows function to quickly gather just 1,000 sample rows
for a specific attribute out of 104,695 rows. These raw samples were
passed to zstd's dictionary trainer as the sample buffer, which
generated a custom dictionary. That dictionary was then used directly
to compress the documents, yielding roughly 62% space savings after
compression:
```
test=# \dt+
                                       List of tables
 Schema |      Name      | Type  |  Owner   | Persistence | Access method |  Size  | Description
--------+----------------+-------+----------+-------------+---------------+--------+-------------
 public | lz4            | table | nikhilkv | permanent   | heap          | 297 MB |
 public | pglz           | table | nikhilkv | permanent   | heap          | 259 MB |
 public | zstd_with_dict | table | nikhilkv | permanent   | heap          | 114 MB |
 public | zstd_wo_dict   | table | nikhilkv | permanent   | heap          | 210 MB |
(4 rows)
```
We've observed similarly strong results using dictionaries on other
datasets as well; the zstd calls involved are sketched below.
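The dictionary training and per-datum compression described above map
onto zstd's standard dictionary API. Here is a minimal standalone
sketch of that flow, separate from the patch itself; the fabricated
JSON samples, dictionary capacity, and compression level are purely
illustrative:
```
/*
 * Standalone sketch (not the patch itself) of the zstd dictionary flow
 * described above: concatenate the sampled values into one buffer, train
 * a dictionary with ZDICT_trainFromBuffer(), then compress individual
 * datums with ZSTD_compress_usingDict().  Build with -lzstd.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>
#include <zdict.h>

#define NB_SAMPLES 1000

int
main(void)
{
    static char rows[NB_SAMPLES][128];
    size_t  sampleSizes[NB_SAMPLES];
    size_t  totalSize = 0;

    /* Fabricate ~1,000 small JSON "rows" standing in for the sampled tuples. */
    for (int i = 0; i < NB_SAMPLES; i++)
    {
        sampleSizes[i] = (size_t) snprintf(rows[i], sizeof(rows[i]),
                                           "{\"id\": %d, \"lang\": \"en\", "
                                           "\"text\": \"Nintendo tweet number %d\"}",
                                           i, i);
        totalSize += sampleSizes[i];
    }

    /* ZDICT expects all samples concatenated back to back in one buffer. */
    char   *sampleBuf = malloc(totalSize);
    size_t  off = 0;

    for (int i = 0; i < NB_SAMPLES; i++)
    {
        memcpy(sampleBuf + off, rows[i], sampleSizes[i]);
        off += sampleSizes[i];
    }

    /* Train a small custom dictionary from the samples. */
    size_t  dictCapacity = 4096;
    void   *dict = malloc(dictCapacity);
    size_t  dictSize = ZDICT_trainFromBuffer(dict, dictCapacity,
                                             sampleBuf, sampleSizes,
                                             NB_SAMPLES);

    if (ZDICT_isError(dictSize))
    {
        fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dictSize));
        return 1;
    }

    /* Compress one datum with and without the dictionary to compare. */
    const char *datum = rows[42];
    size_t  srcSize = strlen(datum);
    size_t  dstCapacity = ZSTD_compressBound(srcSize);
    void   *dst = malloc(dstCapacity);
    ZSTD_CCtx *cctx = ZSTD_createCCtx();

    size_t  plain = ZSTD_compressCCtx(cctx, dst, dstCapacity,
                                      datum, srcSize, 3);
    size_t  withDict = ZSTD_compress_usingDict(cctx, dst, dstCapacity,
                                               datum, srcSize,
                                               dict, dictSize, 3);

    if (ZSTD_isError(plain) || ZSTD_isError(withDict))
    {
        fprintf(stderr, "compression failed\n");
        return 1;
    }

    printf("raw %zu bytes, no dict %zu bytes, with dict %zu bytes (dict %zu bytes)\n",
           srcSize, plain, withDict, dictSize);

    ZSTD_freeCCtx(cctx);
    free(dst);
    free(dict);
    free(sampleBuf);
    return 0;
}
```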
[1] https://www.kaggle.com/code/dcalambas/nintendo-tweets-analysis/data
---
Nikhil Veldanda