Compressing large column by moving it to a unique table

From: Adrian Garcia Badaracco <adrian(at)adriangb(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Compressing large column by moving it to a unique table
Date: 2024-02-13 03:11:34
Message-ID: CAE8z92FeuuTUb65=Dyz7KnLhTOFx43ya=S61ReY5768kJniZBA@mail.gmail.com
Lists: pgsql-general

I am using Timescale so I'll be mentioning some timestamp stuff but I think
this is a general postgres question for the most part.

I have a table with some fixed, small columns (id, timestamp, etc) and a
large JSONB column (let's call it `attributes`). `attributes` has 1000s of
schemas, but given a schema, there's a lot of duplication. Across all rows,
more than 99% of the data is duplicated (as measured by `count(attributes)`
vs `count(distinct attributes)`).

I can't normalize `attributes` into real columns because it is quite
variable (remember 1000s of schemas).

My best idea is to make a table like `(day timestamptz, hash text,
attributes jsonb)` and then in my original table replace `attributes` with
a reference to `new_table`. I can then make a view that joins them: `select
original.timestamp, new_table.attributes from original join new_table
on (time_bucket('1 day', original.timestamp) = new_table.day AND
original.hash = new_table.hash)`, or something like that. The idea of
bucketing into 1-day chunks is to balance write and read speed (by relying
on Timescale to do efficient time partitioning, data retention, etc.).

I recognize this is essentially creating a key-value store in postgres and
also janky compression, so I am cautious about it.
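For the write path, what I'm picturing is hashing the canonical `jsonb`
text (equal `jsonb` values render to identical text, so
`md5(attributes::text)` is a stable content hash) and letting
`ON CONFLICT DO NOTHING` handle the dedup — this assumes a unique
constraint on `(day, hash)`:

```sql
-- Insert each distinct (day, hash) once; duplicates are no-ops.
-- Requires a unique constraint or primary key on (day, hash).
INSERT INTO new_table (day, hash, attributes)
SELECT time_bucket('1 day', ts), md5(attrs::text), attrs
FROM (VALUES (now(), '{"http.status_code": 422}'::jsonb)) AS v(ts, attrs)
ON CONFLICT (day, hash) DO NOTHING;
```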

Benchmarks showed huge wins (26x runtime, 100x buffers) for a selective
query on `attributes`:

SELECT count(*)
FROM joined_view
WHERE (attributes->'http.status_code')::int = 422
  AND start_timestamp > (now() - interval '2 day')
  AND start_timestamp < (now() - interval '1 day');

As expected `SELECT * FROM joined_view` is slower than `SELECT * FROM
original`, but not by much (5x slower, 2x buffers, but they're both slow).

What I was sad to realize is that `SELECT start_timestamp FROM
joined_view` is also slow, because the query planner can't know that the
join becomes a no-op when the `attributes` column isn't selected (it has
to assume there might be more than one match per left row, which would
change the result even though `attributes` isn't referenced). I don't
know that this is a deal breaker, but it's certainly a bummer.
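One thing I'm considering to work around this: if the view uses a LEFT
JOIN and `(day, hash)` is unique on `new_table`, the planner's join
removal should be able to drop the join entirely for queries that don't
reference `attributes`. A sketch (the LEFT JOIN does change semantics
slightly — rows with no match get NULL `attributes` instead of being
dropped):

```sql
-- Uniqueness on the inner side is what makes join removal legal.
ALTER TABLE new_table
  ADD CONSTRAINT new_table_day_hash_key UNIQUE (day, hash);

CREATE OR REPLACE VIEW joined_view AS
SELECT o.*, n.attributes
FROM original o
LEFT JOIN new_table n
  ON n.day  = time_bucket('1 day', o.start_timestamp)
 AND n.hash = o.hash;

-- EXPLAIN SELECT start_timestamp FROM joined_view;
-- should now show a plan with no join at all.
```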

Does anyone have any suggestions for this sort of situation?

Thank you,

Adrian
