Re: jsonb format is pessimal for toast compression

From: Jan Wieck <jan(at)wi3ck(dot)info>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, Larry White <ljw1001(at)gmail(dot)com>
Subject: Re: jsonb format is pessimal for toast compression
Date: 2014-09-05 02:24:11
Message-ID: 54091ECB.1050100@wi3ck.info
Lists: pgsql-hackers

On 08/08/2014 10:21 AM, Andrew Dunstan wrote:
>
> On 08/07/2014 11:17 PM, Tom Lane wrote:
>> I looked into the issue reported in bug #11109. The problem appears to be
>> that jsonb's on-disk format is designed in such a way that the leading
>> portion of any JSON array or object will be fairly incompressible, because
>> it consists mostly of a strictly-increasing series of integer offsets.
>> This interacts poorly with the code in pglz_compress() that gives up if
>> it's found nothing compressible in the first first_success_by bytes of a
>> value-to-be-compressed. (first_success_by is 1024 in the default set of
>> compression parameters.)
>
> [snip]
>
>> There is plenty of compressible data once we get into the repetitive
>> strings in the payload part --- but that starts at offset 944, and up to
>> that point there is nothing that pg_lzcompress can get a handle on. There
>> are, by definition, no sequences of 4 or more repeated bytes in that area.
>> I think in principle pg_lzcompress could decide to compress the 3-byte
>> sequences consisting of the high-order 24 bits of each offset; but it
>> doesn't choose to do so, probably because of the way its lookup hash table
>> works:
>>
>> * pglz_hist_idx -
>> *
>> * Computes the history table slot for the lookup by the next 4
>> * characters in the input.
>> *
>> * NB: because we use the next 4 characters, we are not guaranteed to
>> * find 3-character matches; they very possibly will be in the wrong
>> * hash list. This seems an acceptable tradeoff for spreading out the
>> * hash keys more.
>>
>> For jsonb header data, the "next 4 characters" are *always* different, so
>> only a chance hash collision can result in a match. There is therefore a
>> pretty good chance that no compression will occur before it gives up
>> because of first_success_by.
>>
>> I'm not sure if there is any easy fix for this. We could possibly change
>> the default first_success_by value, but I think that'd just be postponing
>> the problem to larger jsonb objects/arrays, and it would hurt performance
>> for genuinely incompressible data. A somewhat painful, but not yet
>> out-of-the-question, alternative is to change the jsonb on-disk
>> representation. Perhaps the JEntry array could be defined as containing
>> element lengths instead of element ending offsets. Not sure though if
>> that would break binary searching for JSON object keys.
>>
>>
>
>
> Ouch.
>
> Back when this structure was first presented at pgCon 2013, I wondered
> if we shouldn't extract the strings into a dictionary, because of key
> repetition, and convinced myself that this shouldn't be necessary
> because in significant cases TOAST would take care of it.
>
> Maybe we should have pglz_compress() look at the *last* 1024 bytes if it
> can't find anything worth compressing in the first, for values larger
> than a certain size.
>
> It's worth noting that this is a fairly pathological case. AIUI the
> example you constructed has an array with 100k string elements. I don't
> think that's typical. So I suspect that unless I've misunderstood the
> statement of the problem we're going to find that almost all the jsonb
> we will be storing is still compressible.

I also think that a substantial part of the difficulty in coming up with a
"representative" data sample is that the size of the incompressible data at
the beginning is somewhat tied to the overall size of the datum itself. This
may or may not hold in any particular use case, but as a general rule of
thumb I would assume that the larger the JSONB document, the larger the
offset array at the beginning.
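
To make that concrete, here is a toy, standalone illustration (not
PostgreSQL code): it fills a buffer with a strictly increasing array of
4-byte offsets, roughly the shape the JEntry header takes for an array of
short strings, and brute-force checks whether any 4-byte sequence in the
first 1024 bytes has occurred earlier, which is the kind of backreference
pglz needs. The element size and count below are made-up numbers, not taken
from the bug report.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define NELEMS  100000      /* made-up element count */
    #define ELEMSZ  6           /* made-up per-element payload size */
    #define WINDOW  1024        /* pglz first_success_by default */

    int
    main(void)
    {
        static uint32_t offsets[NELEMS];
        const unsigned char *buf = (const unsigned char *) offsets;
        int         repeats = 0;

        /* strictly increasing ending offsets, one per element */
        for (int i = 0; i < NELEMS; i++)
            offsets[i] = (uint32_t) ((i + 1) * ELEMSZ);

        /* does the 4-byte string at p occur anywhere before p? */
        for (size_t p = 4; p + 4 <= WINDOW; p++)
        {
            for (size_t q = 0; q < p; q++)
            {
                if (memcmp(buf + p, buf + q, 4) == 0)
                {
                    repeats++;
                    break;
                }
            }
        }

        printf("4-byte repeats in the first %d header bytes: %d\n",
               WINDOW, repeats);
        return 0;
    }

On a little-endian build this prints zero: within the window pglz inspects
before giving up there is nothing to back-reference, and since the offset
array has one entry per element, that dead zone grows in step with the
document size.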

Would changing 1024 to a fraction of the datum length for the time being
give us enough room to come up with a proper solution for 9.5?
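
For what it's worth, a minimal sketch of what that could look like, assuming
we key off the input length pglz_compress() already receives. first_success_by
is the real PGLZ_Strategy knob, but the helper below, the 1/8 fraction, and
the clamping are illustrative assumptions only, not a proposed patch:

    #include <stdint.h>

    #define PGLZ_DEFAULT_FIRST_SUCCESS_BY   1024    /* today's fixed value */

    /*
     * Hypothetical: give up on finding a first match only after a fraction
     * of the datum has been scanned, never earlier than the current 1024
     * bytes and never later than the end of the input.
     */
    static int32_t
    give_up_threshold(int32_t slen)
    {
        int32_t     by = slen / 8;      /* assumed fraction: 1/8 of datum */

        if (by < PGLZ_DEFAULT_FIRST_SUCCESS_BY)
            by = PGLZ_DEFAULT_FIRST_SUCCESS_BY;
        if (by > slen)
            by = slen;
        return by;
    }

The downside Tom already mentioned still applies: for genuinely
incompressible data, a larger give-up window just means more wasted work
before pglz bails out.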

Regards,
Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info
