Re: lztext and compression ratios...

From: JanWieck(at)t-online(dot)de (Jan Wieck)
To: Jeffery Collins <collins(at)onyx-technologies(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: lztext and compression ratios...
Date: 2000-07-05 22:16:43
Message-ID: 200007052216.AAA12189@hot.jw.home
Lists: pgsql-general pgsql-hackers pgsql-sql

Jeffery Collins wrote:
> I have been looking at using the lztext type and I have some
> questions/observations. Most of my experience comes from attempting to
> compress text records in a different database (CTREE), but I think the
> experience is transferable.

First of all, I welcome any suggestions/input on the
compression code I've implemented. You seem to be experienced
in this area, so keep going and hammer away at any of my
points below.

You should know that the "lztext" type will disappear soon
and be replaced with the general "all varlena types are
potentially compressible" approach of TOAST.

> My typical table consists of variable length text records. The average
> length record is around 1K bytes. I would like to compress my records
> to save space and improve I/O performance (smaller records means more
> records fit into the file system cache which means less I/O - or so the
> theory goes). I am not too concerned about CPU as we are using a 4-way
> Sun Enterprise class server. So compress seems like a good idea to me.
>
> My experience with attempting to compress such a relatively small
> (around 1K) text string is that the compression ratio is not very
> good. This is because the string is not long enough for the LZ
> compression algorithm to establish really good compression patterns and
> the fact that the de-compression table has to be built into each
> record. What I have done in the past to get around these problems is
> that I have "taught" the compression algorithm the patterns ahead of
> time and stored the de-compression patterns in an external table. Using
> this technique, I have achieved *much* better compression ratios.

The compression algorithm used in "lztext" (and so in TOAST
in the future) doesn't have a de-compression table at all.
It's based on Adisak Pochanayon's SLZ algorithm, using
<literal_char> or <token>. A <token> just tells how far to
go back in the OUTPUT-buffer and how many bytes to copy from
OUTPUT to OUTPUT. Look at the code for details.
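To make the scheme concrete, here's a minimal C sketch of that kind of
table-free decompression. The names (LzItem, lz_decompress) and the
item layout are made up for illustration; this is the general
back-reference idea, not the actual pg_lzcompress format or code:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical decoded stream element: either a single literal byte,
 * or a token (offset, length) pointing back into the output already
 * produced. No dictionary or decompression table is stored anywhere. */
typedef struct
{
    int             is_literal;
    unsigned char   literal;    /* valid if is_literal */
    int             offset;     /* how far back in the OUTPUT buffer */
    int             length;     /* how many bytes to copy */
} LzItem;

/* Decompress a stream of items into buf; returns bytes written. */
static int
lz_decompress(const LzItem *items, int nitems, unsigned char *buf)
{
    int     out = 0;
    int     i, j;

    for (i = 0; i < nitems; i++)
    {
        if (items[i].is_literal)
            buf[out++] = items[i].literal;
        else
        {
            /* Byte-wise copy from OUTPUT to OUTPUT. Source and
             * destination may overlap; that's how a run like
             * "aaaa..." is encoded as a token with offset 1. */
            for (j = 0; j < items[i].length; j++, out++)
                buf[out] = buf[out - items[i].offset];
        }
    }
    return out;
}
```

For example, "abcabcabc" could be encoded as three literals 'a', 'b',
'c' followed by a single token (offset 3, length 6) - the copy loop
rereads bytes it has just written.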

My design rules for the compression inside of Postgres have
been

- being fast on decompression
- being good for relatively small values
- being fast on compression

The first rule is met by the implementation itself. Don't
underestimate this design rule! Usually you don't update as
often as you query. And the implementation of TOAST requires
a super fast decompression.

The second rule is met by not needing any initial
decompression table inside of the stored value.

The third rule is controlled by the default strategy of the
algorithm, (unfortunately) hardcoded into
utils/adt/pg_lzcompress.c. It'll never try to compress items
smaller than 256 bytes. It'll fall back to plain storage (for
a speed advantage while decompressing a value) if less than
20% of compression is gained. It'll stop match lookup if a
backward match of 128 or more bytes is found.
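Those three thresholds could be sketched like this - the macro and
function names here are invented for illustration, only the numbers
(256 bytes, 20%, 128 bytes) come from the description above:

```c
#include <assert.h>

#define LZ_MIN_INPUT    256     /* never try to compress below this  */
#define LZ_MIN_RATE     20      /* need at least 20% space savings   */
#define LZ_GOOD_MATCH   128     /* stop match lookup at this length  */

/* Decide whether to store the compressed form or fall back to the
 * plain value (which decompresses "for free" later). */
static int
worth_storing_compressed(int rawsize, int compsize)
{
    if (rawsize < LZ_MIN_INPUT)
        return 0;               /* too small: store plain */

    /* require compsize <= 80% of rawsize, i.e. at least 20% saved */
    return compsize * 100 <= rawsize * (100 - LZ_MIN_RATE);
}
```

So a 1K string that only shrinks to 900 bytes would be stored plain,
while one that shrinks to 800 bytes or less would be stored compressed.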

> So my questions/comments are:
>
> - What are the typical compression ratios on relatively small (i.e.
> around 1K) strings seen with lztext?

I don't have such small items handy. But here's a table
containing a file path and its content. All files were HTML
files.

 From - To     | count(*) | avg(length) | avg(octet_length)
---------------+----------+-------------+------------------
  1024 - 2047  |       14 |        1905 |              1470
  2048 - 4095  |       67 |        3059 |              1412
  4096 - 8191  |       45 |        5384 |              2412
  8192 -       |       25 |       17200 |              6323
---------------+----------+-------------+------------------
 all           |      151 |        5986 |              2529

> - Does anyone see a need/use for a generalized string compression
> type that can be "trained" external to the individual records?

Yes, of course. Maybe "lztext" can be a framework for you, and
we just tell the toaster "never apply your lousy compression
on that" (it's prepared for that).

> - Am I crazy in even attempting to compress strings of this relative
> size? My largest table currently contains about 2 million entries of
> roughly 1k size strings or about 2Gig of data. If I could compress this
> to about 33% of its original size (not unreasonable with a trained LZ
> compression), I would save a lot of disk space (not really important)
> and a lot of file system cache space (very important) and be able to fit
> the entire table into memory (very, very important).

No one is crazy for attempting to improve something. It might
turn out not to work well, or to be brain damaged from the
start. But someone who never tries will miss all his chances.

Final note:

One thing to keep in mind is that the LZ algorithm you're
thinking of must be distributable under the terms of the BSD
license. If it's copyrighted or patented by a third party
that doesn't agree to these terms, it's out of the question
and will never appear in the Postgres source tree. The LZW
algorithm used in GIF in particular is one of these
show-stoppers.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck(at)Yahoo(dot)com #
