Re: Reducing output size of nodeToString

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
Subject: Re: Reducing output size of nodeToString
Date: 2024-02-12 18:03:30
Message-ID: CAEze2WhTQ1LGtpyobcLFvda61BDtkiYvCfyuUA9-Mi9iwd-gyg@mail.gmail.com
Lists: pgsql-hackers

On Wed, 31 Jan 2024 at 18:47, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> On Wed, Jan 31, 2024 at 11:17 AM Matthias van de Meent
> <boekewurm+postgres(at)gmail(dot)com> wrote:
> > I was also thinking about smaller per-attribute expression storage, for index attribute expressions, table default expressions, and functions. Other than that, less memory overhead for the serialized form of these constructs also helps for catalog cache sizes, etc.
> > People complained about the size of a fresh initdb, and I agreed with them, so I started looking at low-hanging fruits, and this is one.
> >
> > I've not done any tests yet on whether it's more performant in general. I'd expect the new code to do a bit better given the extremely verbose nature of the data and the rather complex byte-at-a-time token read method used, but this is currently hypothesis.
> > I do think that serialization itself may be slightly slower, but given that this generally happens only in DDL, and that we have to grow the output buffer less often, this too may still be a net win (but, again, this is an untested hypothesis).
>
> I think we're going to have to have separate formats for debugging and
> storage if we want to get very far here. The current format sucks for
> readability because it's so verbose, and tightening that up where we
> can makes sense to me. For me, that can include things like omitting
> unset location fields for sure, but delta-encoding of bitmap sets is
> more questionable. Turning 1 2 3 4 5 6 7 8 9 10 into 1-10 would be
> fine with me because that is both shorter and more readable, but
> turning 2 4 6 8 10 into 2 2 2 2 2 is way worse for a human reader.
> Such optimizations might make sense in a format that is designed for
> computer processing only but not one that has to serve multiple
> purposes.

I suppose so, yes. I've removed the delta-encoding from the
serialization of bitmapsets in the attached patchset.

Peter E. and I spoke about this patchset at FOSDEM PGDay, too. I said
to him that I wouldn't mind if this patchset was only partly applied:
The gains for most of the changes are definitely worth it even if some
others don't get in.

I think it'd be a nice QoL and storage improvement if even only (say)
the first two patches were committed, though the typmod default
markings (or alternatively, using a typedef-ed TypMod and one more
type-specific serialization handler) would also be a good improvement
without introducing too many "common value = default = omitted"
considerations that would reduce debuggability.
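To make the "common value = default = omitted" idea concrete, here is a
minimal Python sketch. It is not the patch code: the field names are
borrowed from Var, but the default tables, the serialize() helper and
the exact output format are assumptions made for illustration only.

```python
# Illustrative only: these tables are assumptions, not the patch's real metadata.
STATIC_DEFAULTS = {"vartypmod": -1, "varlevelsup": 0, "location": -1}

# Fields whose default is the value of a sibling field in the same node.
SIBLING_DEFAULTS = {"varnosyn": "varno", "varattnosyn": "varattno"}


def serialize(node_type, fields):
    """Write a node as '{TYPE :name value ...}', skipping defaulted fields."""
    parts = [node_type]
    for name, value in fields.items():
        # Skip fields that hold their fixed default value.
        if name in STATIC_DEFAULTS and value == STATIC_DEFAULTS[name]:
            continue
        # Skip fields that merely repeat the sibling they default to.
        sibling = SIBLING_DEFAULTS.get(name)
        if sibling is not None and value == fields[sibling]:
            continue
        parts.append(f":{name} {value}")
    return "{" + " ".join(parts) + "}"


var = {
    "varno": 1,
    "varattno": 2,
    "vartypmod": -1,
    "varlevelsup": 0,
    "varnosyn": 1,
    "varattnosyn": 2,
    "location": -1,
}
compact = serialize("VAR", var)
```

A reader of the compact form has to know the defaults to reconstruct
the full node, which is the debuggability trade-off mentioned above.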

Attached is patchset v2, which contains the improvements from these patches:

0001 has the "omit defaults" for the current types.
-23.5%pt / -35.1%pt (toasted / raw)
0002+0003 has new #defined type "Location" for those fields in Nodes
that point into (or have sizes of) query texts, and adds
infrastructure to conditionally omit them entirely (see previous
discussions).
-3.5%pt / -6.3%pt
0004 has new #defined type TypMod as alias for int32, that uses a
default value of -1 for (de)serialization purposes.
-3.0%pt / -6.1%pt
0005 updates Const node serialization to omit `:constvalue` if the
value is null.
+0.1%pt / -0.1%pt [^0]
0006 does run-length encoding for bitmaps and the various typed
integer lists, using "+int" as indicators of a run of a certain
length, excluding the start value.
Bitmaps, IntLists and XidLists are based on runs with increments
of 1 (so, a notation (i 1 +3) means (i 1 2 3 4)), while OidLists are
based on runs with no increments (so, (o 1 +3) means (o 1 1 1 1)).
-2.5%pt / -0.6%pt
0007 adds a few select custom 'default' values, in that the
varnosyn and varattnosyn fields now treat the value of varno and
varattno as their default values.
This reduces the size of lists of Vars significantly and has a
very meaningful impact on the size of the compressed data (the default
pg_rewrite dataset contains some 10.8k Var nodes).
-10.4%pt / -9.7%pt
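As a rough illustration of the "+int" run notation described for 0006,
here is a Python sketch of an encoder and decoder. It is not the patch
code: rle_encode, rle_decode and the step parameter are names made up
for this example (step=1 for Bitmapset/IntList/XidList-style runs,
step=0 for OidList-style runs), and emitting "+1" for a run of a single
extra element is an assumption; the patch may handle short runs
differently.

```python
def rle_encode(values, step):
    """Encode ints as tokens, using "+N" for N extra elements after a start value.

    step=1 treats consecutive increments of 1 as a run,
    step=0 treats repeats of the same value as a run.
    """
    tokens = []
    i = 0
    while i < len(values):
        start = values[i]
        run = 0  # elements after `start` that continue the run
        while (i + run + 1 < len(values)
               and values[i + run + 1] == start + step * (run + 1)):
            run += 1
        tokens.append(str(start))
        if run > 0:
            tokens.append(f"+{run}")
        i += run + 1
    return tokens


def rle_decode(tokens, step):
    """Inverse of rle_encode for the same step."""
    values = []
    for tok in tokens:
        if tok.startswith("+"):
            last = values[-1]
            for k in range(1, int(tok[1:]) + 1):
                values.append(last + step * k)
        else:
            values.append(int(tok))
    return values


int_tokens = rle_encode([1, 2, 3, 4], step=1)   # the (i 1 +3) case
oid_tokens = rle_encode([1, 1, 1, 1], step=0)   # the (o 1 +3) case
```

Unlike delta-encoding, a list such as 2 4 6 8 10 contains no runs under
either rule and is written out unchanged.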

Total for the full applied patchset:
55.5% smaller data in pg_rewrite.ev_action before TOAST
45.7% smaller data in pg_rewrite.ev_action after applying TOAST

Toast relation size, as fraction of the main pg_rewrite table:
select pg_relation_size(2838) *1.0 / pg_relation_size('pg_rewrite');
master: 4.7
0007: 1.3

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[^0]: The small difference in size for patch 0005 is presumably due to
the low occurrence of NULL-valued Const nodes. Additionally, the inline vs
out-of-line TOASTed data and data (not) fitting on the last blocks of
each relation are likely to cause the change in total compression
ratio. If we had more null-valued Const nodes in pg_rewrite, the ratio
would presumably have been better than this.

PS: query I used for my data collection, + combined data:

select 'master' as "version"
, pg_database_size('template0') as "template0"
, pg_total_relation_size('pg_rewrite') as "pg_rewrite"
, sum(pg_column_size(ev_action)) as "toasted"
, sum(octet_length(ev_action)) as "raw"
from pg_rewrite;

 version | template0 | pg_rewrite | toasted |   raw
---------+-----------+------------+---------+---------
 master  |   7537167 |     770048 |  574003 | 3002556
 0001    |   7348751 |     630784 |  438852 | 1946364
 0002    |   7242255 |     573440 |  403160 | 1840404
 0003    |   7242255 |     573440 |  402325 | 1838367
 0004    |   7225871 |     557056 |  384888 | 1652287
 0005    |   7234063 |     565248 |  385678 | 1648717
 0006    |   7217679 |     548864 |  371256 | 1627733
 0007    |   7143951 |     475136 |  311255 | 1337496
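
For anyone who wants to re-derive percentage-point figures from the
table above, here is a small Python sketch. It assumes each per-patch
delta is the change relative to the previous row, expressed as a
fraction of the master size; that is my reading of the "%pt" numbers,
not something the script used for the measurements did. The helper
names are made up for this example.

```python
# (toasted, raw) byte totals copied from the table above, in patch order.
SIZES = {
    "master": (574003, 3002556),
    "0001": (438852, 1946364),
    "0002": (403160, 1840404),
    "0003": (402325, 1838367),
    "0004": (384888, 1652287),
    "0005": (385678, 1648717),
    "0006": (371256, 1627733),
    "0007": (311255, 1337496),
}


def pct_point_deltas(sizes):
    """Change vs. the previous row, as a percentage of the master size."""
    base_toasted, base_raw = sizes["master"]
    deltas = {}
    prev = sizes["master"]
    for version, cur in sizes.items():
        if version == "master":
            continue
        deltas[version] = (
            (cur[0] - prev[0]) / base_toasted * 100,
            (cur[1] - prev[1]) / base_raw * 100,
        )
        prev = cur
    return deltas


def total_reduction(sizes, last):
    """Overall percentage reduction of `last` relative to master."""
    base_toasted, base_raw = sizes["master"]
    toasted, raw = sizes[last]
    return ((1 - toasted / base_toasted) * 100, (1 - raw / base_raw) * 100)


deltas = pct_point_deltas(SIZES)
total_toasted, total_raw = total_reduction(SIZES, "0007")

for version, (d_toasted, d_raw) in deltas.items():
    print(f"{version}: {d_toasted:+.1f}%pt toasted / {d_raw:+.1f}%pt raw")
```

Results computed this way can differ from the figures quoted earlier in
the last digit, depending on whether values are rounded or truncated.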

Attachment Content-Type Size
v2-0005-nodeToString-omit-serializing-NULL-datums-in-Cons.patch application/octet-stream 1.9 KB
v2-0001-pg_node_tree-Omit-serialization-of-fields-with-de.patch application/octet-stream 23.0 KB
v2-0003-gen_node_support.pl-Mark-location-fields-as-type-.patch application/octet-stream 26.4 KB
v2-0002-pg_node_tree-Don-t-store-query-text-locations-in-.patch application/octet-stream 19.2 KB
v2-0004-gen_node_support.pl-Add-a-TypMod-type-for-signall.patch application/octet-stream 10.4 KB
v2-0007-gen_node_support.pl-Optimize-serialization-of-fie.patch application/octet-stream 9.2 KB
v2-0006-nodeToString-Apply-RLE-on-Bitmapset-and-numeric-L.patch application/octet-stream 7.9 KB
