From: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
---|---|
To: | Peter Eisentraut <peter(at)eisentraut(dot)org> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com> |
Subject: | Re: Reducing output size of nodeToString |
Date: | 2024-01-03 23:23:50 |
Message-ID: | CAEze2Wigkd1+J4s=7wUqW8Y4g9mDWSC28119ukbKkf799WBpzg@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, 2 Jan 2024 at 11:30, Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
> >
> > The basic techniques used are
> > - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> > - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> > - Reset more fields to their default values before storing the values.
> > - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
>
> Based on our discussions, my understanding is that you wanted to produce
> an updated patch set that is split up a bit.
I mentioned that I've been working on implementing (but have not yet
completed) a binary serialization format, with an implementation based
on Andres' generated metadata idea. However, that requires more
elaborate infrastructure than is currently available, so while I said
I'd expected it to be complete before the Christmas weekend, it'll
take some more time - I'm not sure it'll be ready for PG17.
In the meantime here's an updated version of the v0 patch, formally
keeping the textual format alive, while reducing the size
significantly (nearing 2/3 reduction), taking your comments into
account. I think the gains are worth the consideration without taking
into account the as-of-yet unimplemented binary format.
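As an illustration only (this is not code from the patch set), the "omit default values" idea from the quoted description can be sketched in Python. The field names are borrowed from the Var node, but the default values in `VAR_DEFAULTS`, the helper names `write_node` and `read_node`, and the simplified text format are all assumptions made for this sketch; the actual patches work on PostgreSQL's generated C out/read functions.

```python
# Hypothetical per-field defaults for one node type. The real patch takes
# defaults from the field datatype or from pg_node_attr() annotations.
VAR_DEFAULTS = {
    "varno": 0,
    "varattno": 0,
    "vartypmod": -1,
    "varlevelsup": 0,
    "location": -1,
}


def write_node(tag, fields, defaults):
    """Serialize a node, skipping every field that equals its default."""
    parts = [tag]
    for name, value in fields.items():
        if name in defaults and defaults[name] == value:
            continue  # omitted; the reader restores the default
        parts.append(f":{name} {value}")
    return "{" + " ".join(parts) + "}"


def read_node(text, defaults):
    """Parse a node, filling in defaults for fields that were omitted."""
    tokens = text.strip("{}").split()
    tag, rest = tokens[0], tokens[1:]
    fields = dict(defaults)  # start from the defaults
    for key, value in zip(rest[0::2], rest[1::2]):
        fields[key.lstrip(":")] = int(value)
    return tag, fields


node = {"varno": 1, "varattno": 2, "vartypmod": -1,
        "varlevelsup": 0, "location": -1}
text = write_node("VAR", node, VAR_DEFAULTS)
print(text)
print(read_node(text, VAR_DEFAULTS) == ("VAR", node))
```

The reader has to know the same defaults as the writer, which is why the description above says the reading code must be made aware of omitted fields.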
> My suggestion is to make incremental patches along these lines:
> [...]
Something like the attached? It splits out into the following:
0001: basic 'omit default values'
0002: reset location and other querystring-related node fields for all
catalog columns of type pg_node_tree.
0003: add default marking on typmod fields.
0004 & 0006: various node fields marked with default() based on
observed common or initial values of those fields
0005: truncate trailing 0s from outDatum
0007 (new): do run-length + gap coding for bitmapset and the various
integer list types. This saves a surprising amount of bytes.
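To show why gap coding combined with run-length coding can save space on integer sets, here is a minimal Python sketch. It is not the encoding used in 0007, whose exact on-disk text format is not described in this message; the function names and the (gap, run length) pair representation are assumptions for illustration.

```python
def gap_rle_encode(members):
    """Encode a list of ints as (gap, run_length) pairs.

    Each value is first replaced by its difference from the previous
    value (gap coding, starting from 0); runs of equal gaps are then
    collapsed into a single pair (run-length coding).
    """
    runs = []
    prev = 0
    for m in members:
        gap = m - prev
        prev = m
        if runs and runs[-1][0] == gap:
            runs[-1] = (gap, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((gap, 1))  # start a new run
    return runs


def gap_rle_decode(runs):
    """Invert gap_rle_encode by expanding runs and summing the gaps."""
    members = []
    cur = 0
    for gap, count in runs:
        for _ in range(count):
            cur += gap
            members.append(cur)
    return members


members = [1, 2, 3, 4, 5, 10, 12, 14]
encoded = gap_rle_encode(members)
print(encoded)
print(gap_rle_decode(encoded) == members)
```

Dense consecutive ranges, which are common for sets of attribute numbers or range table indexes, collapse into a single pair regardless of their length.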
> The last one I have some doubts about, as previously expressed, but the
> first few seem sensible to me. By splitting it up we can consider these
> incrementally.
That makes a lot of sense. The numbers for the full patchset do seem
quite positive though: The metrics of the query below show a 40%
decrease in size of a fresh pg_rewrite (standard toast compression)
and a 5% decrease in size of the template0 database. The uncompressed
data of pg_rewrite.ev_action is also 60% smaller.
select pg_database_size('template0') as "template0"
, pg_total_relation_size('pg_rewrite') as "pg_rewrite"
, sum(pg_column_size(ev_action)) as "compressed"
, sum(octet_length(ev_action)) as "raw"
from pg_rewrite;
version | template0 | pg_rewrite | compressed | raw
---------+-----------+------------+------------+---------
master | 7545359 | 761856 | 573307 | 2998712
0001 | 7365135 | 622592 | 438224 | 1943772
0002 | 7258639 | 573440 | 401660 | 1835803
0003 | 7258639 | 565248 | 386211 | 1672539
0004 | 7176719 | 483328 | 317099 | 1316552
0005 | 7176719 | 483328 | 315556 | 1300420
0006 | 7160335 | 466944 | 302806 | 1208621
0007 | 7143951 | 450560 | 287659 | 1187237
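The percentages quoted above can be re-derived from the "master" and "0007" rows of the table. A small Python check follows; the numbers are copied from the table, while the dictionary and function names are just for this sketch.

```python
# Byte counts copied from the "master" and "0007" rows of the table.
master = {"template0": 7545359, "pg_rewrite": 761856,
          "compressed": 573307, "raw": 2998712}
patched = {"template0": 7143951, "pg_rewrite": 450560,
           "compressed": 287659, "raw": 1187237}


def reduction_pct(before, after):
    """Relative size reduction in percent, rounded to one decimal."""
    return round(100.0 * (before - after) / before, 1)


reductions = {name: reduction_pct(master[name], patched[name])
              for name in master}
for name, pct in reductions.items():
    print(f"{name}: {pct}% smaller")
```

This gives roughly 5% for template0, 41% for pg_rewrite, 50% for the compressed ev_action data, and 60% for the raw ev_action data.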
While looking through the data, I noticed that a significant portion of
the larger views now consists of range table entries, specifically the
Alias and Var nodes (which are mostly repeated and/or repetitive
values, but split across Nodes). I think column-major storage would be
more efficient to write, but I'm not sure it's worth the effort in
planner code.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)
Attachment | Content-Type | Size |
---|---|---|
v1-0001-pg_node_tree-Don-t-serialize-fields-with-type-def.patch | application/octet-stream | 22.8 KB |
v1-0002-pg_node_tree-reset-node-location-before-catalog-s.patch | application/octet-stream | 12.9 KB |
v1-0005-NodeSupport-Don-t-emit-trailing-0s-in-outDatum.patch | application/octet-stream | 2.4 KB |
v1-0004-NodeSupport-add-some-more-default-markers-for-var.patch | application/octet-stream | 4.6 KB |
v1-0003-Nodesupport-add-support-for-custom-default-values.patch | application/octet-stream | 13.2 KB |
v1-0007-NodeSupport-Apply-RLE-and-differential-encoding-o.patch | application/octet-stream | 6.5 KB |
v1-0006-NodeSupport-Apply-some-more-defaults-serializatio.patch | application/octet-stream | 16.5 KB |