Re: Thoughts on NBASE=100000000

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Thoughts on NBASE=100000000
Date: 2024-07-08 10:45:13
Message-ID: CAEze2Wi+UGrG4SewoJTTCgVkoxSX=2j-4Sz=KkdhWo6W127cww@mail.gmail.com
Lists: pgsql-hackers

On Sun, 7 Jul 2024, 22:40 Joel Jacobson, <joel(at)compiler(dot)org> wrote:
>
> Hello hackers,
>
> I'm not hopeful this idea will be fruitful, but maybe we can find solutions
> to the problems together.
>
> The idea is to increase the numeric NBASE from 1e4 to 1e8, which could possibly
> give a significant performance boost of all operations across the board,
> on 64-bit architectures, for many inputs.
>
> Last time numeric's base was changed was back in 2003, when d72f6c75038 changed
> it from 10 to 10000. Back then, 32-bit architectures were still dominant,
> so base-10000 was clearly the best choice at this time.
>
> Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would
> have been the best choice, since the square of that still fits in
> a 64-bit signed int.

Back then, 64-bit was nowhere near as dominant (server and consumer
chips with the AMD64 ISA were only released later that year, after the
commit), so I don't think 1e8 would have been the best choice at that
point in time. It would be better now, yes, but not back then.

> Changing NBASE might seem impossible at first, due to the existing numeric data
> on disk, and incompatibility issues when numeric data is transferred on the
> wire.
>
> Here are some ideas on how to work around some of these:
>
> - Incrementally changing the data on disk, e.g. upon UPDATE/INSERT
> and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit)
> when reading data.

I think that a dynamic decision would make more sense here. At low
precision, the overhead of 4+1 bytes vs 2 bytes is quite significant.
This sounds important for overall storage concerns, especially if the
padding bytes (mentioned below) are added to indicate types.
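
As a rough illustration of that overhead, here is a minimal sketch (not
PostgreSQL code) that compares digit-array sizes for a value with a given
number of decimal digits. The function names are hypothetical, the "+1" is
an assumed single marker pad byte on top of 32-bit digits, and the
alignment of digit groups around the decimal point is ignored.

```c
#include <stddef.h>

/* Existing layout: one int16 (2 bytes) per group of 4 decimal digits. */
static size_t
digits_bytes_1e4(unsigned ndecimal)
{
    return 2 * (size_t) ((ndecimal + 3) / 4);
}

/*
 * Assumed NBASE=1e8 layout: one int32 (4 bytes) per group of 8 decimal
 * digits, plus 1 pad byte used only as a format marker (an assumption
 * made for this sketch).
 */
static size_t
digits_bytes_1e8(unsigned ndecimal)
{
    return 4 * (size_t) ((ndecimal + 7) / 8) + 1;
}
```

Under these assumptions a one-digit value takes 2 bytes at NBASE=1e4 but
5 bytes at NBASE=1e8, while a 64-digit value takes 32 vs 33 bytes, which
is why the relative cost matters mostly at low precision.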

> - Due to the lack of a version field in the NumericVar struct,
> we need a way to detect if a Numeric value on disk uses
> the existing NBASE=1e4, or NBASE=1e8.
> One hack I've thought about is to exploit the fact that NUMERIC_NBYTES,
> defined as:
> #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num))
> will always be divisible by two, since a NumericDigit is an int16 (2 bytes).
> The idea is then to let "NUMERIC_NBYTES divisible by three"
> indicate NBASE=1e8, at the cost of one to three extra padding bytes.

Do you perhaps mean NUMERIC_NBYTES *not divisible by 2*, i.e. an
odd NUMERIC_NBYTES as the indicator for NBASE=1e8, rather than only
multiples of 3? I'm asking because there are many integers divisible
by both 2 and 3 (all integer multiples of 6; that's 50% of the
multiples of 3), so with the multiple-of-3 scheme we might need up to
5 pad bytes to get to the next multiple of 3 that isn't also a
multiple of 2. Additionally, if the last digit would've fit in
NBASE_1e4, then the 1e8-based numeric value could even be 7 bytes
larger than the equivalent 1e4-based numeric.
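
The padding arithmetic above can be checked with a small sketch. These
helpers are hypothetical, not part of numeric.c; they take the number of
digit bytes in an NBASE=1e8 value (a multiple of 4, assuming int32
digits) and return the pad bytes each marker scheme would need.

```c
#include <stddef.h>

/* "Odd length" marker: pad only when the byte count is even. */
static size_t
pad_to_odd(size_t data_bytes)
{
    return (data_bytes % 2 == 0) ? 1 : 0;
}

/*
 * "Multiple of 3 but not of 2" marker: such lengths are exactly those
 * congruent to 3 modulo 6, so pad up to the next one.
 */
static size_t
pad_to_mult3_not2(size_t data_bytes)
{
    return (3 + 6 - data_bytes % 6) % 6;
}
```

For a single 4-byte digit, pad_to_mult3_not2(4) is 5, giving 9 bytes
against 2 bytes for the same value at NBASE=1e4, i.e. the 7-byte
difference mentioned above. With the odd-length marker the same value
needs 1 pad byte, for a difference of 3 bytes.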

While I don't think this is worth implementing for general usage, it
could be worth exploring for the larger numeric values, where the
relative overhead of the larger representation is lower.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
