Re: Large files for relations

From: David Steele <david(at)pgmasters(dot)net>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Dagfinn Ilmari Mannsåker <ilmari(at)ilmari(dot)org>, Jim Mlodgenski <jimmy76(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: Large files for relations
Date: 2023-06-12 08:52:56
Message-ID: 05cc050b-af2c-7504-c628-63bdcf104702@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 5/28/23 08:48, Thomas Munro wrote:
>
> Alright, since I had some time to kill in an airport, here is a
> starter patch for initdb --rel-segsize.

I've gone through this patch and it looks pretty good to me. A few things:

+ * rel_setment_size, we will truncate the K+1st segment to 0 length

rel_setment_size -> rel_segment_size

+ * We used a phony GUC with a custome show function, because we don't

custome -> custom

+ if (strcmp(endptr, "kB") == 0)

Why kB here instead of KB to match MB, GB, TB below?

+ int64 relseg_size; /* blocks per segment of large relation */

This will require PG_CONTROL_VERSION to be bumped -- but you are
probably waiting until commit time to avoid annoying conflicts, though I
don't think it is as likely as with CATALOG_VERSION_NO.

> Some random thoughts:
>
> Another potential option name would be --segsize, if we think we're
> going to use this for temp files too eventually.

I feel like temp file segsize should be separately configurable for the
same reason that we are leaving it as 1GB for now.

> Maybe it's not so beautiful to have that global variable
> rel_segment_size (which replaces REL_SEGSIZE everywhere).

Maybe not, but it is the way these things are done in general, .e.g.
wal_segment_size, so I don't think it will be too controversial.

> Another
> idea would be to make it static in md.c and call smgrsetsegmentsize(),
> or something like that. That could be a nice place to compute the
> "shift" value up front, instead of computing it each time in
> blockno_to_segno(), but that's probably not worth bothering with (?).
> BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's
> about the only place where someone could say that this change makes
> things worse for people not interested in the new feature, so I was
> careful to get rid of / and % operations with no-longer-constant RHS.

Right -- not sure we should be troubling ourselves with trying to
optimize away ops that are very fast, unless they are computed trillions
of times.

> I had to promote segment size to int64 (global variable, field in
> control file), because otherwise it couldn't represent
> --rel-segsize=32TB (it'd be too big by one). Other ideas would be to
> store the shift value instead of the size, or store the max block
> number, eg subtract one, or use InvalidBlockNumber to mean "no limit"
> (with more branches to test for it). The only problem I ran into with
> the larger type was that 'SHOW segment_size' now needs a custom show
> function because we don't have int64 GUCs.

A custom show function seems like a reasonable solution here.

> A C type confusion problem that I noticed: some code uses BlockNumber
> and some code uses int for segment numbers. It's not really a
> reachable problem for practical reasons (you'd need over 2 billion
> directories and VFDs to reach it), but it's wrong to use int if
> segment size can be set as low as BLCKSZ (one file per block); you
> could have more segments than an int can represent. We could go for
> uint32, BlockNumber or create SegmentNumber (which I think I've
> proposed before, and lost track of...). We can address that
> separately (perhaps by finding my old patch...)

I think addressing this separately is fine, though maybe enforcing some
reasonable minimum in initdb would be a good idea for this patch. For my
2c SEGSIZE == BLOCKSZ just makes very little sense.

Lastly, I think the blockno_to_segno(), blockno_within_segment(), and
blockno_to_seekpos() functions add enough readability that they should
be committed regardless of how this patch proceeds.

Regards,
-David

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2023-06-12 09:13:53 Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG
Previous Message Fujii.Yuki@df.MitsubishiElectric.co.jp 2023-06-12 08:51:30 RE: Partial aggregates pushdown