Re: Parallel CREATE INDEX for GIN indexes

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Andy Fan <zhihuifan1213(at)163(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Parallel CREATE INDEX for GIN indexes
Date: 2024-08-28 01:15:33
Message-ID: CAEze2WjeDjVd7i=+hugzn7gB-HxEhmoBh1EOVuE+Kq=Ks6nmHQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, 28 Aug 2024 at 02:38, Andy Fan <zhihuifan1213(at)163(dot)com> wrote:
>
> Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> writes:
> > tuplesort_performsort usually only needs to flush the last elements of
> > ... In certain rare
> > cases it may take a longer time as it may have to merge sorted runs,
> > but those cases are quite rare in my experience.
>
> OK, I expect such cases are not rare. Suppose we have hundreds of GB
> of heap tuples and a 64MB or 1GB maintenance_work_mem setting; we
> would probably hit this situation. I don't mean to argue over whose
> experience reflects the facts; I just want to highlight the gap in
> our understanding. Thanks for sharing this, I can pay more attention
> to this direction in my future work. To be clearer, my setup hit the
> 'mergeruns' case.

Huh, I've never noticed the performsort phase of btree index creation
(as seen in pg_stat_progress_create_index) take much time if any,
especially when compared to the tuple loading phase, so I assumed it
didn't happen often. Hmm, maybe I've just been quite lucky.

> >
> >> But 0008 still does many good stuff:
> >> 1. Reduce the IO usage; this would be more useful on some heavy IO
> >> workloads.
> >> 2. Simplify the building logic by removing one stage.
> >> 3. Add the 'buffer-writetup' to tuplesort.c. I don't have another
> >> use case for now, but it looks like a reasonable design.
> >
> > I'd imagine nbtree would like to use this too, for applying some
> > deduplication in the sort stage.
>
> The current nbtree does the deduplication during insertion into the
> nbtree, IIUC. In your new strategy, we could move it to the "sort"
> stage, which looks good to me [to confirm my understanding].

Correct: We can do at least some deduplication in the sort stage. Not
all, because tuples need to fit on pages and we don't want to make the
tuples so large that we'd cause unnecessary splits while loading the
tree, but merging runs of 10-30 tuples should reduce IO requirements
by some margin for indexes where deduplication is important.
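The kind of sort-stage merging described above can be sketched as follows. This is an illustrative toy, not PostgreSQL code: it collapses runs of equal keys in already-sorted input into posting-list-style groups, capped at a maximum group size so the resulting tuples would still fit on a page. The names and the cap value are assumptions for illustration.

```python
# Toy sketch of sort-stage deduplication: merge consecutive equal keys
# in a sorted run of (key, tid) tuples into capped groups.
MAX_GROUP = 30  # cap merged-group size to avoid oversized index tuples

def merge_duplicates(sorted_tuples):
    """Merge consecutive equal keys; input must be sorted by key."""
    out = []
    for key, tid in sorted_tuples:
        if out and out[-1][0] == key and len(out[-1][1]) < MAX_GROUP:
            out[-1][1].append(tid)   # extend the current group for this key
        else:
            out.append((key, [tid]))  # start a new group
    return out

run = [(1, 10), (1, 11), (1, 12), (2, 20), (2, 21), (3, 30)]
merged = merge_duplicates(run)
# 6 tuples collapse into 3 grouped tuples; each key is stored only once
```

Storing each duplicated key once per group rather than once per tuple is where the space (and thus IO) saving comes from.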

> > The IO benefits are quite likely to
> > be worth it; a minimum space saving of 25% on duplicated key values in
> > tuple sorts sounds real great.
>
> Just to be clear on this point: the IO benefits can only be gained
> when the tuplesort's memory can't hold all the tuples, in which case
> tuplesort_performsort would run 'mergeruns'; otherwise we can't get
> any benefit?

It'd be when the tuplesort's memory can't hold all tuples, but
mergeruns isn't strictly required here, as dumptuples() would already
allow some tuple merging.

> >> I think the current blocker is if it is safe to hack the tuplesort.c.
> >> With my current knowledge, It looks good to me, but it would be better
> >> open a dedicated thread to discuss this specially, the review would not
> >> take a long time if a people who is experienced on this area would take
> >> a look.
> >
> > I could adapt the patch for nbtree use, to see if anyone's willing to
> > review that?
>
> I'm interested in it and can do some review & testing. But the key
> point is that we need some authorities who are willing to review it
> to make it happen. To a larger extent, a dedicated thread would be
> helpful.

Then I'll split it off into a new thread sometime later this week.

> > nbtree does sorted insertions into the tree, constructing leaf pages
> > one at a time and adding separator keys in the page above when the
> > leaf page was filled, thus removing the need to descend the btree. I
> > imagine we can save some performance by mirroring that in GIN too,
> > with as additional bonus that we'd be free to start logging completed
> > pages before we're done with the full index, reducing max WAL
> > throughput in GIN index creation.
>
> I agree this is a promising direction as well.

It'd be valuable to see whether the current patch's "parallel sorted"
insertion is faster than the current GIN insertion path even when we
use only the primary process, as it could be competitive. Btree-like
bulk tree loading might even be meaningfully faster than GIN's current
index creation process. However, as I mentioned significantly
upthread, I don't expect that change to happen in this patch series.
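The btree-style bulk loading mentioned above can be sketched minimally: fill leaf pages left to right from sorted input and record a separator key each time a page completes, instead of descending the tree for every tuple. The page capacity and flat structure here are simplified assumptions; a completed page is also the point at which it could already be WAL-logged.

```python
# Toy sketch of bottom-up bulk loading from sorted input.
PAGE_CAPACITY = 4  # tuples per leaf page (illustrative)

def bulk_load(sorted_keys):
    leaves, separators = [], []
    page = []
    for key in sorted_keys:
        if len(page) == PAGE_CAPACITY:
            leaves.append(page)      # page complete: could be logged now
            separators.append(key)   # first key of the next page separates them
            page = []
        page.append(key)
    leaves.append(page)              # final, possibly partial, page
    return leaves, separators

leaves, seps = bulk_load(list(range(10)))
# leaves: [[0,1,2,3], [4,5,6,7], [8,9]]; separators: [4, 8]
```

Because input is sorted, each leaf is written exactly once and the separator keys can be accumulated for the level above without any tree descent.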

> > I imagine a pair of tuplesort_beginsortedrun();
> > tuplesort_endsortedrun() -functions to help this.
>
> These APIs are better than the ones in my mind :) During the window
> between tuplesort_beginsortedrun() and tuplesort_endsortedrun(), we
> can bypass the tuplesort's memory.

Exactly: we'd have the user call tuplesort_beginsortedrun(), then
iteratively insert its sorted tuples using the usual
tuplesort_putYYY() API, and then call _endsortedrun() when the sorted
run is complete. It does need some work in tuplesort state handling
and internals, but I think that's quite achievable.
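The proposed calling pattern could look roughly like this toy model. It is a hedged sketch, not the real tuplesort.c: between begin and end of a sorted run, puttuple calls append straight to a run rather than going through the in-memory sort, which is the "bypass" discussed above. Method names loosely mirror the hypothetical API.

```python
import heapq

class ToyTuplesort:
    """Toy model of a tuplesort with pre-sorted-run support."""
    def __init__(self):
        self.mem = []            # tuples awaiting an in-memory sort
        self.runs = []           # already-sorted runs
        self.in_sorted_run = False

    def begin_sorted_run(self):
        self.in_sorted_run = True
        self.runs.append([])

    def put_tuple(self, t):
        if self.in_sorted_run:
            run = self.runs[-1]
            assert not run or run[-1] <= t, "caller must supply sorted input"
            run.append(t)        # no sorting work, no memory pressure
        else:
            self.mem.append(t)

    def end_sorted_run(self):
        self.in_sorted_run = False

    def performsort(self):
        if self.mem:
            self.runs.append(sorted(self.mem))
            self.mem = []
        return list(heapq.merge(*self.runs))  # k-way merge of all runs

ts = ToyTuplesort()
ts.begin_sorted_run()
for t in (1, 3, 5):
    ts.put_tuple(t)
ts.end_sorted_run()
for t in (4, 2):
    ts.put_tuple(t)
result = ts.performsort()   # merges the pre-sorted run with the sorted mem run
```

The final merge treats the caller-supplied run exactly like an internally produced one, which is why only state handling (tracking whether a sorted run is open) needs to change.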

Kind regards,

Matthias van de Meent
