Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

From: Önder Kalacı <onderkalaci(at)gmail(dot)com>
To: Marco Slot <marco(dot)slot(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, shiy(dot)fnst(at)fujitsu(dot)com, wangw(dot)fnst(at)fujitsu(dot)com, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date: 2023-01-27 13:02:13
Message-ID: CACawEhUxEfVr64ywC0sj7nDwFg=vR30qehNoqSDcx6VV2=B_Lw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Marco, Tom,

> But still it doesn't seem to me to be appropriate to use the planner to
find a suitable index.

As Marco noted, here we are trying to pick an index that is non-unique. We
could pick the index based on information extracted from pg_index (or
such), but then, it'd be a premature selection. Before sending the patch to
pgsql-hackers, I initially tried to find a suitable one with such an
approach.

But then, I still ended up using costing functions (and some other low
level functions). Overall, it felt like the planner is the module that
makes this decision best. Why would we try to invent another immature way
of doing this? With that reasoning, I ended up using the related planner
functions directly.

However, I assume the current approach of using low-level functions in the
> common case was chosen for performance reasons.
>

That's partially the reason. If you look at the patch, we use the planner
(or the low level functions) infrequently. It is only called when the
logical replication relation cache is rebuilt. As far as I can see, that
happens with (auto) ANALYZE or DDLs etc. I expect these are infrequent
operations. Still, I wanted to make sure we do not create too much overhead
even if there are frequent invalidations.

The main reason for using the low level functions over the planner itself
is to have some more control over the decision. For example, due to the
execution limitations, we currently cannot allow an index that consists of
only expressions (similar to pkey restriction). With the current approach,
we can easily filter those out.

Also, another minor reason is that, if we use planner, we'd get a
PlannedStmt back. It also felt weird to check back the index used from a
PlannedStmt. In the current patch, we iterate over Paths, which seems more
intuitive to me.

> I suppose the options are:
> 1. use regular planner uniformly
> 2. use regular planner only when there's no replica identity (or
> configurable?)
> 3. only use low-level functions
> 4. keep using sequential scans for every single updated row
> 5. introduce a hidden logical row identifier in the heap that is
> guaranteed unique within a table and can be used as a replica identity when
> no unique index exists
>

One other option I considered was to ask the index explicitly on the
subscriber side from the user when REPLICA IDENTITY is FULL. But, it is a
pretty hard choice for any user, even a planner sometimes fails to pick the
right index :) Also, it is probably controversial to change any of the
APIs for this purpose?

I'd be happy to hear from more experienced hackers on the trade-offs for
the above, and I'd be open to work on that if there is a clear winner. For
me (3) is a decent solution for the problem.

Thanks,
Onder

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nitin Jadhav 2023-01-27 13:16:43 Re: Improve GetConfigOptionValues function
Previous Message Nitin Jadhav 2023-01-27 12:56:28 Add a test case related to the error "cannot fetch toast data without an active snapshot"