| From: | Önder Kalacı <onderkalaci(at)gmail(dot)com> |
|---|---|
| To: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher |
| Date: | 2022-07-22 16:15:16 |
| Message-ID: | CACawEhWV=H_jDyJPm_CUqkBZNMkK-1gtt3ddShj_7p3eF3iwRQ@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
Hi,
>
> >
> > One another idea could be to re-calculate the index, say after N
> updates/deletes for the table. We may consider using subscription_parameter
> for getting N -- with a good default, or even hard-code into the code. I
> think the cost of re-calculating should really be pretty small compared to
> the other things happening during logical replication. So, a sane default
> might work?
> >
>
> One difficulty in deciding the value of N for the user or choosing a
> default would be that we need to probably consider the local DML
> operations on the table as well.
>
>
Fair enough, it is not easy to find a good default.
> > If you think the above doesn't work, I can try to work on a separate
> patch which adds something like "analyze invalidation callback".
> >
>
> I suggest we should give this a try and if this turns out to be
> problematic or complex then we can think of using some heuristic as
> you are suggesting above.
>
Alright, I'll try this and respond shortly back.
> >
> > OR use a complex heuristic with a reasonable index pick. And, the latter
> approach converged to the planner code in the patch. Do you think the
> former approach is acceptable?
> >
>
> In this regard, I was thinking in which cases a sequence scan can be
> better than the index scan (considering one is available). I think if
> a certain column has a lot of duplicates (for example, a column has a
> boolean value) then probably doing a sequence scan is better. Now,
> considering this even though your other approach sounds simpler but
> could lead to unpredictable results. So, I think the latter approach
> is preferable.
>
Yes, it makes sense. I also considered this during the development of the
patch, but forgot to mention :)
>
> BTW, do we want to consider partial indexes for the scan in this
> context? I mean it may not have data of all rows so how that would be
> usable?
>
>
As far as I can see, check_index_predicates() never picks a partial index
for the baserestrictinfos we create in CreateReplicaIdentityFullPaths().
The reason is that we have roughly the following call stack:
-check_index_predicates
--predicate_implied_by
---predicate_implied_by_recurse
----predicate_implied_by_simple_clause
-----operator_predicate_proof
And, inside operator_predicate_proof(), there is never going to be an
equality. Because, we push `Param`s to the baserestrictinfos whereas the
index predicates are always `Const`.
If we want to make it even more explicit, I can filter out `Path`s with
partial indexes. But that seems redundant to me. For now, I pushed the
commit with an assertion that we never pick partial indexes and also added
a test.
If you think it is better to explicitly filter out partial indexes, I can
do that as well.
> Few comments:
> ===============
> 1.
> static List *
> +CreateReplicaIdentityFullPaths(Relation localrel)
> {
> ...
> + /*
> + * Rather than doing all the pushups that would be needed to use
> + * set_baserel_size_estimates, just do a quick hack for rows and width.
> + */
> + rel->rows = rel->tuples;
>
> Is it a good idea to set rows without any selectivity estimation?
> Won't this always set the entire rows in a relation? Also, if we don't
> want to use set_baserel_size_estimates(), how will we compute
> baserestrictcost which will later be used in the costing of paths (for
> example, costing of seqscan path (cost_seqscan) uses it)?
>
> In general, I think it will be better to consider calling some
> top-level planner functions even for paths. Can we consider using
> make_one_rel() instead of building individual paths?
Thanks, this looks like a good suggestion/simplification. I wanted to use
the least amount of code possible, and make_one_rel() does either what I
exactly need or slightly more, which is great.
Note that make_one_rel() also follows the same call stack that I noted
above. So, I cannot spot any problems with partial indexes. Maybe am I
missing something here?
> On similar lines,
> in function PickCheapestIndexPathIfExists(), can we use
> set_cheapest()?
>
>
Yes, make_one_rel() + set_cheapest() sounds better. Changed.
> 2.
> @@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel,
> Relation idxrel,
> int2vector *indkey = &idxrel->rd_index->indkey;
> bool hasnulls = false;
>
> - Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
> - RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));
>
> You have removed this assertion but there is a comment ("This is not
> generic routine, it expects the idxrel to be replication identity of a
> rel and meet all limitations associated with that.") atop this
> function which either needs to be changed/removed and probably we
> should think if the function needs some change after removing that
> restriction.
>
>
Ack, I can see your point. I think, for example, we should skip index
attributes that are not simple column references. And, probably whatever
other restrictions that PRIMARY has, should be here.
I'll read some more Postgres code & test before pushing a revision for this
part. In the meantime, if you have any suggestions/pointers for me to look
into, please note here.
Attached v2 of the patch with addressing some of the comments you had. I'll
work on the remaining shortly.
Thanks,
Onder
| Attachment | Content-Type | Size |
|---|---|---|
| v2_0001_use_index_on_subs_when_pub_rep_ident_full.patch | application/octet-stream | 42.8 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Joe Conway | 2022-07-22 16:49:10 | Re: [PATCH] Introduce array_shuffle() and array_sample() |
| Previous Message | Julien Rouhaud | 2022-07-22 16:08:30 | Re: Expose Parallelism counters planned/execute in pg_stat_statements |