Re: Collect statistics about conflicts in logical replication

From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Subject: Re: Collect statistics about conflicts in logical replication
Date: 2024-08-27 02:58:58
Message-ID: CAHut+Ps0DjmFLPRJOa3JmdNDNXfnix_=_+E=RaQK+r8cGjidKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Aug 26, 2024 at 10:13 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Monday, August 26, 2024 3:30 PM Peter Smith <smithpb2250(at)gmail(dot)com> wrote:
> >
> > ======
> > src/include/replication/conflict.h
> >
> > nit - defined 'NUM_CONFLICT_TYPES' inside the enum (I think this way is
> > often used in other PG source enums)
>
> I think we have recently tended to avoid doing that, as it has been commented
> that this style is somewhat deceptive and can cause confusion. See a previous
> similar comment[1]. The current style follows the other existing examples like:
>
> #define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
> #define IOCONTEXT_NUM_TYPES (IOCONTEXT_VACUUM + 1)
> #define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
> #define BACKEND_NUM_TYPES (B_LOGGER + 1)

OK.

>
>
> > ======
> > src/test/subscription/t/026_stats.pl
> >
> > 1.
> > + # Delete data from the test table on the publisher. This delete
> > + operation # should be skipped on the subscriber since the table is already
> > empty.
> > + $node_publisher->safe_psql($db, qq(DELETE FROM $table_name;));
> > +
> > + # Wait for the subscriber to report tuple missing conflict.
> > + $node_subscriber->poll_query_until(
> > + $db,
> > + qq[
> > + SELECT update_missing_count > 0 AND delete_missing_count > 0 FROM
> > + pg_stat_subscription_stats WHERE subname = '$sub_name'
> > + ])
> > + or die
> > + qq(Timed out while waiting for tuple missing conflict for
> > subscription '$sub_name');
> >
> > Can you write a comment to explain why the replicated DELETE is
> > expected to increment both the 'update_missing_count' and the
> > 'delete_missing_count'?
>
> I think the comments several lines above the wait explained the reason[2]. I
> slightly modified the comments to make it clear.
>

1.
Right, but it still was not obvious to me what caused the
'update_missing_count'. On further study, I see it was a hangover from
the earlier UPDATE test case which was still stuck in an ERROR loop
attempting to do the update operation. e.g. before it was giving the
expected 'update_exists' conflicts but after the subscriber table
TRUNCATE the update conflict changes to give a 'update_missing'
conflict instead. I've updated the comment to reflect my
understanding. Please have a look to see if you agree.

~~~~

2.
I separated the tests for 'update_missing' and 'delete_missing',
putting the update_missing test *before* the DELETE. I felt the
expected results were much clearer when each test did just one thing.
Please have a look to see if you agree.

~~~

3.
+# Enable track_commit_timestamp to detect origin-differ conflicts in logical
+# replication. Reduce wal_retrieve_retry_interval to 1ms to accelerate the
+# restart of the logical replication worker after encountering a conflict.
+$node_subscriber->append_conf(
+ 'postgresql.conf', q{
+track_commit_timestamp = on
+wal_retrieve_retry_interval = 1ms
+});

Later, after CDR resolvers are implemented, it might be good to
revisit these conflict test cases and re-write them to use some
conflict resolvers like 'skip'. Then the subscriber won't give ERRORs
and restart apply workers all the time behind the scenes, so you won't
need the above configuration for accelerating the worker restarts. In
other words, running these tests might be more efficient if you can
avoid restarting workers all the time.

I suggest putting an XXX comment here as a reminder that these tests
should be revisited to make use of conflict resolvers in the future.

~~~

nit - not caused by this patch, but other comment inconsistencies
about "stats_reset timestamp" can be fixed in passing too.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment Content-Type Size
PS_NITPICKS_20240827_CDR_STATS_v2.txt text/plain 3.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2024-08-27 03:19:43 Re: Doc: fix the note related to the GUC "synchronized_standby_slots"
Previous Message Tom Lane 2024-08-27 02:03:47 Re: Significant Execution Time Difference Between PG13.14 and PG16.4 for Query on information_schema Tables.