Understanding logical replication lag

From: Patrick Molgaard <draaglom(at)gmail(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Understanding logical replication lag
Date: 2021-01-21 18:21:20
Message-ID: CA+Zan4uOGR2PmCYowN6hF-qi=x3_KTkr3SgJ=D23STN4epGWSg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi all,

*My top-level query is*: I'm using logical replication under pg 9.6 to do a
kind of change data capture and I'm seeing occasional extended periods of
significant lag. I'm not sure what conceptual model I'm missing in order to
understand why this happens.

*The details:*

I'm running Postgres 9.6.19 from the postgres debian apt repos & the
wal2json extension.

I have a custom client application which essentially executes
pg_logical_slot_get_changes() for some manually-created logical replication
slot on a loop.

I'm monitoring replication lag defined as pg_current_xlog_location() -
confirmed_flush_lsn for that slot.

What I'm observing is - very occasionally - an extended period (hours long)
wherein:

* The normal database write load continues or slightly increases
* calls to pg_logical_slot_get_changes() return no rows and
confirmed_flush_lsn doesn't move
* the duration of a call to pg_logical_slot_get_changes() rises linearly
over time

I understand from the docs and research that this is usually caused by a
long-running write transaction, but I notice I'm still confused.

* I'm not 100% sure - I'm still confirming - but I'm fairly confident that
I don't have any egregiously long write transactions (at least on that
scale of hours). Are there any other common scenarios that can result in a
similar 'blockage'? e.g some categories of long read-only transactions, or
advisory locks, or other kinds of database activity like a vacuum?

* Conversely, from experimenting, it seems as if not all long-running write
transactions cause pg_logical_slot_get_changes() to be unable to advance.
In fact, I'm not able so far to produce a minimal set of simple queries
which show that behaviour.

Given the following sequence of queries I see changes emitted:

-- session 1
begin;
insert into foo(bar,baz) values (1, 1);

-- session 2
begin;
insert into foo(bar,baz) values (2,2);
commit;

-- session 3
select data from
pg_logical_slot_get_changes('example-slot', NULL, NULL, 'format-version',
'2');

Session 3 is able to return the row from session 2 despite session 1's
ongoing transaction starting first and not yet committing. Can you help me
understand (or better yet point me to a resource which explains) the
underlying logic defining how logical decoding does in fact get blocked by
in-flight transactions?

Thanks,
Patrick

Browse pgsql-performance by date

  From Date Subject
Next Message Nagaraj Raj 2021-01-22 01:53:26 Query performance issue
Previous Message Don Seiler 2021-01-15 21:18:48 Re: High COMMIT times