Re: BUG #18433: Logical replication timeout

From: Kostiantyn Tomakh <tomahkvt(at)gmail(dot)com>
To: Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18433: Logical replication timeout
Date: 2024-05-01 17:25:35
Message-ID: CAJP09w7sJEkoO2=DUi-BS_JJpv7dCDcj+kb7V5YCiW7d1Y2KZQ@mail.gmail.com
Lists: pgsql-bugs

Hello, Shlok Kyal.

I was able to reproduce the problem. I did it on a Docker-based platform,
so I hope you will be able to reproduce it too.
1) I ran the docker-compose file with PostgreSQL 10:
docker-compose.yml:

version: '3.3'
services:
  postgres:
    image: postgres:10.21
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5432:5432"
  postgres2:
    image: postgres:10.21
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data2:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5433:5432"

postgresql.conf:

hba_file = '/etc/pg_hba.conf'

listen_addresses = '*'

wal_level = logical

pg_hba.conf:

# TYPE  DATABASE  USER  ADDRESS    METHOD
host    all       all   0.0.0.0/0  md5

So I have two PostgreSQL servers.

2) I created a table on "postgres":

CREATE TABLE public.table (
field_1 uuid NOT NULL,
field_2 uuid NOT NULL,
field_3 uuid NOT NULL,
field_4 text NOT NULL,
field_5 text NOT NULL,
field_6 timestamp with time zone NOT NULL,
field_7 timestamp with time zone NOT NULL,
field_8 smallint NOT NULL,
field_9 text NOT NULL,
field_10 text NOT NULL,
field_11 uuid NOT NULL,
field_12 uuid,
field_13 text,
field_14 uuid,
field_15 text NOT NULL,
field_16 integer,
field_17 text NOT NULL,
field_18 uuid,
field_19 character varying(20),
field_20 character varying(29),
field_21 text,
field_22 bigint,
field_23 character varying(20)
);

ALTER TABLE ONLY public.table
ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);

3) Created a publication on "postgres":

CREATE ROLE replication_role WITH REPLICATION LOGIN PASSWORD 'password';

GRANT ALL PRIVILEGES ON DATABASE "db" TO replication_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO replication_role;
CREATE PUBLICATION db_name_public_publication;
ALTER PUBLICATION db_name_public_publication ADD TABLE "table";

4) I started two Python scripts that generate requests to "postgres":

insert.py:

import psycopg2
from faker import Faker
import uuid
import random
from datetime import datetime, timedelta

# Connect to your PostgreSQL database
conn = psycopg2.connect(
    dbname="db",
    user="user",
    password="password",
    host="127.0.0.1",
    port="5432"
)
cur = conn.cursor()

# Function to generate fake data for one row
def generate_fake_row():
    fake = Faker()
    row = (
        str(uuid.uuid4()),  # field_1 (UUID converted to string)
        str(uuid.uuid4()),  # field_2 (UUID converted to string)
        str(uuid.uuid4()),  # field_3 (UUID converted to string)
        fake.text(),  # field_4
        fake.text(),  # field_5
        fake.date_time_between(start_date='-1y', end_date='now', tzinfo=None),  # field_6
        fake.date_time_between(start_date='-1y', end_date='now', tzinfo=None),  # field_7
        random.randint(0, 100),  # field_8
        fake.text(),  # field_9
        fake.text(),  # field_10
        str(uuid.uuid4()),  # field_11 (UUID converted to string)
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # field_12
        fake.text() if random.random() > 0.5 else None,  # field_13
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # field_14
        fake.text(),  # field_15
        random.randint(0, 1000),  # field_16
        fake.text(),  # field_17
        str(uuid.uuid4()) if random.random() > 0.5 else None,  # field_18
        fake.random_letter() * 20,  # field_19
        fake.random_letter() * 29,  # field_20
        fake.text() if random.random() > 0.5 else None,  # field_21
        random.randint(0, 1000000),  # field_22
        fake.random_letter() * 20  # field_23
    )
    return row

# Function to insert one row of fake data into the table
def insert_fake_row(row):
    sql = """INSERT INTO public.table (
        field_1, field_2, field_3, field_4, field_5, field_6, field_7, field_8,
        field_9, field_10, field_11, field_12, field_13, field_14, field_15, field_16,
        field_17, field_18, field_19, field_20, field_21, field_22, field_23
    ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
              %s, %s, %s, %s, %s, %s, %s, %s, %s)"""
    cur.execute(sql, row)
    conn.commit()

# Insert one row at a time in an independent loop
while True:
    fake_row = generate_fake_row()  # Generate fake data for one row
    insert_fake_row(fake_row)       # Insert the row

# Close the cursor and the connection (never reached in an infinite loop)
cur.close()
conn.close()

update.py:

import psycopg2
import random
import time

# Connect to your PostgreSQL database
conn = psycopg2.connect(
    dbname="db",
    user="user",
    password="password",
    host="127.0.0.1",
    port="5432"
)
cur = conn.cursor()

# Function to update a random row in the table
def update_random_row():
    # Generate a random value for field_8 and field_16
    new_field_8 = random.randint(0, 100)
    new_field_16 = random.randint(0, 1000)

    # Choose a random row to update
    cur.execute("SELECT field_1 FROM public.table;")
    rows = cur.fetchall()
    random_row = random.choice(rows)[0]

    # Update the chosen row with the new values
    cur.execute(
        "UPDATE public.table SET field_8 = %s, field_16 = %s WHERE field_1 = %s;",
        (new_field_8, new_field_16, random_row)
    )
    conn.commit()

# Infinite loop to continuously update rows
while True:
    update_random_row()
    time.sleep(1)  # Wait for 1 second before the next update

# Close the cursor and the connection (never reached in an infinite loop)
cur.close()
conn.close()

5) Created a subscription on postgres2:

CREATE SUBSCRIPTION db_name_public_subscription CONNECTION
'host=postgres port=5432 password=password user=replication_role
dbname=db' PUBLICATION db_name_public_publication;
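As a side note, the CONNECTION value above is an ordinary libpq keyword/value
string. If you generate it from a script, values containing spaces or quotes
need libpq-style escaping. A minimal sketch (the helper name make_conninfo is
mine, not part of psycopg2):

```python
# Build a libpq keyword/value connection string such as the one passed to
# CREATE SUBSCRIPTION ... CONNECTION '...'.
# Values containing spaces, single quotes, or backslashes are escaped per
# libpq rules: backslash-escape \ and ', then wrap the value in single quotes.
def make_conninfo(params):
    parts = []
    for key, value in params.items():
        value = str(value)
        if value == "" or any(c in value for c in " '\\"):
            value = "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
        parts.append(f"{key}={value}")
    return " ".join(parts)

print(make_conninfo({
    "host": "postgres",
    "port": 5432,
    "password": "password",
    "user": "replication_role",
    "dbname": "db",
}))
# host=postgres port=5432 password=password user=replication_role dbname=db
```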

6) Then I got these messages in the logs:

10-postgres2-1 | 2024-05-01 16:34:16.296 GMT [38] LOG: logical
replication apply worker for subscription
"db_name_public_subscription" has started
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] LOG: starting
logical decoding for slot "db_name_public_subscription"
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] DETAIL: streaming
transactions committing after 0/1B5FFC8, reading WAL from 0/1B5FF90
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] LOG: logical
decoding found consistent point at 0/1B5FF90
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] DETAIL: There are
no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.300 GMT [39] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
10-postgres2-1 | 2024-05-01 16:34:16.302 GMT [39] LOG: logical
replication table synchronization worker for subscription
"db_name_public_subscription", table "table" has started
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] LOG: logical
decoding found consistent point at 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] DETAIL: There are
no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.366 GMT [40] STATEMENT:
CREATE_REPLICATION_SLOT "db_name_public_subscription_16391_sync_16385"
TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] LOG: starting
logical decoding for slot
"db_name_public_subscription_16391_sync_16385"
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] DETAIL: streaming
transactions committing after 0/1B61110, reading WAL from 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription_16391_sync_16385"
LOGICAL 0/1B61110 (proto_version '1', publication_names
'"db_name_public_publication"')
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] LOG: logical
decoding found consistent point at 0/1B610D8
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] DETAIL: There are
no running transactions.
10-postgres-1 | 2024-05-01 16:34:16.478 GMT [40] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription_16391_sync_16385"
LOGICAL 0/1B61110 (proto_version '1', publication_names
'"db_name_public_publication"')
10-postgres2-1 | 2024-05-01 16:34:16.479 GMT [39] LOG: logical
replication table synchronization worker for subscription
"db_name_public_subscription", table "table" has finished
10-postgres2-1 | 2024-05-01 16:34:16.782 GMT [38] ERROR: logical
replication target relation "public.table" has neither REPLICA
IDENTITY index nor PRIMARY KEY and published relation does not have
REPLICA IDENTITY FULL
10-postgres2-1 | 2024-05-01 16:34:16.783 GMT [1] LOG: worker
process: logical replication worker for subscription 16391 (PID 38)
exited with exit code 1
10-postgres2-1 | 2024-05-01 16:34:21.788 GMT [40] LOG: logical
replication apply worker for subscription
"db_name_public_subscription" has started
10-postgres-1 | 2024-05-01 16:34:21.791 GMT [41] LOG: starting
logical decoding for slot "db_name_public_subscription"

After these log messages, the table size on postgres was 50 MB and the
table size on postgres2 was 45 MB.

7) When I created the primary key on postgres2, the tables on postgres
and postgres2 became the same size:

ALTER TABLE ONLY public.table
ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);

Result: I had fully consistent data on postgres2 and a normal
replication process.
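For what it's worth, the error in step 6 matches the rule stated in the error
message itself: to apply UPDATE/DELETE, the target table needs a replica
identity index or a primary key, unless the published relation uses REPLICA
IDENTITY FULL. A toy sketch of that condition (function and parameter names
are illustrative, not PostgreSQL internals):

```python
# Illustrative only: the condition implied by the apply-worker error
# "target relation has neither REPLICA IDENTITY index nor PRIMARY KEY
#  and published relation does not have REPLICA IDENTITY FULL".
def can_apply_update_delete(target_has_ri_index: bool,
                            target_has_pk: bool,
                            published_is_full: bool) -> bool:
    # Any one of the three is enough to locate rows on the subscriber.
    return target_has_ri_index or target_has_pk or published_is_full

print(can_apply_update_delete(False, False, False))  # the failing setup above
print(can_apply_update_delete(False, True, False))   # after adding the PK in step 7
```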

Then I repeated the steps for PostgreSQL 13.14.

1) I ran the docker-compose file with PostgreSQL 13:
docker-compose.yml:

version: '3.3'
services:
  postgres:
    image: postgres:13.14
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5432:5432"
  postgres2:
    image: postgres:13.14
    environment:
      POSTGRES_DB: "db"
      POSTGRES_USER: "user"
      POSTGRES_PASSWORD: "password"
      PGDATA: "/var/lib/postgresql/data/pgdata"
    command: -c config_file=/etc/postgresql.conf
    volumes:
      - ./data2:/var/lib/postgresql/data
      - ./postgresql.conf:/etc/postgresql.conf
      - ./pg_hba.conf:/etc/pg_hba.conf
    ports:
      - "5433:5432"

So I have two PostgreSQL servers.

postgresql.conf:

hba_file = '/etc/pg_hba.conf'

listen_addresses = '*'

ssl = off

wal_level = logical

pg_hba.conf:

# TYPE  DATABASE  USER  ADDRESS  METHOD
host    all       all   all      md5

2) I created a table on "postgres":

CREATE TABLE public.table (
field_1 uuid NOT NULL,
field_2 uuid NOT NULL,
field_3 uuid NOT NULL,
field_4 text NOT NULL,
field_5 text NOT NULL,
field_6 timestamp with time zone NOT NULL,
field_7 timestamp with time zone NOT NULL,
field_8 smallint NOT NULL,
field_9 text NOT NULL,
field_10 text NOT NULL,
field_11 uuid NOT NULL,
field_12 uuid,
field_13 text,
field_14 uuid,
field_15 text NOT NULL,
field_16 integer,
field_17 text NOT NULL,
field_18 uuid,
field_19 character varying(20),
field_20 character varying(29),
field_21 text,
field_22 bigint,
field_23 character varying(20)
);

ALTER TABLE ONLY public.table
ADD CONSTRAINT "PK_table" PRIMARY KEY (field_1);

3) Created a publication on "postgres":

CREATE ROLE replication_role WITH REPLICATION LOGIN PASSWORD 'password';

GRANT ALL PRIVILEGES ON DATABASE "db" TO replication_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO replication_role;
CREATE PUBLICATION db_name_public_publication;
ALTER PUBLICATION db_name_public_publication ADD TABLE "table";

4) I started two Python scripts that generate requests to "postgres".
The scripts insert.py and update.py were described earlier.

5) Created a subscription on postgres2:

CREATE SUBSCRIPTION db_name_public_subscription CONNECTION
'host=postgres port=5432 password=password user=replication_role
dbname=db' PUBLICATION db_name_public_publication;

6) Then I got these messages in the logs:

13-postgres-1 | 2024-05-01 16:26:15.348 GMT [40] STATEMENT:
CREATE_REPLICATION_SLOT "db_name_public_subscription" LOGICAL pgoutput
NOEXPORT_SNAPSHOT
13-postgres2-1 | 2024-05-01 16:26:15.378 GMT [39] LOG: logical
replication apply worker for subscription
"db_name_public_subscription" has started
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: starting
logical decoding for slot "db_name_public_subscription"
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: Streaming
transactions committing after 0/A0E10F0, reading WAL from 0/A0E1048.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: logical
decoding found initial starting point at 0/A0E1048
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: Waiting
for transactions (approximately 1) older than 66255 to end.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] LOG: logical
decoding found consistent point at 0/A0E10B8
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] DETAIL: There are
no running transactions.
13-postgres-1 | 2024-05-01 16:26:15.382 GMT [41] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription" LOGICAL 0/0
(proto_version '1', publication_names '"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:15.384 GMT [40] LOG: logical
replication table synchronization worker for subscription
"db_name_public_subscription", table "table" has started
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] LOG: logical
decoding found consistent point at 0/A0E45D8
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] DETAIL: There are
no running transactions.
13-postgres-1 | 2024-05-01 16:26:15.476 GMT [42] STATEMENT:
CREATE_REPLICATION_SLOT "db_name_public_subscription_16500_sync_16385"
TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] LOG: starting
logical decoding for slot
"db_name_public_subscription_16500_sync_16385"
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] DETAIL: Streaming
transactions committing after 0/A0E4610, reading WAL from 0/A0E45D8.
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385"
LOGICAL 0/A0E4610 (proto_version '1', publication_names
'"db_name_public_publication"')
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] LOG: logical
decoding found consistent point at 0/A0E45D8
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] DETAIL: There are
no running transactions.
13-postgres-1 | 2024-05-01 16:26:16.993 GMT [42] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385"
LOGICAL 0/A0E4610 (proto_version '1', publication_names
'"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:16.994 GMT [40] ERROR: logical
replication target relation "public.table" has neither REPLICA
IDENTITY index nor PRIMARY KEY and published relation does not have
REPLICA IDENTITY FULL
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] LOG: could not
send data to client: Connection reset by peer
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] CONTEXT: slot
"db_name_public_subscription_16500_sync_16385", output plugin
"pgoutput", in the commit callback, associated LSN 0/A0F0F58
13-postgres-1 | 2024-05-01 16:26:16.994 GMT [42] STATEMENT:
START_REPLICATION SLOT "db_name_public_subscription_16500_sync_16385"
LOGICAL 0/A0E4610 (proto_version '1', publication_names
'"db_name_public_publication"')
13-postgres2-1 | 2024-05-01 16:26:16.996 GMT [1] LOG: background
worker "logical replication worker" (PID 40) exited with exit code 1
13-postgres2-1 | 2024-05-01 16:26:20.393 GMT [41] LOG: logical
replication table synchronization worker for subscription
"db_name_public_subscription", table "table" has started
13-postgres-1 | 2024-05-01 16:26:20.461 GMT [43] LOG: logical
decoding found consistent point at 0/A12E068
13-postgres-1 | 2024-05-01 16:26:20.461 GMT [43] DETAIL: There are
no running transactions.

After these log messages, the table size on postgres was 50 MB, while the
table on postgres2 was 450 MB and still increasing.

I checked the table sizes with this query:

select
    table_schema,
    table_name,
    pg_relation_size('"'||table_schema||'"."'||table_name||'"'),
    pg_size_pretty(pg_relation_size('"'||table_schema||'"."'||table_name||'"'))
from information_schema.tables
where table_schema in ('public', 'saga', 'hangfire')
order by 3 desc;
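If it helps to track the growth from a monitoring script rather than SQL,
the pg_size_pretty output can be approximated in Python. This is a rough,
hand-rolled sketch that mimics the same unit steps; PostgreSQL's exact
rounding rules may differ slightly:

```python
# Approximate pg_size_pretty: stay in a unit while the value is below
# 10240 of that unit, otherwise divide by 1024 and move to the next one.
def size_pretty(num_bytes: int) -> str:
    units = ["bytes", "kB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units[:-1]:
        if abs(size) < 10240:
            return f"{round(size)} {unit}"
        size /= 1024
    return f"{round(size)} {units[-1]}"

print(size_pretty(50 * 1024 * 1024))   # the 50 MB table on postgres
print(size_pretty(450 * 1024 * 1024))  # the 450 MB table on postgres2
```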

Please help resolve this problem; it would allow us to migrate large tables.

I am looking forward to your answer.

I also attached files with logs to this email.

On Mon, 29 Apr 2024 at 13:14, Shlok Kyal <shlok(dot)kyal(dot)oss(at)gmail(dot)com> wrote:

> Hi
>
> > 3. This log was got for a table size of 39 GB but we have tables much
> bigger than this.
> >
> > Our workflow for the tables size 39 GB
> >
> > 1) Download schema from the source database instance
> >
> > 2) Deleted PK, FK, and Indexes from the table schema(we did it to
> increase data load, startup process)
> > 3)Upload the schema to the destination DB.
> > 4) configurated identity replication full at source DB for this table
> > 5) Configured Logical replication between source and destination DB this
> table
> > 6) During catchup on this table process we got the messages in log that
> you can see above.
> >
> > 7) We also tried create primary key for this table during catchup state
> but this process was blocked by logical replication worker so we had to
> cancel primary key creation process.
> >
> > I want to highlight that we used this workflow for PostgreSQL 10 and it
> worked.
> >
>
> I tried the steps you provided to reproduce the issue on Postgresql 13
> but, I was not able to reproduce it. I tried it for tables with size
> ~120GB and also tried it separately for table with size ~40GB.
>
> Steps I used:
> 1. Created a sourceDB and DestinationDB.
> 2.
> Populated sourceDB with data.
> "create table t1(c1 int PRIMARY KEY, c2 int);"
> "insert into t1 values(generate_series(1,1500000000), 2)"
> "update t1 set c1 = 0 where c1 = 1"
> 3.
> Deleted Primary Key on sourceDB
> "alter table t1 drop constraint t1_pkey"
> 4.
> Created schema on destinationDB
> "create table t1(c1 int, c2 int);"
> 5.
> Configured Replica identity full on sourceDB
> "ALTER TABLE t1 REPLICA IDENTITY FULL"
> 6.
> Configured Logical Replication Setup
> on sourceDB: "create publication pub1_new for all tables"
> on destinationDB: "create subscription test1_new connection
> 'dbname=postgres host=localhost port=5432' publication pub1_new;"
>
> For me the initial catchup was successful and I did not receive any
> issue in sourceDB logs or destinationDB logs.
> Can you point out if you did anything different?
>
> Also, For me the initial catchup succeeds even if I did not configure
> replica identity full on sourceDB
> Can you point out how your steps were different to get error related
> to replica identity full?
>
> Also, are you running any queries concurrently on sourceDB or
> destinationDB?
>
> Thanks and Regards,
> Shlok Kyal
>

Attachment Content-Type Size
logs_postgresql10.txt text/plain 26.3 KB
logs_postgresql13.txt text/plain 17.1 KB
