error: database is not accepting commands to avoid wraparound data loss in database ....

From: Josef Machytka <josef(dot)machytka(at)gmail(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: error: database is not accepting commands to avoid wraparound data loss in database ....
Date: 2019-03-20 11:01:22
Message-ID: CAGvVEFtZ1L9Lv=UTngA2uke71X78S+q95vPMU-v9ioUo6ygC_w@mail.gmail.com
Lists: pgsql-admin

Hi,
I would very much appreciate any help / advice with this problem, which
involves logical replication and a transaction ID wraparound error:

We have a production database collecting metrics data from web widgets -
hundreds of millions of records per day, most of them inserted in separate
transactions. Data are partitioned by day; we have 10 different parent
tables for different types of metrics.

Collected data are also copied into our main data warehouse database.
Previously we synchronized daily partitions once per day, but this became a
very heavy task because partitions are around 20 - 40 GB each, on some days
even more. So even with a parallel dump-restore it took several hours to
synchronize the data, causing quite a big load on both the replica dedicated
just for sync and the data warehouse.

Therefore I implemented logical replication to shift data continuously from
the collecting database into our main data warehouse database plus one other
database which contains only some of the data. The publication and
subscription always contained the partitions for tomorrow, today, and 2 days
backwards, because sometimes we experience quite big delays in the delivery
of data.
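
For reference, the publication and subscriptions were set up roughly like
this (table names, publication name and connection string are just
placeholders, not our real ones):

-- on the collecting database
CREATE PUBLICATION pub_metrics
    FOR TABLE metrics_20190320, metrics_20190321, metrics_20190322;

-- on each data warehouse database
CREATE SUBSCRIPTION sub_metrics
    CONNECTION 'host=collector dbname=metrics user=replication_user'
    PUBLICATION pub_metrics;

-- rotated daily
ALTER PUBLICATION pub_metrics ADD TABLE metrics_20190323;
ALTER PUBLICATION pub_metrics DROP TABLE metrics_20190320;
ALTER SUBSCRIPTION sub_metrics REFRESH PUBLICATION;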

It worked absolutely great for some time, but in the last several days we had
really heavy spikes in widget usage, so we collected many more records and
ended up with both data warehouse databases being blocked by the transaction
ID wraparound error.

It all happened very quickly during the night, and by the time we got the
first alerts the databases were already in trouble. We did not yet have
alerts for the messages "database "xxxxxx" must be vacuumed within xxxxxxx
transactions" because we had never encountered them before. They would help,
of course - I will implement them once I solve the main problem.
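
For that alert I am thinking of a simple check on age(datfrozenxid),
something like this sketch:

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

with a warning long before xid_age gets anywhere near the ~2 billion hard
limit.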

We have never seen this wraparound problem before. We also never had it on
the collecting database, even with this enormous amount of transactions.

But it looks like the subscriptions on the data warehouse databases
conflicted with autovacuum of the partitions. Logical replication most likely
replays all transactions 1:1, so suddenly we had an incredible number of
separate transactions on the data warehouse - which had never happened
before.

So we can now only select data from both data warehouse databases but cannot
do anything else - we always get the message:

error: database is not accepting commands to avoid wraparound data loss in
database "xxxxx"

Disabling or dropping the subscription does not work either - same message.

Because logical replication was not running, we saw a huge accumulation of
WAL segments on the collecting database - they were being retained by the
logical replication slots. Because I needed to avoid any problems on the
collecting database, I decided to drop all publications and logical
replication slots there. This immediately solved the problem with the WAL
segments.
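
On the collecting database that was essentially (publication name again a
placeholder):

DROP PUBLICATION pub_metrics;

SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_type = 'logical' AND NOT active;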

I looked up solutions for the wraparound error, and since all sources I found
suggested running VACUUM in single-user mode, I decided to test it on the
second, smaller data warehouse database (~16 TB of data), which is not so
crucial for us to keep running, since I knew it would be inaccessible during
that time. VACUUM has already been running for 24 hours on this database and
I can only wait for it to finish...
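
The single-user run on that database was started roughly like this (data
directory and database name are placeholders):

pg_ctl stop -D /var/lib/postgresql/11/main
postgres --single -D /var/lib/postgresql/11/main dwh_small

backend> VACUUM;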

But our main data warehouse database is even bigger, ~22 TB of data, and I
cannot shut it down for several days. Therefore I started to prepare a
replacement for it - since we can still read data from it, I can copy them
somewhere else. I also needed it for some daily aggregations, so the new
database currently contains only several days of data. The question is what
to do next...

The data warehouse database uses standard settings for vacuuming:

autovacuum on
autovacuum_analyze_scale_factor 0.1
autovacuum_analyze_threshold 50
autovacuum_freeze_max_age 200000000
autovacuum_max_workers 3
autovacuum_multixact_freeze_max_age 400000000
autovacuum_naptime 60
autovacuum_vacuum_cost_delay 20
autovacuum_vacuum_cost_limit -1
autovacuum_vacuum_scale_factor 0.2
autovacuum_vacuum_threshold 50
autovacuum_work_mem -1
log_autovacuum_min_duration -1
vacuum_cleanup_index_scale_factor 0.1
vacuum_cost_delay 0
vacuum_cost_limit 200
vacuum_cost_page_dirty 20
vacuum_cost_page_hit 1
vacuum_cost_page_miss 10
vacuum_defer_cleanup_age 0
vacuum_freeze_min_age 50000000
vacuum_freeze_table_age 150000000
vacuum_multixact_freeze_min_age 5000000
vacuum_multixact_freeze_table_age 150000000
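
For completeness - the databases still accept read-only queries, so to see
which tables are holding datfrozenxid back I can check something like:

SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY xid_age DESC
LIMIT 20;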

Can I do something else with the main data warehouse database than just this
painfully slow VACUUM? If I changed some of the freeze settings, would it
help in a way that the database would accept commands again, so that I could
drop the subscription and vacuum it in the normal way?

Thank you very much in advance for any help / advice you might have.

Josef Machytka
Berlin
