Re: Return of the pg_wal issue..

From: Saul Perdomo <saul(dot)perdomo(at)gmail(dot)com>
To: Paul Brindusa <paulbrindusa88(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Return of the pg_wal issue..
Date: 2025-01-23 14:26:51
Message-ID: CAN3jBgFEX-fhXuNkrMYwCeWjtYK2_zSrvEmefkUZciLPHK7Psw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hey Paul,

Regarding

*"I've not managed to test the queries out yet. But I am planning to test
out in my lab environment.*
*Sorry am really cautious about this as those are the main production
databases."*

As a dispassionate third-party observer, I can confirm that all SELECT and
SHOW queries from Laurenz's blog post are read-only. They're completely
safe to run in the affected environment.

On Thu, Jan 23, 2025 at 6:40 AM Paul Brindusa <paulbrindusa88(at)gmail(dot)com>
wrote:

> Hopefully the below is going to give a little bit more insight on the
> issue.
> I will mention as well that the cluster also replicates data to another
> mysql database if it's relevant at all.
> Also worth noting this is our production cluster and we have another
> pre-production cluster with basically the same settings and the issue there
> does not occur.
>
> A good deal more information is needed to troubleshoot this:
>
> 1) Postgres version(s).
>
> postgres (PostgreSQL) 15.10
>
> 2) The Patroni version.
>
> patroni 4.0.4
>
> 3) The Patroni configuration.
>
> scope: postgres-cluster
> name: db01
> namespace: /service/
>
> log:
> level: INFO
> traceback_level: ERROR
> format: "%(asctime)s %(levelname)s: %(message)s"
> dateformat: ""
> max_queue_size: 1000
> dir: /var/log/patroni
> file_num: 4
> file_size: 25000000
> loggers:
> patroni.postmaster: WARNING
> urllib3: WARNING
>
> restapi:
> listen: x.x.x.98:8008
> connect_address: x.x.x.98:8008
>
> etcd3:
> hosts: db01.local:2379,db02.local:2379,db03.local:2379
>
>
> bootstrap:
> dcs:
> ttl: 30
> loop_wait: 10
> retry_timeout: 10
> maximum_lag_on_failover: 1048576
> postgresql:
> use_pg_rewind: true
> use_slots: true
> parameters:
> max_connections: 500
> superuser_reserved_connections: 5
> password_encryption: scram-sha-256
> max_locks_per_transaction: 512
> max_prepared_transactions: 0
> huge_pages: try
> shared_buffers: 128MB
> effective_cache_size: 4GB
> work_mem: 128MB
> maintenance_work_mem: 256MB
> checkpoint_timeout: 15min
> checkpoint_completion_target: 0.9
> min_wal_size: 80MB
> max_wal_size: 1GB
> wal_buffers: 32MB
> default_statistics_target: 1000
> seq_page_cost: 1
> random_page_cost: 4
> effective_io_concurrency: 2
> synchronous_commit: on
> autovacuum: on
> autovacuum_max_workers: 5
> autovacuum_vacuum_scale_factor: 0.01
> autovacuum_analyze_scale_factor: 0.01
> autovacuum_vacuum_cost_limit: 500
> autovacuum_vacuum_cost_delay: 2
> autovacuum_naptime: 1s
> max_files_per_process: 4096
> archive_mode: on
> archive_timeout: 1800s
> archive_command: cd .
> wal_level: replica
> wal_keep_size: 2GB
> max_wal_senders: 10
> max_replication_slots: 10
> hot_standby: on
> wal_log_hints: on
> wal_compression: on
> shared_preload_libraries: pgaudit
> track_io_timing: on
> log_lock_waits: on
> log_temp_files: 0
> track_activities: on
> track_counts: on
> track_functions: all
> log_checkpoints: on
> logging_collector: on
> log_truncate_on_rotation: on
> log_rotation_age: 1d
> log_rotation_size: 1GB
> log_line_prefix: '%m [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h '
> log_filename: postgresql-%Y-%m-%d.log
> log_directory: /var/log/pgsql
> log_connections: on
> log_disconnections: on
> log_statement: ddl
> log_error_verbosity: verbose
> hot_standby_feedback: on
> max_standby_streaming_delay: 30s
> wal_receiver_status_interval: 10s
> idle_in_transaction_session_timeout: 10min
> jit: off
> max_worker_processes: 24
> max_parallel_workers: 8
> max_parallel_workers_per_gather: 2
> max_parallel_maintenance_workers: 2
>
> initdb:
> - encoding: UTF8
> - data-checksums
>
> pg_hba:
> - host replication replicator 127.0.0.1/32 md5
>
> - host replication replicator x.x.x.98/27 scram-sha-256
>
>
>
> - host replication replicator x.x.x.99/27 scram-sha-256
>
>
>
> - host replication replicator x.x.x.100/27 scram-sha-256
>
>
> - host all all 0.0.0.0/0 md5
>
> postgresql:
> listen: x.x.x.98:5432
> connect_address: x.x.x.98:5432
> data_dir: /var/lib/pgsql/data
> bin_dir: /usr/bin
> pgpass: /var/lib/pgsql/.pgpass_patroni
> authentication:
> replication:
> username: replicator
> password: password
> superuser:
> username: postgres
> password: password
> parameters:
> unix_socket_directories: /var/run/postgresql
>
> remove_data_directory_on_rewind_failure: false
> remove_data_directory_on_diverged_timelines: false
>
> create_replica_methods:
> - basebackup
> basebackup:
> max-rate: '100M'
> checkpoint: 'fast'
>
> watchdog:
> mode: required
> device: /dev/watchdog
> safety_margin: 5
>
> tags:
> nofailover: false
> noloadbalance: false
> clonefrom: false
> nosync: false
>
> 4) Definition of 'ridiculous rate'.
>
> 1GB / day
>
> 5) Relevant information from the logs.
>
> Below entry is something taken off today's log until this point in time
> which I think it might be relevant. I cannot see any specifics. If there is
> anything else please let me know.
>
> 2<REDACTED>:<REDACTED> GMT [186889]: [863-1] db=,user=,app=,client= LOG:
> 00000: checkpoint starting: time
> 2<REDACTED>:<REDACTED> GMT [186889]: [864-1] db=,user=,app=,client=
> LOCATION: LogCheckpointStart, xlog.c:6121
> 2<REDACTED>:<REDACTED> GMT [186889]: [865-1] db=,user=,app=,client= LOG:
> 00000: checkpoint complete: wrote 66 buffers (0.4%); 0 WAL file(s) added,
> 0 removed, 0 recycled; write=6.563 s, sync=0.003 s, total=6.619 s; sync
> files=22, longest=0.002 s, average=0.001 s; distance=776 kB, estimate=56426
> kB
> 2<REDACTED>:<REDACTED> GMT [186889]: [866-1] db=,user=,app=,client=
> LOCATION: LogCheckpointEnd, xlog.c:6202
> 2<REDACTED>:<REDACTED> GMT [2439188]: [7-1]
> db=documentation-database,user=documentation-database-user,app=PostgreSQL
> JDBC Driver,client=<REDACTED> LOG: 00000: disconnection: session time:
> 0:<REDACTED> user=documentation-database-user
> database=documentation-database host=<REDACTED> port=56170
>
>
> @Laurenz
>
> I guess you are referring to
> https://www.cybertec-postgresql.com/en/why-does-my-pg_wal-keep-growing/
>
> *Yes, that is the one.*
>
> I listed all the reasons I know for your predicament.
> Did you do some research along these lines?
>
> *I've had a look at the things that you have mentioned in the guide. *
>
> If yes, what did you find?
>
> *I've not managed to test the queries out yet. But I am planning to test
> out in my lab environment.*
> *Sorry am really cautious about this as those are the main production
> databases.*
>
> *Hope the above is going to give a bit of insight on the root cause of the
> problem.*
>
>
>
> Yours,
> Laurenz Albe
>
>
>
> On Wed, Jan 22, 2025 at 6:03 PM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
> wrote:
>
>> On 1/22/25 09:33, Paul Brindusa wrote:
>> > Good afternoon,
>> >
>> > Following below we are facing a similar issue and im getting a real
>> buzz
>> > to get this working myself, speaking to my DBA in the company has
>> > actually left me a bit cold as he is not good with postgres.
>> >
>> > So I want to try and get a solution for this and fix this issue with
>> the
>> > pg_wal files filling up the drive at a ridiculous rate. I have been
>> > manually moving logs to a different directory but have had no luck in
>> > finding an actual solution.
>> >
>> > The cluster is a 3 node cluster with HA which is running wirth patroni.
>> >
>> > Please help me out, I will mention that I have test cluster spun up in
>> > case something needs testing.
>> >
>> > Also want to give a shout out to Lorenz Albe's for posting stuff about
>> > wal files on his company blog.
>> >
>> > Again any help will be greatly appreciated.
>>
>> A good deal more information is needed to troubleshoot this:
>>
>> 1) Postgres version(s).
>>
>> 2) The Patroni version.
>>
>> 3) The Patroni configuration.
>>
>> 4) Definition of 'ridiculous rate'.
>>
>> 5) Relevant information from the logs.
>>
>> >
>> >
>> > " On one of our postgres instances we have the pg_wal/data folder up to
>> > 196GB, out of 200GB disk filled up.
>> > This has stopped the posgresql.service this morning causing two
>> > applications to crash.
>> > Unfortunately our database admin is on leave today, and we are trying
>> to
>> > figure out how to get the disk down?
>> > Any ideas or suggestions are more than welcome.
>> >
>> > Thank you in advance."
>> >
>> >
>> > --
>> > Kind Regards,
>> > Paul Brindusa
>> > paulbrindusa88(at)gmail(dot)com <mailto:paulbrindusa88(at)gmail(dot)com>
>> >
>>
>> --
>> Adrian Klaver
>> adrian(dot)klaver(at)aklaver(dot)com
>>
>>
>
> --
> Kind Regards,
> Paul Brindusa
> paulbrindusa88(at)gmail(dot)com
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Paul Brindusa 2025-01-23 14:40:40 Re: Return of the pg_wal issue..
Previous Message Paul Brindusa 2025-01-23 11:40:16 Re: Return of the pg_wal issue..