Re: Return of the pg_wal issue..

From: Paul Brindusa <paulbrindusa88(at)gmail(dot)com>
To: Saul Perdomo <saul(dot)perdomo(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Return of the pg_wal issue..
Date: 2025-01-23 15:56:48
Message-ID: CAFeSbqhZx-LxgDH48v-LqJd5QGJoQSpRFuGrW92_bTCT+Qv9dA@mail.gmail.com
Lists: pgsql-general

Wow, hats off to you, kind sir!
Will defo look into this. Thank you for your support with this.

On Thu, Jan 23, 2025 at 2:51 PM Saul Perdomo <saul(dot)perdomo(at)gmail(dot)com> wrote:

> In a nutshell: remember the ticker tape of old? (You know, the festive
> paper strips tossed out of high-rises by stockbrokers, that float on down
> on city parades after a world war is won in black-and-white movies?) Those
> were serial records of market transactions. Similar to those, the
> write-ahead log files are records of every single DB action that isn't
> read-only.
>
> Now, one could think "well if I've already committed the transactions to
> the database, why do I need to keep the receipts?" And, well, strictly
> speaking you don't -- but you really want to, because they could serve to
> replay, *in perfect order*, the database activity for any given period.
> This data is *gold* when it comes to, say, recovering from disaster.
>
> This is why everybody will tell you "don't just delete these files,
> archive them properly!" Again, for operational purposes, you could just
> delete them. But you really want to make a *copy* of them before you
> do... you know, *just in case* something bad happens to your DB that
> makes you want to roll it back in time.
>
> Enter the "archive_command" function. Instead of simply deleting them, you
> tell PG to pass the files on to another piece of software that knows what
> to do with this highly valuable data. To avoid complicating matters
> further, at this point I'll simply recommend that you use PGBackRest for
> this, it's my favorite piece of software beside the PG server.
>
> Or, you know, you could just delete them. You really don't want to,
> though. Get pgBackRest going and the concepts will click for you as you
> work through setting it up: https://pgbackrest.org/user-guide.html
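>
> For illustration only, once pgBackRest is set up the relevant settings end
> up looking roughly like this (the stanza name "main" is just a placeholder;
> the user guide walks you through the real values):
>
>     # postgresql.conf, or the Patroni "parameters" section -- sketch only
>     archive_mode = on
>     archive_command = 'pgbackrest --stanza=main archive-push %p'
>     # and, for point-in-time recovery, a matching
>     restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
>
> Once a segment has been archived successfully, PostgreSQL recycles or
> removes it on its own at a later checkpoint, so pg_wal stops growing
> without any manual deletion.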
>
> Cheers
> Saul
>
> On Thu, Jan 23, 2025 at 9:40 AM Paul Brindusa <paulbrindusa88(at)gmail(dot)com>
> wrote:
>
>> Hi Saul,
>>
>> Fantastic, highly appreciate that.
>> Initially I've taken the top bit of Laurenz's post to understand what
>> these WAL files are.
>> Any help is more than welcome; apologies for not checking that bit
>> earlier. In the meantime I've checked those queries as well.
>>
>> Thank you
>>
>>
>> On Thu, Jan 23, 2025 at 2:27 PM Saul Perdomo <saul(dot)perdomo(at)gmail(dot)com>
>> wrote:
>>
>>> Hey Paul,
>>>
>>> Regarding
>>>
>>> *"I've not managed to test the queries out yet. But I am planning to
>>> test out in my lab environment.*
>>> *Sorry am really cautious about this as those are the main production
>>> databases."*
>>>
>>> As a dispassionate third-party observer, I can confirm that all SELECT
>>> and SHOW queries from Laurenz's blog post are read-only. They're completely
>>> safe to run in the affected environment.
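>>>
>>> For instance, a representative read-only sketch along those lines (not
>>> necessarily the exact queries from the blog post) would be:
>>>
>>>     SHOW archive_mode;
>>>     SHOW archive_command;
>>>     -- has WAL archiving been failing, and since when?
>>>     SELECT * FROM pg_stat_archiver;
>>>     -- is a replication slot pinning old WAL?
>>>     SELECT slot_name, slot_type, active, restart_lsn
>>>     FROM pg_replication_slots;
>>>
>>> None of these modify anything, so they are safe on the production cluster.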
>>>
>>> On Thu, Jan 23, 2025 at 6:40 AM Paul Brindusa <paulbrindusa88(at)gmail(dot)com>
>>> wrote:
>>>
>>>> Hopefully the below will give a little more insight into the issue.
>>>> I will also mention that the cluster replicates data to another MySQL
>>>> database, if that's relevant at all.
>>>> Also worth noting: this is our production cluster, and we have another
>>>> pre-production cluster with basically the same settings where the issue
>>>> does not occur.
>>>>
>>>> A good deal more information is needed to troubleshoot this:
>>>>
>>>> 1) Postgres version(s).
>>>>
>>>> postgres (PostgreSQL) 15.10
>>>>
>>>> 2) The Patroni version.
>>>>
>>>> patroni 4.0.4
>>>>
>>>> 3) The Patroni configuration.
>>>>
>>>> scope: postgres-cluster
>>>> name: db01
>>>> namespace: /service/
>>>>
>>>> log:
>>>>   level: INFO
>>>>   traceback_level: ERROR
>>>>   format: "%(asctime)s %(levelname)s: %(message)s"
>>>>   dateformat: ""
>>>>   max_queue_size: 1000
>>>>   dir: /var/log/patroni
>>>>   file_num: 4
>>>>   file_size: 25000000
>>>>   loggers:
>>>>     patroni.postmaster: WARNING
>>>>     urllib3: WARNING
>>>>
>>>> restapi:
>>>>   listen: x.x.x.98:8008
>>>>   connect_address: x.x.x.98:8008
>>>>
>>>> etcd3:
>>>>   hosts: db01.local:2379,db02.local:2379,db03.local:2379
>>>>
>>>> bootstrap:
>>>>   dcs:
>>>>     ttl: 30
>>>>     loop_wait: 10
>>>>     retry_timeout: 10
>>>>     maximum_lag_on_failover: 1048576
>>>>     postgresql:
>>>>       use_pg_rewind: true
>>>>       use_slots: true
>>>>       parameters:
>>>>         max_connections: 500
>>>>         superuser_reserved_connections: 5
>>>>         password_encryption: scram-sha-256
>>>>         max_locks_per_transaction: 512
>>>>         max_prepared_transactions: 0
>>>>         huge_pages: try
>>>>         shared_buffers: 128MB
>>>>         effective_cache_size: 4GB
>>>>         work_mem: 128MB
>>>>         maintenance_work_mem: 256MB
>>>>         checkpoint_timeout: 15min
>>>>         checkpoint_completion_target: 0.9
>>>>         min_wal_size: 80MB
>>>>         max_wal_size: 1GB
>>>>         wal_buffers: 32MB
>>>>         default_statistics_target: 1000
>>>>         seq_page_cost: 1
>>>>         random_page_cost: 4
>>>>         effective_io_concurrency: 2
>>>>         synchronous_commit: on
>>>>         autovacuum: on
>>>>         autovacuum_max_workers: 5
>>>>         autovacuum_vacuum_scale_factor: 0.01
>>>>         autovacuum_analyze_scale_factor: 0.01
>>>>         autovacuum_vacuum_cost_limit: 500
>>>>         autovacuum_vacuum_cost_delay: 2
>>>>         autovacuum_naptime: 1s
>>>>         max_files_per_process: 4096
>>>>         archive_mode: on
>>>>         archive_timeout: 1800s
>>>>         archive_command: cd .
>>>>         wal_level: replica
>>>>         wal_keep_size: 2GB
>>>>         max_wal_senders: 10
>>>>         max_replication_slots: 10
>>>>         hot_standby: on
>>>>         wal_log_hints: on
>>>>         wal_compression: on
>>>>         shared_preload_libraries: pgaudit
>>>>         track_io_timing: on
>>>>         log_lock_waits: on
>>>>         log_temp_files: 0
>>>>         track_activities: on
>>>>         track_counts: on
>>>>         track_functions: all
>>>>         log_checkpoints: on
>>>>         logging_collector: on
>>>>         log_truncate_on_rotation: on
>>>>         log_rotation_age: 1d
>>>>         log_rotation_size: 1GB
>>>>         log_line_prefix: '%m [%p]: [%l-1] db=%d,user=%u,app=%a,client=%h '
>>>>         log_filename: postgresql-%Y-%m-%d.log
>>>>         log_directory: /var/log/pgsql
>>>>         log_connections: on
>>>>         log_disconnections: on
>>>>         log_statement: ddl
>>>>         log_error_verbosity: verbose
>>>>         hot_standby_feedback: on
>>>>         max_standby_streaming_delay: 30s
>>>>         wal_receiver_status_interval: 10s
>>>>         idle_in_transaction_session_timeout: 10min
>>>>         jit: off
>>>>         max_worker_processes: 24
>>>>         max_parallel_workers: 8
>>>>         max_parallel_workers_per_gather: 2
>>>>         max_parallel_maintenance_workers: 2
>>>>
>>>>   initdb:
>>>>     - encoding: UTF8
>>>>     - data-checksums
>>>>
>>>>   pg_hba:
>>>>     - host replication replicator 127.0.0.1/32 md5
>>>>     - host replication replicator x.x.x.98/27 scram-sha-256
>>>>     - host replication replicator x.x.x.99/27 scram-sha-256
>>>>     - host replication replicator x.x.x.100/27 scram-sha-256
>>>>     - host all all 0.0.0.0/0 md5
>>>>
>>>> postgresql:
>>>>   listen: x.x.x.98:5432
>>>>   connect_address: x.x.x.98:5432
>>>>   data_dir: /var/lib/pgsql/data
>>>>   bin_dir: /usr/bin
>>>>   pgpass: /var/lib/pgsql/.pgpass_patroni
>>>>   authentication:
>>>>     replication:
>>>>       username: replicator
>>>>       password: password
>>>>     superuser:
>>>>       username: postgres
>>>>       password: password
>>>>   parameters:
>>>>     unix_socket_directories: /var/run/postgresql
>>>>
>>>>   remove_data_directory_on_rewind_failure: false
>>>>   remove_data_directory_on_diverged_timelines: false
>>>>
>>>>   create_replica_methods:
>>>>     - basebackup
>>>>   basebackup:
>>>>     max-rate: '100M'
>>>>     checkpoint: 'fast'
>>>>
>>>> watchdog:
>>>>   mode: required
>>>>   device: /dev/watchdog
>>>>   safety_margin: 5
>>>>
>>>> tags:
>>>>   nofailover: false
>>>>   noloadbalance: false
>>>>   clonefrom: false
>>>>   nosync: false
>>>>
>>>> 4) Definition of 'ridiculous rate'.
>>>>
>>>> 1GB / day
>>>>
>>>> 5) Relevant information from the logs.
>>>>
>>>> The entry below is taken from today's log up to this point in time,
>>>> which I think might be relevant. I cannot see any specifics. If there
>>>> is anything else you need, please let me know.
>>>>
>>>> 2<REDACTED>:<REDACTED> GMT [186889]: [863-1] db=,user=,app=,client=
>>>> LOG: 00000: checkpoint starting: time
>>>> 2<REDACTED>:<REDACTED> GMT [186889]: [864-1] db=,user=,app=,client=
>>>> LOCATION: LogCheckpointStart, xlog.c:6121
>>>> 2<REDACTED>:<REDACTED> GMT [186889]: [865-1] db=,user=,app=,client=
>>>> LOG: 00000: checkpoint complete: wrote 66 buffers (0.4%); 0 WAL file(s)
>>>> added, 0 removed, 0 recycled; write=6.563 s, sync=0.003 s, total=6.619 s;
>>>> sync files=22, longest=0.002 s, average=0.001 s; distance=776 kB,
>>>> estimate=56426 kB
>>>> 2<REDACTED>:<REDACTED> GMT [186889]: [866-1] db=,user=,app=,client=
>>>> LOCATION: LogCheckpointEnd, xlog.c:6202
>>>> 2<REDACTED>:<REDACTED> GMT [2439188]: [7-1]
>>>> db=documentation-database,user=documentation-database-user,app=PostgreSQL
>>>> JDBC Driver,client=<REDACTED> LOG: 00000: disconnection: session time:
>>>> 0:<REDACTED> user=documentation-database-user
>>>> database=documentation-database host=<REDACTED> port=56170
>>>>
>>>>
>>>> @Laurenz
>>>>
>>>> I guess you are referring to
>>>> https://www.cybertec-postgresql.com/en/why-does-my-pg_wal-keep-growing/
>>>>
>>>> *Yes, that is the one.*
>>>>
>>>> I listed all the reasons I know for your predicament.
>>>> Did you do some research along these lines?
>>>>
>>>> *I've had a look at the things that you have mentioned in the guide.*
>>>>
>>>> If yes, what did you find?
>>>>
>>>> *I've not managed to test the queries out yet, but I am planning to
>>>> test them out in my lab environment.*
>>>> *Sorry, I am really cautious about this as those are the main production
>>>> databases.*
>>>>
>>>> *Hope the above gives a bit of insight into the root cause of the
>>>> problem.*
>>>>
>>>>
>>>>
>>>> Yours,
>>>> Laurenz Albe
>>>>
>>>>
>>>>
>>>> On Wed, Jan 22, 2025 at 6:03 PM Adrian Klaver <
>>>> adrian(dot)klaver(at)aklaver(dot)com> wrote:
>>>>
>>>>> On 1/22/25 09:33, Paul Brindusa wrote:
>>>>> > Good afternoon,
>>>>> >
>>>>> > Following on from the below, we are facing a similar issue, and I'm
>>>>> > getting a real buzz to get this working myself; speaking to my DBA
>>>>> > in the company has actually left me a bit cold, as he is not good
>>>>> > with Postgres.
>>>>> >
>>>>> > So I want to try to find a solution and fix this issue with the
>>>>> > pg_wal files filling up the drive at a ridiculous rate. I have been
>>>>> > manually moving logs to a different directory but have had no luck
>>>>> > in finding an actual solution.
>>>>> >
>>>>> > The cluster is a 3-node cluster with HA, running with Patroni.
>>>>> >
>>>>> > Please help me out; I will mention that I have a test cluster spun
>>>>> > up in case something needs testing.
>>>>> >
>>>>> > Also want to give a shout-out to Laurenz Albe for posting stuff
>>>>> > about WAL files on his company blog.
>>>>> >
>>>>> > Again, any help will be greatly appreciated.
>>>>>
>>>>> A good deal more information is needed to troubleshoot this:
>>>>>
>>>>> 1) Postgres version(s).
>>>>>
>>>>> 2) The Patroni version.
>>>>>
>>>>> 3) The Patroni configuration.
>>>>>
>>>>> 4) Definition of 'ridiculous rate'.
>>>>>
>>>>> 5) Relevant information from the logs.
>>>>>
>>>>> >
>>>>> >
>>>>> > " On one of our postgres instances we have the pg_wal/data folder up
>>>>> to
>>>>> > 196GB, out of 200GB disk filled up.
>>>>> > This has stopped the posgresql.service this morning causing two
>>>>> > applications to crash.
>>>>> > Unfortunately our database admin is on leave today, and we are
>>>>> trying to
>>>>> > figure out how to get the disk down?
>>>>> > Any ideas or suggestions are more than welcome.
>>>>> >
>>>>> > Thank you in advance."
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Kind Regards,
>>>>> > Paul Brindusa
>>>>> > paulbrindusa88(at)gmail(dot)com <mailto:paulbrindusa88(at)gmail(dot)com>
>>>>> >
>>>>>
>>>>> --
>>>>> Adrian Klaver
>>>>> adrian(dot)klaver(at)aklaver(dot)com
>>>>>
>>>>>
>>>>
>>>> --
>>>> Kind Regards,
>>>> Paul Brindusa
>>>> paulbrindusa88(at)gmail(dot)com
>>>>
>>>>
>>
>> --
>> Kind Regards,
>> Paul Brindusa
>> paulbrindusa88(at)gmail(dot)com
>>
>>

--
Kind Regards,
Paul Brindusa
paulbrindusa88(at)gmail(dot)com
