From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-21 22:49:20
Message-ID: 20200321224920.GB1763544@rfd.leadboat.com
Lists: pgsql-hackers
On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments. However, a diverse minority of buildfarm members are failing
> like this, in most branches:
>
> Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 # got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................
> -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
>
> Since I run two of the failing animals, I expect to reproduce this soon.
Setting force_parallel_mode = regress was what I needed to reproduce this:

printf '%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf
The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid. I hacked up
the attached patch to understand the scope of the problem (not to commit). It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL(). Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them. You can see the lower-volume calls
like this:
printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf
make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
find . -name log | xargs grep -rl 'nm0 invalid'
Not all are actual bugs. For example, get_relation_info() behaves fine:
/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())
Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?
Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC. That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in. I will
leave the three smaller patches in place.
> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
This did not remain specific to 9.5. On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB. A simple s/1TB/1GB/ in
the test should fix this.
Attachment: debug-parallel-skip-wal-v0.patch (text/plain, 5.1 KB)