From: Erik Rijkers <er(at)xs4all(dot)nl>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject: Re: snapbuild woes
Date: 2017-05-09 17:54:37
Message-ID: 84a48a73304ed5363433ec4b9fbb4438@xs4all.nl
Lists: pgsql-hackers
On 2017-05-09 11:50, Petr Jelinek wrote:
> I rebased the above mentioned patch to apply to the patches Andres sent,
> if you could try to add it on top of what you have and check if it still
> fails, that would be helpful.
It still fails.
With these patches, built again on top of 44c528810a1:
- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
- 00002-WIP-Possibly-more-robust-snapbuild-approach.patch
- fix-statistics-reporting-in-logical-replication-work.patch
- Skip-unnecessary-snapshot-builds.patch
(I had to add the 'fix-statistics-rep*' patch because without it I immediately got that assertion failure again.)
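(For completeness, the build amounts to roughly the following; this is a sketch, not the exact commands. I'm assuming the patches apply with plain patch -p1 on a checkout of 44c528810a1, and the --prefix is just my local install path:)

  git checkout 44c528810a1
  # apply the four patches in order (assumed: plain patch -p1)
  patch -p1 < 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
  patch -p1 < 00002-WIP-Possibly-more-robust-snapbuild-approach.patch
  patch -p1 < fix-statistics-reporting-in-logical-replication-work.patch
  patch -p1 < Skip-unnecessary-snapshot-builds.patch
  ./configure --prefix=/var/data1/pg_stuff/pg_installations/pgsql.logical_replication
  make && make install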
As always, most runs succeed (especially on this large 192 GB, 16-core server).
But attached is an output file from a number of runs of my pgbench_derail2.sh test.
Overall result:
-- out_20170509_1635.txt
3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n -- scale 25
2 -- All is well.
1 -- Not good, but breaking out of wait (21 times no change)
I broke it off after iteration 4, so iteration 5 never ran, and
iteration 1 failed due to a mistake in the harness (something stupid I
did) - not interesting.
iteration 2 succeeds. (eventually has 'replica ok')
iteration 3 succeeds. (eventually has 'replica ok')
iteration 4 fails.
Just after 'alter subscription sub1 enable' I caught (as is usual)
pg_stat_replication.state as 'catchup'. So far so good.
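(That check runs against the publisher; roughly like this, a sketch only, where I assume $port2 is the subscriber's port and $port1 the publisher's, as in the query further down:)

  # assumed ports: $port2 = subscriber, $port1 = publisher
  psql -p $port2 -c "alter subscription sub1 enable"
  psql -p $port1 -c "select state, application_name from pg_stat_replication"
  # right after enabling, state shows 'catchup'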
After the 15-minute pgbench run pg_stat_replication has only 2
'startup' lines (and none 'catchup' or 'streaming'):
 port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
------+---------------------+--------+-------------+------------+------+----------+---------+---------+-----------
 6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |          | startup | derail2 | async
 6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |          | startup | derail2 | async
(that's from:
select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
     , pg_current_wal_location() wal, replay_location replay_loc
     , pg_current_wal_location() - replay_location as diff
     , pg_current_wal_location() <= replay_location
     , state, application_name as app, sync_state
  from pg_stat_replication
)
This remains in this state for as long as my test program lets it
(i.e., 20 x 30 s, or something like that, and then the loop is exited);
in the output file it says: 'Not good, but breaking out of wait'
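(The wait is roughly a loop like this; a sketch only, the real check and counts in pgbench_derail2.sh may differ, and $port1 is again assumed to be the publisher's port:)

  # sketch of the harness wait loop (assumed, not the actual pgbench_derail2.sh)
  tries=20
  while [ $tries -gt 0 ]; do
    caught_up=$(psql -qtAX -p $port1 -c \
      "select bool_and(pg_current_wal_location() <= replay_location) from pg_stat_replication")
    if [ "$caught_up" = "t" ]; then
      echo "replica ok"      # replay has reached the current WAL position
      break
    fi
    sleep 30
    tries=$((tries - 1))
  done
  [ $tries -eq 0 ] && echo "Not good, but breaking out of wait"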
Below is the accompanying ps (with the 2 'deranged senders' as Jeff
Janes would surely call them):
UID       PID     PPID  C STIME TTY    STAT TIME CMD
rijkers  107147       1 0 17:11 pts/35 S+   0:00 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers  107149  107147 0 17:11 ?      Ss   0:00  \_ postgres: logger process
rijkers  107299  107147 0 17:11 ?      Ss   0:01  \_ postgres: checkpointer process
rijkers  107300  107147 0 17:11 ?      Ss   0:00  \_ postgres: writer process
rijkers  107301  107147 0 17:11 ?      Ss   0:00  \_ postgres: wal writer process
rijkers  107302  107147 0 17:11 ?      Ss   0:00  \_ postgres: autovacuum launcher process
rijkers  107303  107147 0 17:11 ?      Ss   0:00  \_ postgres: stats collector process
rijkers  107304  107147 0 17:11 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
rijkers  108348  107147 0 17:12 ?      Ss   0:01  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293
rijkers  108350  107147 0 17:12 ?      Ss   0:00  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298
rijkers  107145       1 0 17:11 pts/35 S+   0:02 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers  107151  107145 0 17:11 ?      Ss   0:00  \_ postgres: logger process
rijkers  107160  107145 0 17:11 ?      Ss   0:08  \_ postgres: checkpointer process
rijkers  107161  107145 0 17:11 ?      Ss   0:07  \_ postgres: writer process
rijkers  107162  107145 0 17:11 ?      Ss   0:02  \_ postgres: wal writer process
rijkers  107163  107145 0 17:11 ?      Ss   0:00  \_ postgres: autovacuum launcher process
rijkers  107164  107145 0 17:11 ?      Ss   0:02  \_ postgres: stats collector process
rijkers  107165  107145 0 17:11 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
rijkers  108349  107145 0 17:12 ?      Ss   0:27  \_ postgres: wal sender process rijkers [local] idle
rijkers  108351  107145 0 17:12 ?      Ss   0:26  \_ postgres: wal sender process rijkers [local] idle
I have had no time to add (or view) any CPUinfo.
Erik Rijkers
Attachment: out_20170509_1635.txt (application/x-elc, 15.7 KB)