| From: | Erik Rijkers <er(at)xs4all(dot)nl> | 
|---|---|
| To: | Andres Freund <andres(at)anarazel(dot)de> | 
| Cc: | Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com> | 
| Subject: | Re: snapbuild woes | 
| Date: | 2017-05-08 22:03:12 | 
| Message-ID: | 076a3e4c3d4fef091522d3cd7e50c7eb@xs4all.nl | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
On 2017-05-05 02:00, Andres Freund wrote:
> 
> Could you have a look?
Running tests with these three patches:
> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
> fix-statistics-reporting-in-logical-replication-work.patch
     (on top of 44c528810)
I test by 15-minute pgbench runs while there is a logical replication 
connection. Primary and replica are on the same machine.
I have seen errors on 3 different machines (where error means: at least 
1 of the 4 pgbench tables is not md5-equal). It seems better, faster 
machines yield less errors.
Normally I see in pg_stat_replication (on master) one process in state 
'streaming'.
  pid  |     wal     | replay_loc  |   diff   |   state   |   app   | 
sync_state
16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | 
async
Often there are another two processes in pg_stat_replication that remain 
in state 'startup'.
In the failing sessions the 'streaming'-state process is missing; in 
failing sessions there are only the two processes that are and remain in 
'startup'.
FWIW, below the output of a succesful and a failed run:
------------------ successful run:
creating tables...
1590400 of 2500000 tuples (63%) done (elapsed 5.34 s, remaining 3.05 s)
2500000 of 2500000 tuples (100%) done (elapsed 9.63 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' 
publication pub1 with (disabled);
alter subscription sub1 enable;
-- pgbench -c 90 -j 8 -T 900 -P 180 -n   --  scale 25
progress: 180.0 s, 82.5 tps, lat 1086.845 ms stddev 3211.785
progress: 360.0 s, 25.4 tps, lat 3469.040 ms stddev 6297.440
progress: 540.0 s, 28.9 tps, lat 3131.438 ms stddev 4288.130
progress: 720.0 s, 27.5 tps, lat 3285.024 ms stddev 4113.841
progress: 900.0 s, 47.2 tps, lat 1896.698 ms stddev 2182.695
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 38175
latency average = 2128.606 ms
latency stddev = 3948.634 ms
tps = 42.151205 (including connections establishing)
tps = 42.151589 (excluding connections establishing)
-- waiting 0s... (always)
  port | pg_stat_replication | pid  |     wal     | replay_loc  | diff | 
?column? |   state   |   app   | sync_state
  6972 | pg_stat_replication | 2545 | 18/432B2180 | 18/432B2180 |    0 | 
t        | streaming | derail2 | async
2017.05.08 23:19:22
-- getting md5 (cb)
6972 a,b,t,h:  2500000     25    250  38175  b2ba48b53 b3788a837 
d1afac950 d4abcc72e master
6973 a,b,t,h:  2500000     25    250  38175  b2ba48b53 b3788a837 
d1afac950 d4abcc72e replica ok  bee2312c7
2017.05.08 23:20:48
  port | pg_stat_replication | pid  |     wal     | replay_loc  |   diff  
  | ?column? |   state   |   app   | sync_state
  6972 | pg_stat_replication | 2545 | 18/4AEEC8C0 | 18/453FBD20 | 
95357856 | f        | streaming | derail2 | async
------------------------------------
------------------ failure:
creating tables...
1777100 of 2500000 tuples (71%) done (elapsed 5.06 s, remaining 2.06 s)
2500000 of 2500000 tuples (100%) done (elapsed 7.41 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' 
publication pub1 with (disabled);
alter subscription sub1 enable;
  port | pg_stat_replication |  pid  |     wal     | replay_loc | diff | 
?column? |  state  |   app   | sync_state
  6972 | pg_stat_replication | 11945 | 18/5E2913D0 |            |      |  
         | catchup | derail2 | async
-- pgbench -c 90 -j 8 -T 900 -P 180 -n   --  scale 25
progress: 180.0 s, 78.4 tps, lat 1138.348 ms stddev 2884.815
progress: 360.0 s, 69.2 tps, lat 1309.716 ms stddev 2594.231
progress: 540.0 s, 59.0 tps, lat 1519.146 ms stddev 2033.400
progress: 720.0 s, 62.9 tps, lat 1421.854 ms stddev 1775.066
progress: 900.0 s, 57.0 tps, lat 1575.693 ms stddev 1681.800
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 58846
latency average = 1378.259 ms
latency stddev = 2304.159 ms
tps = 65.224168 (including connections establishing)
tps = 65.224788 (excluding connections establishing)
-- waiting 0s... (always)
  port | pg_stat_replication |  pid  |     wal     | replay_loc | diff | 
?column? |  state  |   app   | sync_state
  6972 | pg_stat_replication | 11948 | 18/7469A038 |            |      |  
         | startup | derail2 | async
  6972 | pg_stat_replication | 12372 | 18/7469A038 |            |      |  
         | startup | derail2 | async
------------------------------------
During my tests, I keep an eye on pg_stat_replication (refreshing every 
2s), and when I see those two 'startup'-lines in pg_stat_replication 
without any 'streaming'-line I know the test is going to fail.  I 
believe this has been true for all failure cases that I've seen (except 
the much more rare stuck-DROP-SUBSCRIPTION which is mentioned in another 
thread).
Sorry, I have not been able to get any thing more clear or definitive...
thanks,
Erik Rijkers
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Masahiko Sawada | 2017-05-09 00:32:01 | Re: Logical replication - TRAP: FailedAssertion in pgstat.c | 
| Previous Message | Petr Jelinek | 2017-05-08 21:55:29 | Re: logical replication syntax (was DROP SUBSCRIPTION, query cancellations and slot handling) |