From: Aleksander Alekseev <aleksander(at)timescale(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: base backup vs. concurrent truncation
Date: 2023-04-21 21:19:36
Message-ID: CAJ7c6TOt28F-uhMgoYOa_TXvgzvpVtVJWPC7kBraWT_FrGDGMA@mail.gmail.com
Lists: pgsql-hackers
Hi,
> > Oh, I see. If the process is killed this perhaps is not going to
> > happen. Whether this can happen if we pull the plug from the machine
> > probably depends on the design of the particular filesystem and
> > whether it's journaled.
>
> Right. I mentioned that scenario in the original email:
>
> "Furthermore, I think that the problem could arise without performing a
> backup at all: say that the server crashes on the OS level in
> mid-truncation, and the truncation of segment 0 reaches disk but the
> removal of segment 1 does not."
Right. I didn't fully understand this scenario at first.
I tried to reproduce it to see what actually happens. Here is what I did.
```
create table truncateme(id integer, val varchar(1024));
alter table truncateme set (autovacuum_enabled = off);
-- takes ~30 seconds
insert into truncateme
select id,
(
select string_agg(chr((33+random()*(126-33)) :: integer), '')
from generate_series(1,1000)
)
from generate_series(1,2*1024*1024) as id;
checkpoint;
select relfilenode from pg_class where relname = 'truncateme';
relfilenode
-------------
32778
```
We got 3 segments (and no VM fork since I disabled VACUUM):
```
$ cd ~/pginstall/data-master/base/16384/
$ ls -lah 32778*
-rw------- 1 eax staff 1.0G Apr 21 23:47 32778
-rw------- 1 eax staff 1.0G Apr 21 23:48 32778.1
-rw------- 1 eax staff 293M Apr 21 23:48 32778.2
-rw------- 1 eax staff 608K Apr 21 23:48 32778_fsm
```
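As a sanity check, these sizes match a back-of-the-envelope estimate (a rough sketch; the per-row overhead figure is an approximation, not an exact PostgreSQL number):

```python
SEG_SIZE = 1024 ** 3      # default segment size: 1 GiB
BLCKSZ = 8192             # default block size: 8 KiB
ROWS = 2 * 1024 * 1024
# ~1000 one-byte chars + a 4-byte integer, plus roughly 29 bytes of
# tuple header / line pointer / varlena overhead (approximate)
row_bytes = 1000 + 4 + 29
rows_per_page = BLCKSZ // row_bytes            # ~7 rows per page
pages = -(-ROWS // rows_per_page)              # ceiling division
total_bytes = pages * BLCKSZ
segments = -(-total_bytes // SEG_SIZE)
print(segments, round(total_bytes / 1024**3, 2))  # 3 segments, ~2.29 GiB
```

That agrees with the observed 1.0G + 1.0G + 293M ≈ 2.29 GiB spread over 3 segments.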
Let's back up the last segment:
```
$ cp 32778.2 ~/temp/32778.2_backup
```
Truncate the table:
```
delete from truncateme where id > 1024*1024;
vacuum truncateme;
```
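The DELETE removes the upper half of the ids, so VACUUM can truncate the relation to roughly half its size. The same rough arithmetic as above (approximate overhead figures, not exact PostgreSQL numbers) predicts the segment sizes seen later in ls:

```python
SEG_SIZE = 1024 ** 3               # default segment size: 1 GiB
BLCKSZ = 8192                      # default block size: 8 KiB
rows_left = 1024 * 1024            # ids 1 .. 1024*1024 survive the DELETE
rows_per_page = 7                  # ~ (1000 + 4 + ~29)-byte rows per page
pages = -(-rows_left // rows_per_page)         # ceiling division
total_bytes = pages * BLCKSZ                   # ~1.14 GiB
# bytes spilling past the full segment 0 into segment .1
spill_mib = (total_bytes - SEG_SIZE) / 1024**2
print(round(spill_mib))            # ~146, matching the 146M for 32778.1
```

So segment 0 stays at 1 GiB, .1 shrinks to about 146 MiB, and .2 drops to zero.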
And kill Postgres:
```
$ pkill -9 postgres
```
... before it finishes another checkpoint:
```
LOG: checkpoints are occurring too frequently (4 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".
LOG: checkpoint starting: wal
[and there was no "checkpoint complete"]
```
Next:
```
$ ls -lah 32778*
-rw------- 1 eax staff 1.0G Apr 21 23:50 32778
-rw------- 1 eax staff 146M Apr 21 23:50 32778.1
-rw------- 1 eax staff 0B Apr 21 23:50 32778.2
-rw------- 1 eax staff 312K Apr 21 23:50 32778_fsm
-rw------- 1 eax staff 40K Apr 21 23:50 32778_vm
$ cp ~/temp/32778.2_backup ./32778.2
```
I believe this simulates the case where pg_basebackup did a checkpoint,
copied 32778.2 before the rest of the segments, Postgres then truncated
the table, and pg_basebackup received the rest of the data, including
the WAL. The WAL contains the record of the truncation; see
RelationTruncate(). Just as an observation: we keep zero-sized
segments instead of deleting them.
So if I start Postgres now I expect it to return to a consistent
state, ideally the same state it had before the crash in terms of the
segments.
What I actually get is:
```
LOG: database system was interrupted; last known up at 2023-04-21 23:50:08 MSK
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 9/C4035B18
LOG: invalid record length at 9/E8FCBF10: expected at least 24, got 0
LOG: redo done at 9/E8FCBEB0 system usage: CPU: user: 1.58 s, system:
0.96 s, elapsed: 2.61 s
LOG: checkpoint starting: end-of-recovery immediate wait
LOG: checkpoint complete: wrote 10 buffers (0.0%); 0 WAL file(s)
added, 0 removed, 36 recycled; write=0.005 s, sync=0.003 s,
total=0.016 s; sync files=9, longest=0.002 s, average=0.001 s;
distance=605784 kB, estimate=605784 kB; lsn=9/E8FCBF10, redo
lsn=9/E8FCBF10
LOG: database system is ready to accept connections
```
... and:
```
-rw------- 1 eax staff 1.0G Apr 21 23:50 32778
-rw------- 1 eax staff 146M Apr 21 23:53 32778.1
-rw------- 1 eax staff 0B Apr 21 23:53 32778.2
```
Naturally, the database is consistent too. So it looks like the
recovery protocol worked as expected after all, at least in the
upcoming PG16 and in this specific scenario.
Did I miss anything? If not, perhaps we should just update the comments a bit?
--
Best regards,
Aleksander Alekseev