From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
Cc: | Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, Antonin Houska <ah(at)cybertec(dot)at> |
Subject: | Re: AIO v2.5 |
Date: | 2025-04-14 16:06:24 |
Message-ID: | 4qk3ehe6w7x7hfrldei2hefjcb7v7nfmj2owl2ir64craqcapz@kbrao22ljxeb |
Lists: | pgsql-hackers |
Hi,
On 2025-04-13 09:00:01 +0300, Alexander Lakhin wrote:
> 07.04.2025 22:10, Alexander Lakhin wrote:
> > > I ran it for a while in a VM, it hasn't triggered yet. Neither on xfs nor on
> > > tmpfs.
> >
> > Before sharing the script I tested it on two of my machines, but I had
> > anticipated that the error can be hard to reproduce. Will try to reduce
> > the reproducer...
>
> I've managed to reduce it to the following:
Thanks a lot for working on that!
> [reproducer]
>
> It fails for me as below:
> iteration 13 (jobs: 25)
> Sun Apr 13 05:31:47 AM UTC 2025
> iteration 14 (jobs: 67)
> Sun Apr 13 05:31:50 AM UTC 2025
> dropdb: error: database removal failed: ERROR: could not read blocks 0..0 in file "global/1213": Operation canceled
> 2025-04-13 05:31:58.930 UTC [1153451] LOG: could not read blocks 0..0 in file "global/1213": Operation canceled
> 2025-04-13 05:31:58.930 UTC [1153451] CONTEXT: completing I/O on behalf of process 1153456
> 2025-04-13 05:31:58.930 UTC [1153451] STATEMENT: DROP DATABASE db5;
> 2025-04-13 05:31:58.930 UTC [1153456] ERROR: could not read blocks 0..0 in file "global/1213": Operation canceled
> 2025-04-13 05:31:58.930 UTC [1153456] STATEMENT: DROP DATABASE db6;
> 2025-04-13 05:31:58.931 UTC [1034758] LOG: checkpoint complete: wrote 3
> buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0
> recycled; write=0.002 s, sync=0.001 s, total=0.002 s; sync files=0,
> longest=0.000 s, average=0.000 s; distance=18 kB, estimate=458931 kB;
> lsn=16/54589E08, redo lsn=16/54586F88
> 2025-04-13 05:31:58.931 UTC [1034758] LOG: checkpoint starting: immediate force wait
Unfortunately I'm several hundred iterations in without reproducing the
issue. I'm bad at statistics, but I think that makes it rather unlikely that I
will reproduce it without changing some aspect of my setup.
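A back-of-the-envelope sketch of that intuition. It assumes that iterations
fail independently with a fixed probability, and it guesses that probability
from the quoted log (first failure at iteration 14). Both are assumptions, not
measurements, and the function name is made up for illustration:

```python
def prob_no_failure(p: float, n: int) -> float:
    """Probability of zero failures in n independent iterations,
    each failing with probability p."""
    return (1.0 - p) ** n


# Assumption: roughly one failure per 14 iterations, guessed from the
# single run shown in the quoted log.
p_guess = 1 / 14

# "Several hundred iterations" is not an exact count, so try a few values.
for n in (100, 300, 500):
    print(f"n={n}: P(no failure) = {prob_no_failure(p_guess, n):.2e}")
```

Under those assumptions the chance of a clean run of several hundred
iterations is tiny, which points at a difference between the environments
rather than bad luck.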
Was this an assert-enabled build? Which compiler and optimization settings
did you use? Do you have huge pages configured (so that the default
huge_pages=try would end up using huge pages)?
So far I've been using a cassert-enabled build compiled with -O0, without
huge pages. After the current test run I'll switch to cassert + -O2.
> I reproduced this error on three different machines (all are running
> Ubuntu 24.04, two with kernel version 6.8, one with 6.11), with PGDATA
> located on tmpfs.
That's another variable to try - so far I've been trying this on 6.15.0-rc1
[1]. I guess I'll have to set up an Ubuntu 24.04 VM and try with that.
Greetings,
Andres Freund
[1] I wanted to play with io_uring changes that were recently merged, namely
support for readv/writev of "fixed" buffers. That avoids needing to pin/unpin
buffers while IO is ongoing, which turns out to be a noticeable bottleneck in
some workloads, particularly when using 1GB huge pages.