| From: | pf(at)pfortin(dot)com |
|---|---|
| Cc: | pgsql-general(at)lists(dot)postgresql(dot)org |
| Subject: | [SOLVED?] Re: Disk wait problem... not hardware... |
| Date: | 2023-10-29 13:21:46 |
| Message-ID: | 20231029092146.53351f7e@pfortin.com |
| Lists: | pgsql-general |
On Sat, 28 Oct 2023 18:34:50 -0400 Jim Mlodgenski wrote:
Looking like a GOLD star for Jim...
>On Fri, Oct 27, 2023 at 7:46 PM <pf(at)pfortin(dot)com> wrote:
>
>> Memory: 125.5 GiB of RAM
>>
>It looks like you have a large amount of memory allocated to the server
>
>But your plans are doing reads instead of pulling things from shared
>buffers
Assuming you are referring to "Buffers: shared hit=248 read=25022"?
>> explain (analyze, buffers) select count(*) from ncvhis_2016_12_03;
>>  Finalize Aggregate  (cost=404669.65..404669.66 rows=1 width=8) (actual time=844.158..847.309 rows=1 loops=1)
>>    Buffers: shared hit=248 read=25022
>>    ->  Gather  (cost=404669.43..404669.65 rows=2 width=8) (actual time=844.133..847.301 rows=3 loops=1)
>>          Workers Planned: 2
>>          Workers Launched: 2
>>          Buffers: shared hit=248 read=25022
>>          ->  Partial Aggregate  (cost=403669.43..403669.45 rows=1 width=8) (actual time=838.772..838.772 rows=1 loops=3)
>>                Buffers: shared hit=248 read=25022
>>                ->  Parallel Index Only Scan using ncvhis_2016_12_03_voted_party_cd_idx on ncvhis_2016_12_03  (cost=0.44..372735.05 rows=12373755 width=0) (actual time=18.277..592.473 rows=9900389 loops=3)
>>                      Heap Fetches: 0
>>                      Buffers: shared hit=248 read=25022
>>  Planning Time: 0.069 ms
>>  JIT:
>>    Functions: 8
>>    Options: Inlining false, Optimization false, Expressions true, Deforming true
>>    Timing: Generation 0.284 ms, Inlining 0.000 ms, Optimization 0.268 ms, Emission 3.590 ms, Total 4.143 ms
>>  Execution Time: 847.498 ms
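(Side note for the archives: a quick way to see how hard the cache is
being missed overall is the stock pg_stat_database view; a minimal
sketch, run in the affected database:)
  -- cumulative shared-buffer hits vs. disk reads since the statistics
  -- were last reset; a low ratio means many reads bypass shared_buffers
  SELECT datname,
         blks_hit,
         blks_read,
         round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS hit_ratio
    FROM pg_stat_database
   WHERE datname = current_database();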
>data/postgresql.conf:
>> max_connections = 100
>> shared_buffers = 128MB
I semi-misquoted: it was running with 128MB when first reported, then
512MB by the time I sent this message.
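(To avoid misquoting again: the value actually in effect can be checked
from any session.)
  -- reports the setting the running server is using right now
  SHOW shared_buffers;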
>It looks like you are running with the stock config for shared_buffers.
>With only 128MB dedicated for shared_buffers and such a big database,
>you'll be thrashing the cache pretty hard. With 125GB on the server, try
>upping shared_buffers to something more like 32GB.
WOW! Some tables now have a barely noticeable delay; on the whole, it's
looking MUCH better. Testing with a just-created copy[1] on the NVMe,
it's better, but it's slowly going back to longer delays. It looks like
count(*) may have easily-missed side effects (more below). My team
members were suggesting a possible motherboard issue, since the problem
happened on both the NVMe M.2 SSD and on a backup SATA-3 platter drive.
Your suggestion arrived just in time, as I was about to wipe out the SSD
and re-import everything fresh.
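For anyone repeating this, a minimal sketch of one way to apply the
change (editing postgresql.conf by hand works just as well; either way,
shared_buffers only takes effect at server start):
  -- writes the new value to postgresql.auto.conf (requires superuser);
  -- a full server restart is needed before it takes effect
  ALTER SYSTEM SET shared_buffers = '32GB';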
Another thing that bothers me: when I copied the DB with
  cp -a ./var ./var2   # same drive, same partition
replicating the DB onto free space on the same SSD, the reads were slow
(18-20KB/sec) while the writes went out as quick, large blocks; so I was
suspecting something with the SSD... System buffers..?
How autoload row count (or count(*)) can hurt:
The queries which triggered this issue are all the same (WB built-in):
SELECT count(*) FROM table; # WB = SQL-workbench/J
These are all static tables. Does PG maintain a table row count so as to
avoid having to count each time? WB is set up to:
* autoload table row count
* autoload table data (restricted with LIMIT)
Walking through these 8M & 33M row tables (3.8GB & 5.2GB respectively)
must be filling the shared_buffers and thrashing even 32GB...
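(Answering my own question above: as far as I can tell, PG does not keep
an exact live row count -- count(*) always has to scan, though an
index-only scan keeps that cheap -- but the planner's estimate is stored
in pg_class and is nearly free to read:)
  -- estimate maintained by VACUUM/ANALYZE, not an exact count; the
  -- table name is just one of the tables from this thread
  SELECT reltuples::bigint AS estimated_rows
  FROM pg_class
  WHERE relname = 'ncvhis_2016_12_03';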
Ah-ha! Turning off "autoload table row count" has "resolved" this issue.
I can now scroll through all the tables at high speed (hold down the
arrow key to walk through all the tables). Not very useful in itself;
but walking through the tables with only 128MB shared_buffers would have
been the ultimate buffer thrashing... ;p The same goes for walking
through the tables to collect their row counts (whether manually or
programmatically).
That said, the 32GB shared_buffers may now be masking NVMe drive issues...
Thanks!!!,
Pierre
Time and complexity to find the cause of an issue is inversely
proportional to the simplicity of the issue...