Many DataFileRead - IO waits

From: Ben Snaidero <bensnaidero(at)geotab(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Many DataFileRead - IO waits
Date: 2020-02-27 16:33:25
Message-ID: CAEPE5bNNkFhoPRufSTrPZVA-W+bbu+LgXUQ1OzD9VwMowV5=VA@mail.gmail.com
Lists: pgsql-performance

I'm running into a strange issue that just popped up on a few servers in my
environment, and I was wondering if the community had any insight into what
could be causing it.

First, a bit of background. I am running Postgres 10.11 on Windows (but have
seen a similar issue on a server running 11.6).
Windows version:
Major  Minor  Build  Revision
-----  -----  -----  --------
10     0      14393  0

I have a query that was averaging ~2ms and suddenly jumped to ~25ms on
average. This query is called millions of times per day, and there were cases
of it taking 20-30 seconds. Below is the explain analyze of one such example.
When we saw this issue the server was under some CPU pressure, but even with
that I would not expect it to get as slow as shown below: we are using SSDs,
and none of the Windows disk counters (IOPS, queue length) show any value
that would be of concern.

explain (analyze,buffers) SELECT
tabledata.uuid_id,tabledata.int_id,tabledata.timestamp_date,tabledata.int_otherid,tabledata.float_value,tabledata.int_otherid2,tabledata.int_otherid3,tabledata.int_rowver
FROM tabledata WHERE timestamp_date <= $1 AND int_otherid3 IN
($2,$3,$4,$5,$6,$7) AND tabledata.int_id=$8 ORDER BY timestamp_date DESC
LIMIT 1

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.71..139.67 rows=1 width=84) (actual
time=17719.076..17719.077 rows=1 loops=1)
Buffers: shared hit=12102 read=13259 written=111
-> Index Scan Backward using
ix_tabledata_intid_timestampdate_intotherid3_intotherid2 on tabledata
(cost=0.71..2112828.54 rows=15204 width=84) (actual
time=17719.056..17719.057 rows=1 loops=1)
Index Cond: ((int_id = 8149) AND (timestamp_date <=
'2020-02-24 03:05:00.013'::timestamp without time zone))
Filter: (int_otherid3 = ANY
('{3ad2b707-a068-42e8-b0f2-6c8570953760,4e1b1bfa-34e1-48df-8cf8-2b59caf076e2,00d394dd-c2f4-4f3a-a8d4-dc208dafa686,baa904a6-8302-4fa3-b8ae-8adce8fe4306,3c99d61b-21a1-42ea-92a8-3cc88d79f3f1,befe0f8b-5911-47b3-bfae-faa9f8b09d08}'::uuid[]))
Rows Removed by Filter: 91686
Buffers: shared hit=12102 read=13259 written=111
Planning time: 203.153 ms
Execution time: 17719.200 ms
(9 rows)
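
(For what it's worth, the plan above doesn't break the time down into read
waits. A rough sketch of how that could be captured, assuming superuser
access since track_io_timing is a superuser-settable parameter, and using the
literal values from the plan above in place of the bind parameters:)

SET track_io_timing = on;  -- adds an "I/O Timings" line to the BUFFERS output

EXPLAIN (ANALYZE, BUFFERS)
SELECT uuid_id, int_id, timestamp_date, int_otherid, float_value,
       int_otherid2, int_otherid3, int_rowver
FROM tabledata
WHERE timestamp_date <= '2020-02-24 03:05:00.013'
  AND int_otherid3 IN ('3ad2b707-a068-42e8-b0f2-6c8570953760',
                       '4e1b1bfa-34e1-48df-8cf8-2b59caf076e2',
                       '00d394dd-c2f4-4f3a-a8d4-dc208dafa686',
                       'baa904a6-8302-4fa3-b8ae-8adce8fe4306',
                       '3c99d61b-21a1-42ea-92a8-3cc88d79f3f1',
                       'befe0f8b-5911-47b3-bfae-faa9f8b09d08')
  AND int_id = 8149
ORDER BY timestamp_date DESC
LIMIT 1;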

If I look at pg_stat_activity while the query is running, all of the calls to
this query show the same wait event:
wait_event - DataFileRead
wait_event_type - IO
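
(For reference, this is roughly the kind of check I mean; the columns are the
stock pg_stat_activity ones, and the LIKE filter is just an illustrative way
to isolate this particular statement:)

SELECT pid,
       now() - query_start AS runtime,
       wait_event_type,
       wait_event,
       state,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND query LIKE '%FROM tabledata WHERE timestamp_date%'
ORDER BY runtime DESC;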

We took a PerfView trace during the issue; below are the call stacks from a
process running this query (two call paths are shown).
---------------------------------------------------------------

Name
ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!MiChargeWsles
|+ ntoskrnl!MiObtainSystemCacheView
||+ ntoskrnl!MmMapViewInSystemCache
|| + ntoskrnl!CcGetVacbMiss
|| + ntoskrnl!CcGetVirtualAddress
|| + ntoskrnl!CcMapAndCopyFromCache
|| + ntoskrnl!CcCopyReadEx
|| + ntfs!NtfsCopyReadA
|| |+ fltmgr!FltpPerformFastIoCall
|| | + fltmgr!FltpPassThroughFastIo
|| | + fltmgr!FltpFastIoRead
|| | + ntoskrnl!NtReadFile
|| | + ntdll!NtReadFile
|| | + kernelbase!ReadFile
|| | + msvcr120!_read_nolock
|| | + msvcr120!_read
|| | + postgres!PathNameOpenFile
|| | + postgres!??mdclose
|| | + postgres!ScheduleBufferTagForWriteback
|| | + postgres!InitBufTable
|| | + postgres!??PrefetchBuffer
|| | |+ postgres!index_getnext_tid
|| | | + postgres!index_fetch_heap
|| | | + postgres!ExecIndexEvalRuntimeKeys
|| | | + postgres!ExecAssignScanProjectionInfoWithVarno
|| | | + postgres!tupledesc_match
|| | | + postgres!recompute_limits
|| | | + postgres!CheckValidRowMarkRel
|| | | + postgres!list_length
|| | | + pg_stat_statements!pgss_ExecutorRun
|| | | + postgres!PortalRunFetch
|| | | + postgres!PortalStart
|| | | + postgres!exec_bind_message
|| | | + postgres!PostgresMain
|| | | + postgres!BackendInitialize
|| | | + postgres!ClosePostmasterPorts
|| | | + postgres!main
|| | | + postgres!_onexit
|| | | + kernel32!BaseThreadInitThunk
|| | | + ntdll!RtlUserThreadStart

Name
ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!MiChargeWsles
|+ ntoskrnl!MiReleaseSystemCacheView
| + ntoskrnl!MmUnmapViewInSystemCache
| + ntoskrnl!CcUnmapVacb
| + ntoskrnl!CcUnmapVacbArray
| + ntoskrnl!CcGetVirtualAddress
| + ntoskrnl!CcMapAndCopyFromCache
| + ntoskrnl!CcCopyReadEx
| + ntfs!NtfsCopyReadA
| |+ fltmgr!FltpPerformFastIoCall
| | + fltmgr!FltpPassThroughFastIo
| | + fltmgr!FltpFastIoRead
| | + ntoskrnl!NtReadFile
| | + ntdll!NtReadFile
| | |+ kernelbase!ReadFile
| | | + msvcr120!_read_nolock
| | | + msvcr120!_read
| | | + postgres!PathNameOpenFile
| | | + postgres!??mdclose
| | | + postgres!ScheduleBufferTagForWriteback
| | | + postgres!InitBufTable
| | | + postgres!??PrefetchBuffer
| | | |+ postgres!index_getnext_tid
| | | | + postgres!index_fetch_heap
| | | | + postgres!ExecIndexEvalRuntimeKeys
| | | | + postgres!ExecAssignScanProjectionInfoWithVarno
| | | | + postgres!tupledesc_match
| | | | + postgres!recompute_limits
| | | | + postgres!CheckValidRowMarkRel
| | | | + postgres!list_length
| | | | + pg_stat_statements!pgss_ExecutorRun
| | | | + postgres!PortalRunFetch
| | | | + postgres!PortalStart
| | | | + postgres!exec_bind_message
| | | | + postgres!PostgresMain
| | | | + postgres!BackendInitialize
| | | | + postgres!ClosePostmasterPorts
| | | | + postgres!main
| | | | + postgres!_onexit
| | | | + kernel32!BaseThreadInitThunk
| | | | + ntdll!RtlUserThreadStart

If I look at it top-down (i.e. starting from when the process began, where
was the most time spent?) I get:
Name
ROOT
+ Process64 postgres (16668) Args: "--forkbackend" "43216"
+ Thread (16672) CPU=9399ms
|+ ntdll!RtlUserThreadStart
||+ kernel32!BaseThreadInitThunk
|| + postgres!_onexit
|| + postgres!main
|| + postgres!ClosePostmasterPorts
|| + postgres!BackendInitialize
|| + postgres!PostgresMain
|| + postgres!exec_bind_message
|| + postgres!PortalStart
|| + postgres!PortalRunFetch
|| + pg_stat_statements!pgss_ExecutorRun
|| + postgres!list_length
|| + postgres!CheckValidRowMarkRel
|| + postgres!recompute_limits
|| + postgres!tupledesc_match
|| + postgres!ExecAssignScanProjectionInfoWithVarno
|| |+ postgres!ExecIndexEvalRuntimeKeys
|| ||+ postgres!index_fetch_heap
|| || + postgres!index_getnext_tid
|| || |+ postgres!??PrefetchBuffer
|| || ||+ postgres!InitBufTable
|| || |||+ postgres!ScheduleBufferTagForWriteback
|| || ||||+ postgres!??mdclose
|| || |||||+ postgres!PathNameOpenFile
|| || ||||||+ msvcr120!_read
|| || |||||| + msvcr120!_read_nolock
|| || |||||| |+ kernelbase!ReadFile
|| || |||||| ||+ ntdll!NtReadFile
|| || |||||| || + ntoskrnl!NtReadFile
|| || |||||| || |+ fltmgr!FltpFastIoRead
|| || |||||| || ||+ fltmgr!FltpPassThroughFastIo
|| || |||||| || |||+ fltmgr!FltpPerformFastIoCall
|| || |||||| || ||||+ ntfs!NtfsCopyReadA
|| || |||||| || |||| + ntoskrnl!CcCopyReadEx
|| || |||||| || |||| |+ ntoskrnl!CcMapAndCopyFromCache
|| || |||||| || |||| | + ntoskrnl!CcGetVirtualAddress
|| || |||||| || |||| | |+ ntoskrnl!CcUnmapVacbArray
|| || |||||| || |||| | ||+ ntoskrnl!CcUnmapVacb
|| || |||||| || |||| | |||+ ntoskrnl!MmUnmapViewInSystemCache
|| || |||||| || |||| | ||| +
ntoskrnl!ExAcquireSpinLockExclusive
|| || |||||| || |||| | ||| |+
ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire

Also, from this same PerfView trace, the following looks to be the
checkpointer process waiting on the same lock.

Name
ntoskrnl!ExpWaitForSpinLockExclusiveAndAcquire
+ ntoskrnl!ExAcquireSpinLockExclusive
+ ntoskrnl!MiAcquireProperVm
+ ntoskrnl!MiTrimSharedPageFromViews
+ ntoskrnl!MiTrimSection
+ ntoskrnl!MmTrimSection
+ ntoskrnl!CcCoherencyFlushAndPurgeCache
+ ntfs!NtfsFlushUserStream
+ ntfs!NtfsPerformOptimisticFlush
|+ ntfs!NtfsCommonFlushBuffers
| + ntfs!NtfsCommonFlushBuffersCallout
| + ntoskrnl!KeExpandKernelStackAndCalloutInternal
| + ntfs!NtfsCommonFlushBuffersOnNewStack
| + ntfs!NtfsFsdFlushBuffers
| + fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted
| + fltmgr!FltpDispatch
| + ntoskrnl!IopSynchronousServiceTail
| + ntoskrnl!NtFlushBuffersFileEx
| + ntoskrnl!NtFlushBuffersFile
| + ntdll!NtFlushBuffersFile
| |+ kernelbase!FlushFileBuffers
| | + msvcr120!_commit
| | + postgres!FileClose
| | + postgres!mdtruncate
| | + postgres!??ReleaseBuffer
| | + postgres!CreateCheckPoint
| | + postgres!CheckpointerMain
| | + postgres!AuxiliaryProcessMain
| | + postgres!MaxLivePostmasterChildren
| | + postgres!main
| | + postgres!_onexit
| | + kernel32!BaseThreadInitThunk
| | + ntdll!RtlUserThreadStart
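
(If it matters, checkpoint frequency and write/sync times are visible in
pg_stat_bgwriter; a sketch of that kind of check, using only the stock
columns of that view:)

SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,  -- total ms spent writing files during checkpoints
       checkpoint_sync_time,   -- total ms spent syncing files during checkpoints
       buffers_checkpoint,     -- buffers written by the checkpointer
       buffers_backend,        -- buffers written directly by backends
       stats_reset
FROM pg_stat_bgwriter;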

To get by, we increased shared_buffers from 500MB to 50GB on this server (and
to 10GB on another server), but in my opinion this is just masking the issue.
Has anyone in the community seen contention on this lock before, or does
anyone have any other insights as to why we would suddenly run into this?
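
(The change itself was nothing exotic; roughly the following, keeping in mind
that shared_buffers only takes effect after a restart of the service. ALTER
SYSTEM is just one way to do it; editing postgresql.conf works the same.)

ALTER SYSTEM SET shared_buffers = '50GB';  -- was 500MB (10GB on the other server)
-- restart the PostgreSQL service, then verify:
SHOW shared_buffers;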

Ben Snaidero
*Geotab*
Senior Database Specialist
Direct +1 (289) 230-7749
Toll-free +1 (877) 436-8221
Visit www.geotab.com
Twitter <https://twitter.com/geotab> | Facebook
<https://www.facebook.com/Geotab> | YouTube
<https://www.youtube.com/user/MyGeotab> | LinkedIn
<https://www.linkedin.com/company/geotab/>

