How to solve my slow disk i/o throughput during index scan

From: "FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT)" <simon(dot)freyburger(at)sncf(dot)fr>
To: "pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: How to solve my slow disk i/o throughput during index scan
Date: 2024-07-04 13:25:44
Message-ID: MR1P264MB34732DC3639E0B6656B43DFD9DDE2@MR1P264MB3473.FRAP264.PROD.OUTLOOK.COM
Lists: pgsql-performance

Hello everyone,

I have a database of sufficient size that it does not fit entirely in RAM, including indexes that also exceed RAM capacity. Most of my big fact tables follow the same pattern:

* 3-5 id columns.
* Partitioned along one ID (ID_JOUR).
* 10-100 GB partitions.
* 1-5 GB primary key indexes for each partition.
* 10-120 columns, all of them int or float8. Some columns contain a lot of NaNs, because they were not all created and added to the daily processing at the same date.
* Queries request a small subset (100k-1M lines) from these tables, always filtering on the primary key (a WHERE clause on each ID column; some filters go through dimension tables + joins, but I tried applying them directly and it did not solve my problem). See the example query sketched after this list.
* See the annex for the DDL of one of those tables.
* See the annex for the sizes of my different tables.
* Stable data (I am focusing on past data; few or no insertions, with VACUUM + CLUSTER run after those rare modifications).
* Ran REINDEX, VACUUM, CLUSTER, ANALYZE on those tables.
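
As an illustration of the query shape, here is a sketch (all filter values below are made up, and the real dimension-table filters use dimension attributes rather than raw IDs):

SELECT f."ID_TRAIN", f."ID_JOUR", f."ID_OD", f."JX", f."RES", f."REV"
FROM public."F_TDOJ_HIST_1" f
JOIN public."D_JOUR" j ON j."ID_JOUR" = f."ID_JOUR"   -- dimension filters simplified to IDs here
JOIN public."D_OD"   o ON o."ID_OD"   = f."ID_OD"
WHERE f."ID_JOUR" BETWEEN 1000 AND 1030               -- restricts to a few partitions
  AND f."JX" BETWEEN -90 AND 0
  AND f."ID_OD" IN (1, 2, 3)
  AND f."ID_TRAIN" BETWEEN 1 AND 500;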

When performing queries, I observe significant differences in processing time depending on whether the index needs to be read from disk or is already loaded in RAM. I think I have confirmed, using EXPLAIN ANALYZE, that the issue stems from index scans. See for example:

* https://explain.dalibo.com/plan/2c85077gagh98a17: very slow because some part of the index is "read" (from disk) and not "hit".
* https://explain.dalibo.com/plan/gfd20f8cadaa5261#plan/node/8: 1M lines "instantaneously" (2 sec) retrieved, when the index is in RAM.
* https://explain.dalibo.com/plan/1b394hc5a26cf747, where I added SET track_io_timing = TRUE (see the sketch after this list).
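
For reference, this is roughly how I collect these plans; track_io_timing can be enabled per session, and the query below is only a placeholder with made-up filter values:

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM public."F_TDOJ_HIST_1"
WHERE "ID_JOUR" BETWEEN 1000 AND 1030
  AND "JX" = -7;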

I measured the speed of loading my index into RAM during a query, which is approximately 2 to 3 MB/s. However, my infrastructure theoretically supports I/O speed of around 900 MB/s.
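
For what it's worth, the throughput can be estimated from the "shared read" block count and the "I/O read time" reported in an EXPLAIN (ANALYZE, BUFFERS) plan; the numbers below are illustrative, assuming the default 8 kB block size:

-- ~250,000 blocks read from disk during ~800 s of I/O time => ~2.5 MB/s
SELECT 250000 * 8192 / (800000 / 1000.0) / 1e6 AS approx_read_mb_per_s;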

On some older partitions, I was sometimes able to get better throughput (see e.g. https://explain.dalibo.com/plan/4db409d1d6d95d4b).

I do not understand why reading my index from disk is so slow. I suspect that the index is not read sequentially, but I do not know how the PostgreSQL internals really behave, so this is just a supposition.
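
A back-of-the-envelope check that seems consistent with this supposition (assuming the default 8 kB block size): 2-3 MB/s corresponds to only a few hundred pages per second, which would fit one synchronous random read at a time at a few milliseconds of latency each, rather than a bandwidth-limited sequential read.

SELECT 2.5 * 1000000 / 8192 AS approx_pages_per_second;   -- ~305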

My question is: what can I change to get a better index reading speed?

What I already tried:

* Setting random_page_cost to a prohibitive value (10000000) to force a bitmap heap scan, because those can be parallelized. This did not work: the optimizer still does an index scan on my fact table.
* Changing effective_io_concurrency, max_parallel_workers_per_gather and work_mem to much higher values (see the sketch after this list).
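
Roughly, the session-level changes I experimented with look like the following; the exact values here are only illustrative:

SET random_page_cost = 10000000;           -- try to push the planner towards a bitmap heap scan
SET effective_io_concurrency = 1000;       -- instance default is 200
SET max_parallel_workers_per_gather = 8;   -- instance default is 4
SET work_mem = '4GB';                      -- instance default is 64MB
-- then rerun the query with EXPLAIN (ANALYZE, BUFFERS) and compare the plans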

Thank you in advance for your help; any idea or advice is greatly appreciated!

Simon F.

ANNEX:

Some more details about my environment:

* I am working on Azure. My hardware is an E16s_v3 (see https://learn.microsoft.com/en-us/azure/virtual-machines/ev3-esv3-series) with P80 disks (https://azure.microsoft.com/en-us/pricing/details/managed-disks/).
* I was unable to run the hardware speed test, because I do not have sudo rights, but through VACUUM execution, the
* PostgreSQL version: PostgreSQL 12.15 (Ubuntu 12.15-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
* OS version:

==> /etc/lsb-release <==
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04 LTS"

==> /etc/os-release <==
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04 LTS"
VERSION_ID="20.04"
HOME_URL=https://www.ubuntu.com/
SUPPORT_URL=https://help.ubuntu.com/
BUG_REPORT_URL=https://bugs.launchpad.net/ubuntu/
PRIVACY_POLICY_URL=https://www.ubuntu.com/legal/terms-and-policies/privacy-policy
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

==> /etc/SNCF_release <==
HOSTNAME : uzcordbr05
OS_NAME : UBUNTU 20.04
OS_DESCRIPTION : Ubuntu 20.04 LTS
OS_RELEASE : 20.04
OS_CODENAME : focal
CMDB_ENV : Recette
AENV : hprod
REPO_IT : https://repos.it.sncf.fr/repos/os/ubuntu/bugfix/dists/focal-bugfix/Release

* Default query settings:
* effective_cache_size = '96GB',
* effective_io_concurrency = '200',
* max_parallel_workers_per_gather = '4',
* random_page_cost = '1.1',
* search_path = 'public',
* work_mem = '64MB' --> I tried raising work_mem to 4GB; it did not change anything.

* Postgres custom configuration settings

name |current_setting |source |
--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
application_name |DBeaver 21.2.5 - SQLEditor <Script-19.sql> |session |
archive_command |(disabled) |configuration file |
archive_mode |off |configuration file |
archive_timeout |2h |configuration file |
autovacuum_analyze_scale_factor |0.05 |configuration file |
autovacuum_analyze_threshold |50 |configuration file |
autovacuum_max_workers |6 |configuration file |
autovacuum_naptime |15s |configuration file |
autovacuum_vacuum_cost_delay |10ms |configuration file |
autovacuum_vacuum_cost_limit |-1 |configuration file |
autovacuum_vacuum_scale_factor |0.01 |configuration file |
autovacuum_vacuum_threshold |50 |configuration file |
autovacuum_work_mem |512MB |configuration file |
checkpoint_completion_target |0.9 |configuration file |
client_encoding |UTF8 |client |
cluster_name |irdbr010 |configuration file |
DateStyle |ISO, DMY |client |
default_text_search_config |pg_catalog.french |configuration file |
effective_cache_size |96GB |configuration file |
effective_io_concurrency |200 |database |
extra_float_digits |3 |session |
lc_messages |C |configuration file |
lc_monetary |fr_FR.UTF8 |configuration file |
lc_numeric |fr_FR.UTF8 |configuration file |
lc_time |fr_FR.UTF8 |configuration file |
listen_addresses |* |configuration file |
log_autovacuum_min_duration |0 |configuration file |
log_checkpoints |on |configuration file |
log_connections |on |configuration file |
log_disconnections |off |configuration file |
log_file_mode |0640 |configuration file |
log_line_prefix |%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h |configuration file |
log_lock_waits |on |configuration file |
log_min_duration_statement |5s |configuration file |
log_min_error_statement |warning |configuration file |
log_statement |ddl |configuration file |
log_temp_files |0 |configuration file |
log_timezone |Europe/Paris |configuration file |
logging_collector |on |configuration file |
maintenance_work_mem |4GB |configuration file |
max_connections |500 |configuration file |
max_locks_per_transaction |1024 |configuration file |
max_parallel_workers_per_gather |4 |configuration file |
max_stack_depth |2MB |environment variable|
max_wal_size |4GB |configuration file |
min_wal_size |128MB |configuration file |
password_encryption |scram-sha-256 |configuration file |
pg_stat_statements.max |15000 |configuration file |
pg_stat_statements.save |off |configuration file |
pg_stat_statements.track |all |configuration file |
pg_stat_statements.track_utility|off |configuration file |
port |5433 |configuration file |
random_page_cost |1.1 |database |
restore_command |/home/postgres/admin/bin/pgbackrest --config=/etc/pgbackrest.conf --pg1-path=/home/postgres/data/irdbr010/systeme --stanza=rdb_backup archive-get %f "%p"|configuration file |
search_path |public, public, temporaire, dtm_2019 |session |
shared_buffers |32GB |configuration file |
ssl |on |configuration file |
statement_timeout |0 |user |
tcp_keepalives_count |10 |configuration file |
tcp_keepalives_idle |900 |configuration file |
tcp_keepalives_interval |75 |configuration file |
TimeZone |Europe/Paris |client |
unix_socket_group |postgres |configuration file |
unix_socket_permissions |0700 |configuration file |
wal_buffers |16MB |configuration file |
work_mem |64MB |configuration file |

DDL for one of my tables:

-- public."F_TDOJ_HIST_1" definition

-- Drop table

-- DROP TABLE public."F_TDOJ_HIST_1";

CREATE TABLE public."F_TDOJ_HIST_1" (
"ID_TRAIN" int4 NOT NULL,
"ID_JOUR" int4 NOT NULL,
"ID_OD" int4 NOT NULL,
"JX" int4 NOT NULL,
"RES" int4 NULL,
"REV" float8 NULL,
"OFFRE" int4 NULL,
"CC_OUV" int4 NULL,
"GENV_NUM" int8 NULL,
"GENV_DEN" int8 NULL,
"GENR_NUM" int8 NULL,
"GENR_DEN" int8 NULL,
"GENH_NUM" int8 NULL,
"GENH_DEN" int8 NULL,
"RES_CC0" int4 NULL,
"RES_CC1" int4 NULL,
"RES_CC2" int4 NULL,
"RES_CC3" int4 NULL,
"RES_CC4" int4 NULL,
"RES_CC5" int4 NULL,
"RES_CC6" int4 NULL,
"RES_CC7" int4 NULL,
"RES_CC8" int4 NULL,
"RES_CC9" int4 NULL,
"RES_CC10" int4 NULL,
"RES_CC11" int4 NULL,
"RES_CC12" int4 NULL,
"RES_CC13" int4 NULL,
"RES_CC14" int4 NULL,
"RES_CC15" int4 NULL,
"RES_CC16" int4 NULL,
"RES_CC17" int4 NULL,
"RES_CC18" int4 NULL,
"RES_CC19" int4 NULL,
"RES_CC20" int4 NULL,
"AUT_CC0" int4 NULL,
"AUT_CC1" int4 NULL,
"AUT_CC2" int4 NULL,
"AUT_CC3" int4 NULL,
"AUT_CC4" int4 NULL,
"AUT_CC5" int4 NULL,
"AUT_CC6" int4 NULL,
"AUT_CC7" int4 NULL,
"AUT_CC8" int4 NULL,
"AUT_CC9" int4 NULL,
"AUT_CC10" int4 NULL,
"AUT_CC11" int4 NULL,
"AUT_CC12" int4 NULL,
"AUT_CC13" int4 NULL,
"AUT_CC14" int4 NULL,
"AUT_CC15" int4 NULL,
"AUT_CC16" int4 NULL,
"AUT_CC17" int4 NULL,
"AUT_CC18" int4 NULL,
"AUT_CC19" int4 NULL,
"AUT_CC20" int4 NULL,
"DSP_CC0" int4 NULL,
"DSP_CC1" int4 NULL,
"DSP_CC2" int4 NULL,
"DSP_CC3" int4 NULL,
"DSP_CC4" int4 NULL,
"DSP_CC5" int4 NULL,
"DSP_CC6" int4 NULL,
"DSP_CC7" int4 NULL,
"DSP_CC8" int4 NULL,
"DSP_CC9" int4 NULL,
"DSP_CC10" int4 NULL,
"DSP_CC11" int4 NULL,
"DSP_CC12" int4 NULL,
"DSP_CC13" int4 NULL,
"DSP_CC14" int4 NULL,
"DSP_CC15" int4 NULL,
"DSP_CC16" int4 NULL,
"DSP_CC17" int4 NULL,
"DSP_CC18" int4 NULL,
"DSP_CC19" int4 NULL,
"DSP_CC20" int4 NULL,
"REV_CC0" float8 NULL,
"REV_CC1" float8 NULL,
"REV_CC2" float8 NULL,
"REV_CC3" float8 NULL,
"REV_CC4" float8 NULL,
"REV_CC5" float8 NULL,
"REV_CC6" float8 NULL,
"REV_CC7" float8 NULL,
"REV_CC8" float8 NULL,
"REV_CC9" float8 NULL,
"REV_CC10" float8 NULL,
"REV_CC11" float8 NULL,
"REV_CC12" float8 NULL,
"REV_CC13" float8 NULL,
"REV_CC14" float8 NULL,
"REV_CC15" float8 NULL,
"REV_CC16" float8 NULL,
"REV_CC17" float8 NULL,
"REV_CC18" float8 NULL,
"REV_CC19" float8 NULL,
"REV_CC20" float8 NULL,
"RES_CHD" int4 NULL,
"REV_CHD" float8 NULL,
"RES_PRO" int4 NULL,
"REV_PRO" float8 NULL,
"RES_SOC" int4 NULL,
"REV_SOC" float8 NULL,
"RES_TMX" int4 NULL,
"REV_TMX" float8 NULL,
"RES_MAX" int4 NULL,
"RES_GRP" int4 NULL,
"REV_GRP" float8 NULL,
"PREV_RES" int4 NULL,
"PREV_REV" float8 NULL,
"OPTIM_RES" int4 NULL,
"OPTIM_REV" float8 NULL,
"RES_FFX" int4 NULL,
"REV_FFX" float8 NULL,
"RES_SFE" int4 NULL,
"REV_SFE" float8 NULL,
"RES_SFN" int4 NULL,
"REV_SFN" float8 NULL,
"RES_NFE" int4 NULL,
"REV_NFE" float8 NULL,
"RES_NFN" int4 NULL,
"REV_NFN" float8 NULL,
"RES_ABO" int4 NULL,
"REV_ABO" float8 NULL,
"RES_AGN" int4 NULL,
"REV_AGN" float8 NULL,
"RES_BPR" int4 NULL,
"REV_BPR" float8 NULL,
"RES_LIB" int4 NULL,
"REV_LIB" float8 NULL,
"RES_FFN" int4 NULL,
"REV_FFN" float8 NULL,
"RES_PRI" int4 NULL,
"REV_PRI" float8 NULL,
CONSTRAINT "F_TDOJ_HIST_1_OLDP_pkey" PRIMARY KEY ("ID_TRAIN", "ID_JOUR", "ID_OD", "JX")
)
PARTITION BY RANGE ("ID_JOUR");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_JOUR_JX_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_JOUR", "JX");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_JOUR_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_JOUR");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_OD_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_OD");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_ID_TRAIN_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("ID_TRAIN");
CREATE INDEX "F_TDOJ_HIST_1_OLDP_JX_idx" ON ONLY public."F_TDOJ_HIST_1" USING btree ("JX");

-- public."F_TDOJ_HIST_1" foreign keys

ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_JOUR_fkey" FOREIGN KEY ("ID_JOUR") REFERENCES public."D_JOUR"("ID_JOUR");
ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_OD_fkey" FOREIGN KEY ("ID_OD") REFERENCES public."D_OD"("ID_OD");
ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_ID_TRAIN_fkey" FOREIGN KEY ("ID_TRAIN") REFERENCES public."D_TRAIN"("ID_TRAIN");
ALTER TABLE public."F_TDOJ_HIST_1" ADD CONSTRAINT "F_TDOJ_HIST_1_OLDP_JX_pkey" FOREIGN KEY ("JX") REFERENCES public."D_JX"("JX");

Result of:
SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname in (
'F_TDLJ_HIST_1', 'F_TDLJ_HIST_2', 'F_TDLJ_HIST', 'F_TDOJ_HIST_1', 'F_TDOJ_HIST_2', 'F_TDOJ_HIST'
);

relpages|reltuples|relname |relallvisible|relkind|relnatts|relhassubclass|reloptions|pg_table_size|
--------+---------+-------------+-------------+-------+--------+--------------+----------+-------------+
0| 0.0|F_TDLJ_HIST | 0|p | 47|true |NULL | 0|
95806| 442969.0|F_TDLJ_HIST | 95806|r | 47|false |NULL | 785080320|
0| 0.0|F_TDLJ_HIST_1| 0|p | 129|true |NULL | 0|
197458| 730226.0|F_TDLJ_HIST_1| 157954|r | 129|false |NULL | 1618059264|
0| 0.0|F_TDLJ_HIST_2| 0|p | 159|true |NULL | 0|
278359| 441524.0|F_TDLJ_HIST_2| 278359|r | 159|false |NULL | 2280972288|
0| 0.0|F_TDOJ_HIST | 0|p | 104|true |NULL | 0|
311913|1424975.0|F_TDOJ_HIST | 311913|r | 56|false |NULL | 2555928576|
0| 0.0|F_TDOJ_HIST_1| 0|p | 135|true |NULL | 0|
682522|1241940.0|F_TDOJ_HIST_1| 682522|r | 135|false |NULL | 5592793088|
0| 0.0|F_TDOJ_HIST_2| 0|p | 163|true |NULL | 0|
661324|1397598.0|F_TDOJ_HIST_2| 661324|r | 163|false |NULL | 5419098112|
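
For completeness, the per-partition primary key index sizes mentioned above can be checked with something along these lines (assuming the partition names contain the parent table name; the LIKE pattern is only an example):

SELECT i.indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM pg_index i
WHERE i.indrelid::regclass::text LIKE '%F_TDOJ_HIST_1%';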

