Quick Links

Increasing parallelism of queries while using file fdw and partitions

From:	Patrick Mulrooney <mulroony(at)gmail(dot)com>
To:	pgsql-performance(at)lists(dot)postgresql(dot)org
Subject:	Increasing parallelism of queries while using file fdw and partitions
Date:	2018-12-19 04:39:36
Message-ID:	aea30799-809c-f027-1a4b-0c03c7efa3f6@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Wondering if anyone had any thoughts on how to tweak my setup to get it
to read many files at once instead of one at a time when using file fdw
and partitions. We have a bunch of data tied up in files (each file > 4M
rows, 5,000+ files per year) that I would like to be able to query
directly using FDW. The files are genomic VCF format and I find that
vcf_fdw ( https://github.com/ergo70/vcf_fdw ) works really well to read
the data. We only want to be able to search the data as quickly as
possible, no updates / deletes / ...

I gave an example below of the basic setup and the output of explain
analyze. I get the same performance if I setup the table such that the
thousands of files end up in one non-partitioned table or setup each
file as it's own partition of the table.

I have tried increasing ( / decreasing ) the worker threads and workers,
but don't see any change in the number of files open at any given time.
I tried reducing the cost of parallel queries to force them to run, but
can't get them to kick in.

Any ideas or anything I can try?

Thanks!

Pat

PostgreSQL: PostgreSQL 10.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28), 64-bit
Multicorn: 1.3.5
VCF_FDW ( https://github.com/ergo70/vcf_fdw ) : 1.0.0

CREATE DATABASE variants;

CREATE EXTENSION multicorn;

CREATE SERVER multicorn_vcf FOREIGN DATA WRAPPER multicorn OPTIONS (wrapper 'vcf_fdw.VCFForeignDataWrapper');

CREATE SCHEMA vcf;

CREATE TABLE vcf.variants ( ..., species text, ... ) PARTITION BY LIST ( species );

CREATE FOREIGN TABLE vcf.human ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'human', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.human FOR VALUES IN ( 'human' );

CREATE FOREIGN TABLE vcf.dog ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'dog', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.dog FOR VALUES IN ( 'dog' );

CREATE FOREIGN TABLE vcf.cat ( ... ) SERVER multicorn_vcf OPTIONS (basedir '/path', species 'cat', suffix '.vcf.gz');
ALTER TABLE vcf.variants ATTACH PARTITION vcf.cat FOR VALUES IN ( 'cat' );

* My real data repeats this 1000+ more times

EXPLAIN ( ANALYZE, BUFFERS ) SELECT * FROM vcf.variants WHERE chrom = '1' AND pos = 10120 LIMIT 1000;

On my real data I get the following results:
--------------------------
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=20.00..352020.00 rows=1000 width=347) (actual time=445.548..101709.307 rows=20 loops=1)
-> Append (cost=20.00..3555200000.00 rows=10100000 width=347) (actual time=445.547..101709.285 rows=20 loops=1)
-> Foreign Scan on dog (cost=20.00..3520000.00 rows=10000 width=352) (actual time=198.653..198.654 rows=0 loops=1)
Filter: ((chrom = '1'::text) AND (pos = 10120))
-> Foreign Scan on cat (cost=20.00..3520000.00 rows=10000 width=352) (actual time=111.840..111.840 rows=0 loops=1)
Filter: ((chrom = '1'::text) AND (pos = 10120))
-> Foreign Scan on human (cost=20.00..3520000.00 rows=10000 width=352) (actual time=135.050..138.534 rows=1 loops=1)
Filter: ((chrom = '1'::text) AND (pos = 10120))
... repeats many more times for each partition
Planning time: 613.815 ms
Execution time: 101873.880 ms
(2024 rows)

Responses

Re: Increasing parallelism of queries while using file fdw and partitions at 2018-12-19 06:51:49 from Justin Pryzby

Browse pgsql-performance by date

	From	Date	Subject
Next Message	anand086	2018-12-19 06:04:40	SQL Perfomance during autovacuum
Previous Message	Alvaro Herrera	2018-12-18 16:40:04	Re: Why Postgres doesn't use TID scan?