query not scaling

From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: query not scaling
Date: 2017-10-25 20:59:27
Message-ID: 6f7d4fd4-aed2-2fae-f551-d0f67196da33@gmail.com
Lists: pgsql-general

I have a query I cannot tame and I'm wondering if there's an alternative
to the "between" clause I'm using. Perhaps a custom type could do
better? I've tried the "<@" operator (sketched below) and that changes
the query plan significantly, but the execution cost/time is not improved.
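
For reference, the "<@" variant was roughly of this shape (a sketch
only; I'm constructing an int4range inline here, and suppliedMarkersetId
/ suppliedPeopleId stand in for the function parameters, as in the query
further down):

-- containment test against an inclusive int4range instead of "between"
select m.id as mkrid
     , min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval
from marker m
join segment s
  on m.basepos <@ int4range(s.startbase, s.endbase, '[]')
 and m.chrom = 1
 and s.chrom = 1
 and s.markerset_id = suppliedMarkersetId
join probandset r
  on s.probandset_id = r.id
 and r.people_id = suppliedPeopleId
group by m.id;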

Any suggestions or pointers are much appreciated.

Environment: a virtual CentOS Linux release 7.4.1708 (Core) machine with
4 cores (2.3GHz) and 8GB RAM, running PostgreSQL 10.0 (beta3) with
shared_buffers = 1GB and work_mem = 2GB.

Domain (TL;DR): a "segment" is defined by a particular set of people
(probandset.id) plus a subset of markers (markerset.id, startmarker,
endmarker). I need the minimum p-value for each marker in the set
across all segments matching that set and a specific set of people. A
given segment says "I cover all the markers from startbase to endbase",
and each marker has a specific base position (relative to a chromosome).
So I'm after the smallest p-value for each marker across the set of
segments which include that marker (hence the 'between' clause).

Context: I have the query in a function, so the ids of all the players
are available to the following sql:

select m.id as mkrid
, min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval
from marker m join segment s on m.basepos between s.startbase and s.endbase
and m.chrom = 1
and s.chrom = 1
and s.markerset_id = suppliedMarkersetId
join probandset r on s.probandset_id = r.id
and r.people_id = suppliedPeopleId
group by m.id

where the pv function is

create or replace function pv(l bigint, e bigint, g bigint, o int)
returns numeric
as
$$
select 1.0*(g+e+o)/(l+e+g+o);
$$
language sql
;
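
Just as a sanity check on what pv returns (showing the intended ratio):

select pv(10, 2, 3, 0);  -- (3 + 2 + 0) / (10 + 2 + 3 + 0) = 5/15, about 0.333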

I have the identical schema in two databases (same pg instance); the
table definitions involved are below. In one schema there are 1.7M
records in segment, in the other 40M rows. The marker tables are much
more similar in size, with 600K and 900K rows respectively. The third
table, probandset, has 60 and 600 rows respectively. On average there
are 0.8M and 1.8M segments per markerset_id.

The explains: first fast (12 sec), then slow (hours). The part which
sticks out to me is where the "between" gets used. (I'm betting that
probandset is too small to matter.) The slower explain plan is very
similar to what I saw originally in the now-"fast" data set, and the
current indexing strategy comes largely from that performance work.

It looks like I'm getting a Cartesian product between the number of
markers in a set and the number of segments found: ten zeros at least.
(A lateral rewrite I've been considering is sketched after the plans.)
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=291472.27..292040.58 rows=56831 width=48)
   Group Key: m.id
   ->  Nested Loop  (cost=3752.33..167295.52 rows=4515518 width=40)
         ->  Nested Loop  (cost=3751.90..17906.25 rows=715 width=32)
               ->  Seq Scan on probandset r  (cost=0.00..2.77 rows=4 width=16)
                     Filter: (people_id = '4e3b9829-43a8-4f84-9df6-f120dc5b1a7e'::uuid)
               ->  Bitmap Heap Scan on segment s  (cost=3751.90..4473.96 rows=191 width=48)
                     Recheck Cond: ((probandset_id = r.id) AND (chrom = 1) AND (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid))
                     ->  BitmapAnd  (cost=3751.90..3751.90 rows=191 width=0)
                           ->  Bitmap Index Scan on useg  (cost=0.00..72.61 rows=2418 width=0)
                                 Index Cond: ((probandset_id = r.id) AND (chrom = 1))
                           ->  Bitmap Index Scan on segment_markerset_id_idx  (cost=0.00..3676.23 rows=140240 width=0)
                                 Index Cond: (markerset_id = '61a7e5cb-b81d-42e4-9e07-6bd9c2fbe6d1'::uuid)
         ->  Index Scan using marker_chrom_basepos_idx on marker m  (cost=0.42..145.79 rows=6315 width=20)
               Index Cond: ((chrom = 1) AND (basepos >= s.startbase) AND (basepos <= s.endbase))
(15 rows)

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=83131331.81..83132151.44 rows=81963 width=48)
   Group Key: m.id
   ->  Nested Loop  (cost=1907.38..70802659.35 rows=448315362 width=40)
         Join Filter: ((m.basepos >= s.startbase) AND (m.basepos <= s.endbase))
         ->  Bitmap Heap Scan on marker m  (cost=1883.64..11009.18 rows=81963 width=20)
               Recheck Cond: (chrom = 1)
               ->  Bitmap Index Scan on marker_chrom_basepos_idx  (cost=0.00..1863.15 rows=81963 width=0)
                     Index Cond: (chrom = 1)
         ->  Materialize  (cost=23.74..181468.38 rows=49228 width=32)
               ->  Hash Join  (cost=23.74..181222.24 rows=49228 width=32)
                     Hash Cond: (s.probandset_id = r.id)
                     ->  Index Scan using segment_markerset_id_idx on segment s  (cost=0.56..178022.70 rows=251881 width=48)
                           Index Cond: (markerset_id = 'afad387e-a8e7-4c3b-9adb-3f00e70a13b3'::uuid)
                           Filter: (chrom = 1)
                     ->  Hash  (cost=21.68..21.68 rows=120 width=16)
                           ->  Seq Scan on probandset r  (cost=0.00..21.68 rows=120 width=16)
                                 Filter: (people_id = 'b124acb2-e400-4e63-8010-5a5f25a78f1c'::uuid)
(17 rows)
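
In case it helps frame the question, this is the shape of a lateral
rewrite I've been considering. It is a sketch only: it expresses the
same per-segment marker lookup that the fast plan does, but I have not
verified that the planner prefers it on the large data set
(suppliedMarkersetId and suppliedPeopleId again stand in for the
function parameters):

-- sketch: same logic as the original query, with the marker lookup
-- written per segment via LATERAL instead of a range join
select mk.id as mkrid
     , min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval
from segment s
join probandset r
  on s.probandset_id = r.id
 and r.people_id = suppliedPeopleId
cross join lateral (
    select m.id
    from marker m
    where m.chrom = 1
      and m.basepos between s.startbase and s.endbase
) mk
where s.chrom = 1
  and s.markerset_id = suppliedMarkersetId
group by mk.id;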

Table "base.marker"
Column | Type | Modifiers
---------+----------------+-----------
id | uuid | not null
name | text | not null
chrom | integer | not null
basepos | integer | not null
alleles | character(1)[] |
Indexes:
"marker_pkey" PRIMARY KEY, btree (id)
"marker_name_key" UNIQUE CONSTRAINT, btree (name)
"marker_basepos_idx" btree (basepos)
"marker_chrom_basepos_idx" btree (chrom, basepos)
Referenced by:
TABLE "markerset_member" CONSTRAINT "markerset_member_member_id_fkey" FOREIGN KEY (member_id) REFERENCES marker(id)

Table "aut.segment"
Column | Type | Modifiers
----------------+---------+--------------------
id | uuid | not null
chrom | integer | not null
markerset_id | uuid | not null
probandset_id | uuid | not null
startbase | integer | not null
endbase | integer | not null
firstmarker | integer | not null
lastmarker | integer | not null
events_less | bigint | not null default 0
events_equal | bigint | not null default 0
events_greater | bigint | not null default 0
Indexes:
"segment_pkey" PRIMARY KEY, btree (id)
"useg" UNIQUE CONSTRAINT, btree (probandset_id, chrom, startbase)
"segment_markerset_id_chrom_firstmarker_idx" btree (markerset_id, chrom, firstmarker)
"segment_markerset_id_idx" btree (markerset_id)
"segment_startbase_idx" btree (startbase)
Foreign-key constraints:
"segment_markerset_id_fkey" FOREIGN KEY (markerset_id) REFERENCES markerset(id)
"segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES probandset(id)
Referenced by:
TABLE "segmentset_member" CONSTRAINT "segmentset_member_segment_id_fkey" FOREIGN KEY (segment_id) REFERENCES segment(id)

Table "aut.probandset"
Column | Type | Modifiers
-------------+------------------+-----------
id | uuid | not null
name | text |
probands | uuid[] | not null
meioses | integer |
min_kincoef | double precision |
max_kincoef | double precision |
people_id | uuid | not null
Indexes:
"probandset_pkey" PRIMARY KEY, btree (id)
"probandsetunique" gin (probands)
Check constraints:
"sortedset" CHECK (issorteduuids(probands))
Foreign-key constraints:
"probandset_people_id_fkey" FOREIGN KEY (people_id) REFERENCES people(id)
Referenced by:
TABLE "probandset_group_member" CONSTRAINT "probandset_group_member_member_id_fkey" FOREIGN KEY (member_id) REFERENCES probandset(id)
TABLE "segment" CONSTRAINT "segment_probandset_id_fkey" FOREIGN KEY (probandset_id) REFERENCES probandset(id)

explain select m.id as mkrid
, min(pv(s.events_less, s.events_equal, s.events_greater, 0)) as optval
from marker m join segment s on m.basepos between s.startbase and s.endbase
and m.chrom = 1
and s.chrom = 1
and s.markerset_id = 'afad387e-a8e7-4c3b-9adb-3f00e70a13b3'
join probandset r on s.probandset_id = r.id
and r.people_id = 'b124acb2-e400-4e63-8010-5a5f25a78f1c'
group by m.id
