Irrelevant columns cause massive performance change

From: Craig James <cjames(at)emolecules(dot)com>
To: "pgsql-performa(dot)" <pgsql-performance(at)postgresql(dot)org>
Subject: Irrelevant columns cause massive performance change
Date: 2018-03-16 20:37:05
Message-ID: CAFwQ8rdgsRC+zJrcmqpRx8GeU0h_OSnhsuW8GQgLze9RzCUmYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Here's a weird one I can't figure out: the definitions of several columns
of a view, which are not used in a query at all, have a massive effect on
the query planner, causing it to choose a seqscan over the largest table in
our database when it should be using the primary key for the join.
Background: We've redesigned the tables that hold our primary data, but
need to create views that mimic the old design so that our applications
will continue working. The largest table ("chemaxon.sdf") holds the bulk of
the data, and we've changed it from raw text to gzipped bytea. To mimic the
old schema, I created a short Perl function that does a simple gunzip
operation, and used that in the definition of the view "str_conntab". (This
gzip reduces our total database size to about a third of the original --
it's very effective).

Here are two query plans. The first is horrible. For the second, I removed
the gunzip functions and replaced them with constant values. But notice
that these pseudo columns are not used anywhere in the query. (Even if they
were, I don't understand why this should affect the planner.)

The tables VERSION and VERSION_PROPERTIES are also views; I've included
their definitions and the underlying actual tables below.

Postgres 9.6.7 running on Ubuntu 16.04.

emolecules=> drop view str_conntab;
DROP VIEW
emolecules=> create view str_conntab as
emolecules-> (select
emolecules(> id,
emolecules(> length(gunzip(sdf_gzip)) as contab_len,
emolecules(> gunzip(sdf_gzip) as contab_data,
emolecules(> ''::text as normalized
emolecules(> from chemaxon.sdf);
CREATE VIEW

emolecules=> explain analyze
select VERSION.VERSION_ID, VERSION.ISOSMILES,
VERSION_PROPERTIES.MOLECULAR_WEIGHT, VERSION_PROPERTIES.MOLECULAR_FORMULA
from VERSION
join VERSION_PROPERTIES on (VERSION.VERSION_ID =
VERSION_PROPERTIES.VERSION_ID)
join STR_CONNTAB on (VERSION.VERSION_ID = STR_CONNTAB.ID)
where VERSION.VERSION_ID in
(1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884);

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=62.99..162425.77 rows=5 width=60) (actual
time=34.718..152828.351 rows=10 loops=1)
Join Filter: (s.id = p_1.id)
-> Nested Loop (cost=62.56..162422.84 rows=6 width=55) (actual
time=34.701..152828.289 rows=10 loops=1)
Join Filter: (s.id = parent.id)
-> Nested Loop (cost=62.14..162419.48 rows=7 width=51) (actual
time=34.694..152828.250 rows=10 loops=1)
Join Filter: (s.id = p.id)
-> Hash Join (cost=61.72..162415.16 rows=9 width=47)
(actual time=34.663..152828.110 rows=10 loops=1)
Hash Cond: (sdf.id = s.id)
-> Seq Scan on sdf (cost=0.00..158488.50 rows=281080
width=72) (actual time=33.623..152630.514 rows=281080 loops=1)
-> Hash (cost=61.59..61.59 rows=10 width=43) (actual
time=0.028..0.028 rows=10 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using smiles_pkey on smiles s
(cost=0.42..61.59 rows=10 width=43) (actual time=0.010..0.022 rows=10
loops=1)
Index Cond: (id = ANY
('{1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884}'::integer[]))
-> Index Only Scan using parent_pkey on parent p
(cost=0.42..0.47 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=10)
Index Cond: (id = sdf.id)
Heap Fetches: 10
-> Index Only Scan using parent_pkey on parent (cost=0.42..0.47
rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10)
Index Cond: (id = sdf.id)
Heap Fetches: 10
-> Index Scan using properties_pkey on properties p_1 (cost=0.42..0.48
rows=1 width=21) (actual time=0.003..0.004 rows=1 loops=10)
Index Cond: (id = sdf.id)
Planning time: 1.330 ms
Execution time: 152828.506 ms
(23 rows)

emolecules=> drop view str_conntab;
DROP VIEW
emolecules=> create view str_conntab as
emolecules-> (select
emolecules(> id,
emolecules(> 0::integer contab_len,
emolecules(> null::text as contab_data,
emolecules(> ''::text as normalized
emolecules(> from chemaxon.sdf);
CREATE VIEW
emolecules=> explain analyze
select VERSION.VERSION_ID, VERSION.ISOSMILES,
VERSION_PROPERTIES.MOLECULAR_WEIGHT, VERSION_PROPERTIES.MOLECULAR_FORMULA
from VERSION
join VERSION_PROPERTIES on (VERSION.VERSION_ID =
VERSION_PROPERTIES.VERSION_ID)
join STR_CONNTAB on (VERSION.VERSION_ID = STR_CONNTAB.ID)
where VERSION.VERSION_ID in
(1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884);
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=2.11..156.71 rows=5 width=60) (actual time=0.018..0.096
rows=10 loops=1)
Join Filter: (s.id = p_1.id)
-> Nested Loop (cost=1.69..153.77 rows=6 width=55) (actual
time=0.015..0.076 rows=10 loops=1)
Join Filter: (s.id = parent.id)
-> Nested Loop (cost=1.27..150.41 rows=7 width=51) (actual
time=0.012..0.059 rows=10 loops=1)
Join Filter: (s.id = p.id)
-> Nested Loop (cost=0.84..146.09 rows=9 width=47) (actual
time=0.008..0.037 rows=10 loops=1)
-> Index Scan using smiles_pkey on smiles s
(cost=0.42..61.59 rows=10 width=43) (actual time=0.003..0.016 rows=10
loops=1)
Index Cond: (id = ANY
('{1485909,1485889,1485903,1485887,1485892,1485900,1485895,1485898,1485906,1485884}'::integer[]))
-> Index Only Scan using sdf_pkey on sdf
(cost=0.42..8.44 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=10)
Index Cond: (id = s.id)
Heap Fetches: 10
-> Index Only Scan using parent_pkey on parent p
(cost=0.42..0.47 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=10)
Index Cond: (id = sdf.id)
Heap Fetches: 10
-> Index Only Scan using parent_pkey on parent (cost=0.42..0.47
rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=10)
Index Cond: (id = sdf.id)
Heap Fetches: 10
-> Index Scan using properties_pkey on properties p_1 (cost=0.42..0.48
rows=1 width=21) (actual time=0.001..0.002 rows=1 loops=10)
Index Cond: (id = sdf.id)
Planning time: 1.251 ms
Execution time: 0.147 ms
(22 rows)

The timing of the second query is excellent, and is what I expected. I
don't understand why including a function-defined column in the view would
have such a dramatic effect on the planner's ability to choose the sdf_pkey
index for the join.

Here are the view and table definitions:

emolecules=> \d+ version
View "registry.version"
Column | Type | Collation | Nullable | Default | Storage |
Description
------------+---------+-----------+----------+---------+----------+-------------
version_id | integer | | | | plain |
parent_id | integer | | | | plain |
isosmiles | text | | | | extended |
created | abstime | | | | plain |
View definition:
SELECT s.id AS version_id,
p.parent_id,
s.smiles AS isosmiles,
timenow() AS created
FROM chemaxon.smiles s
JOIN chemaxon.parent p ON s.id = p.id;

emolecules=> \d+ version_properties
View "registry.version_properties"
Column | Type | Collation | Nullable | Default |
Storage | Description
-------------------+--------------+-----------+----------+---------+----------+-------------
version_id | integer | | | | plain
|
molecular_weight | numeric(8,3) | | | | main
|
molecular_formula | text | | | |
extended |
mfcd | text | | | |
extended |
cas_number | text | | | |
extended |
View definition:
SELECT p.id AS version_id,
p.molecular_weight,
p.molecular_formula,
m.mfcd,
c.cas_number
FROM chemaxon.properties p
LEFT JOIN chemaxon.mfcd m USING (id)
LEFT JOIN chemaxon.cas_number c USING (id)
JOIN chemaxon.parent USING (id);

Table "chemaxon.smiles"
Column | Type | Collation | Nullable | Default | Storage | Stats
target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
id | integer | | not null | | plain |
|
smiles | text | | not null | | extended |
|
Indexes:
"smiles_pkey" PRIMARY KEY, btree (id)
"i_unique_smiles" UNIQUE, btree (smiles)

Table "chemaxon.parent"
Column | Type | Collation | Nullable | Default | Storage | Stats
target | Description
-----------+---------+-----------+----------+---------+---------+--------------+-------------
id | integer | | not null | | plain |
|
parent_id | integer | | not null | | plain |
|
Indexes:
"parent_pkey" PRIMARY KEY, btree (id)
"i_parent_parent_id" btree (parent_id)

emolecules=> \d chemaxon.cas_number
Table "chemaxon.cas_number"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
id | integer | | not null |
cas_number | text | | |
Indexes:
"cas_number_pkey" PRIMARY KEY, btree (id)
"i_cas_number_cas_number" btree (cas_number)

emolecules=> \d chemaxon.cas_number
Table "chemaxon.cas_number"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
id | integer | | not null |
cas_number | text | | |
Indexes:
"cas_number_pkey" PRIMARY KEY, btree (id)
"i_cas_number_cas_number" btree (cas_number)

And the function "gunzip" is defined in perl (unsafe Perl) as:

create or replace function gunzip(bytea) returns text as
$gunzip$
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
my $compressed = decode_bytea($_[0]);
my $uncompressed;
if (!gunzip(\$compressed, \$uncompressed)) {
return $GunzipError;
}
return $uncompressed;
$gunzip$
language plperlu;

Thanks!
Craig

--
---------------------------------
Craig A. James
Chief Technology Officer
eMolecules, Inc.
---------------------------------

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Andres Freund 2018-03-16 20:50:13 Re: Irrelevant columns cause massive performance change
Previous Message MichaelDBA 2018-03-16 13:42:34 Re: PG 9.6 Slow inserts with long-lasting LWLocks