Re: Query is slower with a large proportion of NULLs in several columns

From: Lars Bergeson <larsavatar(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Query is slower with a large proportion of NULLs in several columns
Date: 2021-12-21 04:11:42
Message-ID: CAK66RGbd8n46xEp3fTKOa3pO3eC=NT8tEHHZYjuNZq4Y0SQHTw@mail.gmail.com
Lists: pgsql-performance

ok, here are results after I did:
set max_parallel_workers_per_gather = 0;

no nulls table is 11.462 GB:
QUERY PLAN
HashAggregate  (cost=1676432.13..1676432.16 rows=3 width=15) (actual time=19908.343..19908.345 rows=5 loops=1)
  Group Key: roys_creation_user
  Batches: 1  Memory Usage: 24kB
  I/O Timings: read=532369.898
  ->  Seq Scan on royalty_no_null  (cost=0.00..1583887.42 rows=18508942 width=7) (actual time=0.013..16705.734 rows=18508470 loops=1)
        I/O Timings: read=532369.898
Settings: effective_cache_size = '21553496kB', maintenance_io_concurrency = '1', max_parallel_workers_per_gather = '0', search_path = 'public, public, "$user"'
Planning Time: 0.056 ms
Execution Time: 19908.383 ms

with nulls table is 9.120 GB:
QUERY PLAN
HashAggregate  (cost=1390580.70..1390580.72 rows=2 width=15) (actual time=30369.758..30369.761 rows=5 loops=1)
  Group Key: roys_creation_user
  Batches: 1  Memory Usage: 24kB
  I/O Timings: read=6440851.540
  ->  Seq Scan on royalty_with_null  (cost=0.00..1298048.80 rows=18506380 width=7) (actual time=0.015..25525.104 rows=18508470 loops=1)
        I/O Timings: read=6440851.540
Settings: effective_cache_size = '21553496kB', maintenance_io_concurrency = '1', max_parallel_workers_per_gather = '0', search_path = 'public, public, "$user"'
Planning Time: 0.060 ms
Execution Time: 30369.796 ms

Still taking 10X more I/O to read the smaller table. Very odd.
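As a sanity check on that ratio, here is the plain arithmetic using the I/O Timings values copied from the two plans above (nothing PostgreSQL-specific; the "10X" turns out to be closer to 12X):

```python
# I/O read timings (ms) reported in the two EXPLAIN plans above
no_nulls_read = 532369.898      # royalty_no_null, 11.462 GB
with_nulls_read = 6440851.540   # royalty_with_null, 9.120 GB

ratio = with_nulls_read / no_nulls_read
print(f"{ratio:.1f}x")          # smaller table reports ~12.1x more read time
```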

Regarding the earlier comment from David Johnston: if I put null values in
just one of the columns for all rows, that should force a null bitmap to be
created for every row, with the same amount of bitmap checking required.
However, the query still runs faster even though the table is larger.

The with-nulls table is 11.604 GB when every column is filled except one
column that is mostly nulls. The extra 0.14 GB (11.604 GB - 11.462 GB) is
probably space consumed by null bitmaps:
QUERY PLAN
HashAggregate  (cost=1693765.03..1693765.06 rows=3 width=15) (actual time=26452.653..26452.655 rows=5 loops=1)
  Group Key: roys_creation_user
  Batches: 1  Memory Usage: 24kB
  I/O Timings: read=2706123.209
  ->  Seq Scan on royalty_with_null_cols_filled  (cost=0.00..1601218.02 rows=18509402 width=7) (actual time=0.014..22655.366 rows=18508470 loops=1)
        I/O Timings: read=2706123.209
Settings: effective_cache_size = '21553496kB', maintenance_io_concurrency = '1', max_parallel_workers_per_gather = '0', search_path = 'public, public, "$user"'
Planning Time: 0.068 ms
Execution Time: 26452.691 ms

It seems to be the actual presence of null values that slows things down,
even when the same sized null bitmap exists for each row.
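A back-of-the-envelope check supports the null-bitmap space estimate: 0.142 GB over the 18,508,470 rows reported in the plans is about 8 bytes per tuple, which is what MAXALIGN padding of the 23-byte tuple header would add once a bitmap is present, provided the table has more than 8 columns. The column count isn't shown in this thread, so the 10-column figure in the sketch below is purely illustrative:

```python
import math

ROWS = 18_508_470
GB = 1024 ** 3

# Table growth observed when one column went from all-filled to mostly null:
extra_bytes = (11.604 - 11.462) * GB
per_row = extra_bytes / ROWS
print(f"{per_row:.1f} bytes/row")  # ~8.2 bytes per tuple

# Simplified tuple-header model: 23 bytes, plus ceil(ncols/8) bitmap bytes
# when any column is null, rounded up to MAXALIGN (8 bytes on most platforms).
def header_size(ncols, has_nulls, maxalign=8):
    base = 23 + (math.ceil(ncols / 8) if has_nulls else 0)
    return -(-base // maxalign) * maxalign  # round up to a MAXALIGN boundary

# For a hypothetical 10-column table the bitmap costs exactly 8 bytes/row:
print(header_size(10, True) - header_size(10, False))  # 8
# ...while with 8 or fewer columns it hides in the alignment padding:
print(header_size(8, True) - header_size(8, False))    # 0
```

This also explains why a narrow table can carry null bitmaps for free: up to 8 attributes, the bitmap byte fits inside padding that was already there.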

On Mon, Dec 20, 2021 at 5:51 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Lars Bergeson <larsavatar(at)gmail(dot)com> writes:
> > What is it about null values in the table that slows down the full table
> > scan?
>
> If a row has any nulls, then it contains a "nulls bitmap" [1] that says
> which columns are null, and that bitmap has to be consulted while
> walking through the row contents. So the most obvious theory here
> is that that adds overhead that's significant in your case. But there
> are some holes in that theory, mainly that the I/O timings you are
> showing don't seem very consistent:
>
> no nulls:
> > I/O Timings: read=1884365.335
> > Execution Time: 11135.368 ms
>
> with nulls:
> > I/O Timings: read=17141420.771
> > Execution Time: 25407.318 ms
>
> Regardless of CPU time required, it should not take 10X less I/O
> time to read a physically larger table. So there's something
> fairly bogus going on there. One thing you might try is disabling
> parallelism (set max_parallel_workers_per_gather = 0) to see if
> that's confusing the numbers somehow.
>
> regards, tom lane
>
> [1]
> https://www.postgresql.org/docs/current/storage-page-layout.html#STORAGE-TUPLE-LAYOUT
>
