Quick Links

Performance of lateral join

From:	Simen Andreas Andreassen Lønsethagen <simen(dot)lonsethagen(at)fremtind(dot)no>
To:	"pgsql-performance(at)lists(dot)postgresql(dot)org" <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject:	Performance of lateral join
Date:	2021-07-26 13:56:54
Message-ID:	A67CAB4B-C442-4641-BDF1-EDBE1DBF5876@fremtind.no
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Hi, first time posting, hope I have included the relevant information.

I am trying to understand the performance of a query which is intended to retrieve a subset of the following table:

dpd=> SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='bis_person_alle_endringer';
relname | relpages | reltuples | relallvisible | relkind | relnatts | relhassubclass | reloptions | pg_table_size
---------------------------+----------+-------------+---------------+---------+----------+----------------+------------+---------------
bis_person_alle_endringer | 9367584 | 1.09584e+08 | 6392129 | r | 106 | f | | 76760489984
(1 row)

I have ommitted most of the columns, as there are 106 columns in total. The ommitted columns have data types text, numeric or date, all are nullable.

To create the subsets, I (or rather my application) will receive lists of records which should be matched according to some business logic. Each of these lists will be read into a temporary table:

The subset is then created by the following query, which finds the records in contracts.bis_person_alle_endringer which satisfies the business logic (if any).

select * from records_to_filter_on r
left join lateral (
select * from contracts.bis_person_alle_endringer b
where b.dpd_bis_foedselsnummer = r.foedselsnummer AND
r.kjent_i_system_dato >= b.dpd_endret_tidspunkt AND
r.tariff_dato > b.dpd_gyldig_fra_dato
order by b.dpd_gyldig_fra_dato desc, b.dpd_endret_tidspunkt desc
limit 1
) c on true
where person_id is not null and
r.versjons_dato < c.dpd_i_kraft_til_dato

The temporary table records_to_filter_on and the result of the above query will typically contain 1-5 million rows (the returned subsets are used for training machine learning models).

I've created a sample data set with 3.75 million rows and run EXPLAIN (ANALYZE, BUFFERS) on the query, https://explain.dalibo.com/plan/U41 (and also attached). Running the full EXPLAIN (ANALYZE, BUFFERS) takes about 30 minutes, which seems quite slow. However, as I am new to postgres, I find it difficult to interpret the output of the EXPLAIN (ANALYZE, BUFFERS) - most of the time is spent during an index scan, which to my understanding is "good". However, I don't think I understand postgres well enough to judge whether this is the best I can achieve (or at last close enough) or if the query should be rewritten. Alternatively, is it not realistic to expect faster performance given the size of the table and the hardware of the database instance?

I am running PostgreSQL 11.9 on x86_64-pc-linux-gnu using AWS Aurora on a db.t3.large instance (https://aws.amazon.com/rds/instance-types/) The output of

SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');

is attached in pg_settings.conf.

I realize that these questions are a little vague, but any guidance would be much appreciated.

Thanks, Simen Lønsethagen

Responses

Re: Performance of lateral join at 2021-07-26 18:17:45 from Justin Pryzby

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Justin Pryzby	2021-07-26 18:17:45	Re: Performance of lateral join
Previous Message	ldh@laurent-hasson.com	2021-07-26 05:12:30	RE: Big performance slowdown from 11.2 to 13.3