Re: Merge David and Goliath tables efficiently

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: nicolas paris <nicolas(dot)paris(at)riseup(dot)net>, pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Merge David and Goliath tables efficiently
Date: 2023-06-17 23:48:51
Message-ID: 07f2d1b3-c159-dee9-a059-1f71027fe8cd@enterprisedb.com
Lists: pgsql-performance

On 6/17/23 23:42, nicolas paris wrote:
>>> My interpretation reading the query plan is: well sized small
>>> batches of upserts leverage the indexes while the regular join
>>> choose the sequential scan, including sorting and hashing which
>>> takes forever time and resources including disk.
>>
>> You may be right, but it's hard to tell without seeing the query
>> plan.
>
> Here are part of both plans:
>

I don't understand why you're sharing just a part of the plan and not
the whole thing, ideally including the actual update ... Giving us the
info in small pieces just means we need to guess and speculate.
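For completeness, a full plan for the whole statement would look something like the sketch below. The table and column names here are assumptions based on the thread (list_id as the join key, col1 as a placeholder for the updated columns); substitute the actual statement:

```sql
-- Hypothetical sketch: EXPLAIN the complete UPDATE, not a plan fragment.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE goliath ca
SET    col1 = david.col1
FROM   david
WHERE  ca.list_id = david.list_id;
```

ANALYZE shows actual row counts next to the estimates, and BUFFERS shows how much I/O each node did, which is exactly what's needed to see where the time goes.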

> Bad case (strategy 2.1):
>
> -> Merge Left Join (cost=530202629.03..255884257913.32
> rows=17023331531230 width=579)
> Merge Cond: (david.list_id = ca.list_id)
> -> Sort (cost=2019172.91..2024398.82 rows=2090361 width=569)
> Sort Key: david.list_id
> -> Append (cost=0.00..192152.41 rows=2090361 width=569)
> -> Seq Scan on david_0 david_1 (cost=0.00..1812.52
> rows=20852 width=569)
> -> Seq Scan on david_1 david_2 (cost=0.00..1800.09
> rows=20709 width=569)
> -> Seq Scan on david_2 david_3 (cost=0.00..1794.44
> rows=20644 width=569)
>

Well, I kinda doubt you have 17023331531230 rows (not even physically
possible with 2TB disk), so that's immediately suspicious. I'd bet the
UPDATE ... FROM ... is missing a condition or something like that, which
results in a cartesian product.
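For illustration (with hypothetical column names, guessed from the index names in the plans), the difference can be a single predicate:

```sql
-- Suspicious form: joins only on list_id. If list_id is not unique,
-- every david row matches many goliath rows and the row estimate explodes.
UPDATE goliath ca
SET    col1 = david.col1
FROM   david
WHERE  ca.list_id = david.list_id;

-- With the composite key (list_id, account_id) that the index names
-- goliath_23_list_id_account_id_idx suggest, the join is one-to-one
-- and the estimate stays sane.
UPDATE goliath ca
SET    col1 = david.col1
FROM   david
WHERE  ca.list_id = david.list_id
  AND  ca.account_id = david.account_id;
```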

> Good case (strategy 3):
>
> Merge on goliath_23 ca (cost=2139.75..11077.17 rows=0 width=0)
> -> Nested Loop Left Join (cost=2139.75..11077.17 rows=1000
> width=575)
> -> Limit (cost=2139.19..2495.67 rows=1000 width=569)
> -> Index Scan using david_23_list_id_account_id_idx on
> david_23 (cost=0.29..6794.16 rows=19058 width=569)
> -> Index Scan using goliath_23_list_id_account_id_idx on
> goliath_23 ca (cost=0.56..8.56 rows=1 width=14)
> Index Cond: (list_id = david_23.list_id)
>
>>
>> Sounds very much like you'd benefit from tuning some cost parameters
>> to make the index scan look cheaper.
>> Not sure what 'batched indexed join' would be, but it very much
>> sounds like a nested loop with an index scan.
>
> Agreed, a 2M-row nested loop over an index scan would likely work as
> well. Could tuning the cost parameters lead to such a large nested
> loop?
>

It should be possible, but let's first see if there are other problems
in the query itself. If it's generating a cartesian product, tuning
parameters is pointless.
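Once the query itself is confirmed correct, a session-level experiment along these lines would show whether cheaper index scans change the plan. The values are typical SSD settings, not a prescription, and effective_cache_size is a rough guess for an 8GB instance:

```sql
-- default random_page_cost of 4.0 assumes spinning disks; SSDs are
-- much closer to sequential cost
SET random_page_cost = 1.1;
-- roughly what an 8GB instance can keep cached
SET effective_cache_size = '6GB';
-- then re-run EXPLAIN on the full statement and see whether the planner
-- switches from the merge join / seq scans to a nested loop with index scans
```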

>> What PostgreSQL version are you using, what hardware? Did you tune it
>> in
>> any way, or is everything just default?
>
> It is pg 15.3, on 2 cores / 8GO / 2TO ssds, with defaults cloud
> provider parameters (RDS).
>

I assume 2TO is 2TB?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
