Re: Query Performance / Planner estimate off

From: Mats Olsen <mats(at)duneanalytics(dot)com>
To: Michael Lewis <mlewis(at)entrata(dot)com>
Cc: Sebastian Dressler <sebastian(at)swarm64(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query Performance / Planner estimate off
Date: 2020-10-22 06:21:46
Message-ID: 0e7bd7e2-84c9-f261-9129-6eb4cf380b48@duneanalytics.com
Lists: pgsql-performance


On 10/21/20 5:29 PM, Michael Lewis wrote:
>
>
> On Wed, Oct 21, 2020, 8:42 AM Mats Olsen <mats(at)duneanalytics(dot)com> wrote:
>
>
> On 10/21/20 2:38 PM, Sebastian Dressler wrote:
>> Hi Mats,
>>
>>> On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats(at)duneanalytics(dot)com> wrote:
>>>
>>> [...]
>>>
>>> 1) Vanilla plan (16 min): https://explain.depesz.com/s/NvDR
>>> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
>>> 3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx
>>>
>>> How can I get Postgres not to loop over 12M rows?
>>
>> I looked at the plans and your config, and I have a few thoughts:
>>
>> - The row estimate is off, as you may have noticed. This can
>> possibly be solved by raising `default_statistics_target` to e.g.
>> 2500 (we typically use that) and running ANALYZE.
> I've `set default_statistics_target=2500` and run ANALYZE on both
> tables involved, but unfortunately the plan is the same. The columns
> we use for joining here are hashes, and we expect very few
> duplicates in the tables. Hence I think more detailed statistics
> (most common values and histogram bounds) aren't useful for this
> kind of data. Would you say the same thing?
>
>
> Have you checked whether ndistinct is roughly accurate? It can be
> set manually on a column, or set to a negative value, in which case
> the distinct count is calculated as a fraction of reltuples.
Thank you for your reply!

I included ndistinct-counts in the gist: see
https://gist.githubusercontent.com/mewwts/9f11ae5e6a5951593b8999559f5418cf/raw/24ca1f227940b48842a03435b731f82364f3576d/stats%2520Mint
and
https://gist.githubusercontent.com/mewwts/9f11ae5e6a5951593b8999559f5418cf/raw/24ca1f227940b48842a03435b731f82364f3576d/stats%2520transactions.
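
For reference, values like these can be pulled from pg_stats with
something along the lines of:

    SELECT tablename, attname, n_distinct
    FROM pg_stats
    WHERE tablename IN ('Pair_evt_Mint', 'transactions');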

The join keys `transactions.hash` (unique) and
`"Pair_evt_Mint".evt_tx_hash` (nearly unique) both have ndistinct=-1,
which makes sense to me. The Mint table has -0.8375 for
evt_block_time, whereas `select count(distinct
evt_block_time)::numeric/count(*) from uniswap_v2."Pair_evt_Mint";`
returns 0.56. Should I adjust that one?
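
If adjusting it is the way to go, I assume it would look something
like this (a negative n_distinct is interpreted as a fraction of the
row count, and the override takes effect at the next ANALYZE):

    ALTER TABLE uniswap_v2."Pair_evt_Mint"
        ALTER COLUMN evt_block_time SET (n_distinct = -0.56);
    ANALYZE uniswap_v2."Pair_evt_Mint";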

Many of the other ndistinct values for `transactions` seem strange,
given that it's a giant (partitioned) table, but I don't know enough
about the statistics to draw any conclusions from them. What do you
think?
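
For completeness, the statistics-target experiment above amounted to
roughly the following (schema qualification for `transactions`
omitted; ANALYZE picks up the session-level setting):

    SET default_statistics_target = 2500;
    ANALYZE uniswap_v2."Pair_evt_Mint";
    ANALYZE transactions;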
