Re: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

From: Vijaykumar Jain <vijaykumarjain(dot)github(at)gmail(dot)com>
To: Pól Ua Laoínecháin <linehanp(at)tcd(dot)ie>
Cc: pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.
Date: 2021-06-04 11:27:55
Message-ID: CAM+6J97chj6Gyea_h03ec2YyN-3LeF4d0_NdFE0NRT_PY5pTJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

how is cost calculated?
postgres/costsize.c at master · postgres/postgres (github.com)
<https://github.com/postgres/postgres/blob/master/src/backend/optimizer/path/costsize.c>

row estimation
PostgreSQL: Documentation: 13: 70.1. Row Estimation Examples
<https://www.postgresql.org/docs/current/row-estimation-examples.html>

FYI, I just read it when I had queries like why rows were estimated when
the table has now rows.
but i think it requires a deeper reading into the source for cost
calculation which i did not do at that time as i did not understand :)
but incase above helps to start.

On Fri, 4 Jun 2021 at 16:29, Pól Ua Laoínecháin <linehanp(at)tcd(dot)ie> wrote:

> Hi all,
>
> Noticed this today - relatively simple query - table with 7 records
> (all code is shown at the bottom of this post and on the fiddle here):
>
>
> https://dbfiddle.uk/?rdbms=postgres_12&fiddle=efe73a37d29af43f33d2bc79de2b6c97
>
> Sample (2 of 7 records);
>
> ======================
> INSERT INTO test
> VALUES
> (1, 'Pól' , '2021-06-01', '2021-06-06'),
> (2, 'Bill' , '2021-06-02', '2021-06-10');
> =========================
>
> Query:
>
> ========================
> SELECT
> id,
> GENERATE_SERIES
> (t.start_date, t.end_date, '1 DAY')::DATE AS sd,
> t.end_date
> FROM test t
> ORDER BY t.id, t.start_date;
> =========================
>
> Now, the EXPLAIN (ANALYZE, BUFFERS) of this query is as follow - see
> fiddle - with 7 records:
>
> ====================
> QUERY PLAN
> Sort (cost=165708.62..168608.62 rows=1160000 width=10) (actual
> time=0.083..0.087 rows=42 loops=1)
> Sort Key: id, (((generate_series((start_date)::timestamp with time
> zone, (end_date)::timestamp with time zone, '1
> day'::interval)))::date)
> Sort Method: quicksort Memory: 26kB
> Buffers: shared hit=1
> -> Result (cost=0.00..29036.10 rows=1160000 width=10) (actual
> time=0.019..0.056 rows=42 loops=1)
> Buffers: shared hit=1
> -> ProjectSet (cost=0.00..5836.10 rows=1160000 width=14) (actual
> time=0.018..0.042 rows=42 loops=1)
> Buffers: shared hit=1
> -> Seq Scan on test t (cost=0.00..21.60 rows=1160 width=10) (actual
> time=0.008..0.010 rows=7 loops=1)
> Buffers: shared hit=1
> Planning Time: 0.061 ms
> Execution Time: 0.131 ms
> 12 rows
> ===========================
>
> Now, the first line of the PLAN has
>
> > Sort (cost=165708.62..168608.62 rows=1160000 width=10)
>
> and two other lines below this also contain the figure 1,160,000.
>
> Where *_on earth_* is PostgreSQL obtaining 1.16M rows? And where do
> the cost numbers come from for a query on 7 records?
>
> This query - a recursive CTE (no GENERATE_SERIES) as follows:
>
> ===================
> WITH RECURSIVE cte (id, sd, ed) AS
> (
> SELECT t.id, t.start_date, t.end_date
> FROM test t
> UNION ALL
> SELECT c.id, (c.sd + INTERVAL '1 DAY')::DATE, c.ed
> FROM cte c
> WHERE c.sd < (SELECT z.end_date FROM test z WHERE z.id = c.id)
> )
> SELECT * FROM cte c2
> ORDER BY c2.id, c2.sd, c2.ed;
> =============================
>
> gives a PLAN as follows:
>
> ==========================
> QUERY PLAN
> Sort (cost=955181.47..955281.05 rows=39830 width=10) (actual
> time=0.262..0.266 rows=42 loops=1)
> Sort Key: c2.id, c2.sd, c2.ed
> Sort Method: quicksort Memory: 26kB
> Buffers: shared hit=85
> CTE cte
> -> Recursive Union (cost=0.00..951341.55 rows=39830 width=10) (actual
> time=0.010..0.216 rows=42 loops=1)
> Buffers: shared hit=85
> -> Seq Scan on test t (cost=0.00..21.60 rows=1160 width=10) (actual
> time=0.009..0.010 rows=7 loops=1)
> Buffers: shared hit=1
> -> WorkTable Scan on cte c (cost=0.00..95052.33 rows=3867 width=10)
> (actual time=0.006..0.020 rows=4 loops=9)
> Filter: (sd < (SubPlan 1))
> Rows Removed by Filter: 1
> Buffers: shared hit=84
> SubPlan 1
> -> Index Scan using test_pkey on test z (cost=0.15..8.17 rows=1
> width=4) (actual time=0.003..0.003 rows=1 loops=42)
> Index Cond: (id = c.id)
> Buffers: shared hit=84
> -> CTE Scan on cte c2 (cost=0.00..796.60 rows=39830 width=10) (actual
> time=0.011..0.233 rows=42 loops=1)
> Buffers: shared hit=85
> Planning Time: 0.137 ms
> Execution Time: 0.324 ms
> 21 rows
> ===============================
>
> Now, this PLAN is more complicated - and I totally get that! However,
> where do these numbers:
>
> > (cost=955181.47..955281.05 rows=39830 width=10)
>
> come from?
>
> Now, we've gone from 1,160,000 to 39,830 rows (progress? :-) ) and a
> cost of ~ 1M (compared with 168k for the first query).
>
> I probably need to read up on the EXPLAIN (ANALYZE, BUFFERS)
> functionality - but I would appreciate:
>
> a) an (overview) explanation of what's going on here in particular and
>
> b) any good references to URLs, papers whatever which would be of
> benefit to a (hopefully) reasonably competent SQL programmer with a
> desire to grasp internals - how to interpret PostgreSQL's EXPLAIN
> functionality.
>
> TIA and rgs,
>
>
> Pól...
>
>
>

--
Thanks,
Vijay
Mumbai, India

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Ram Pratap Maurya 2021-06-04 11:46:21 RE: BUG #17046: Upgrade postgres 11 to 13 version
Previous Message Pól Ua Laoínecháin 2021-06-04 10:58:55 EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.