assertion failure at cost_memoize_rescan()

From: Kohei KaiGai <kaigai(at)heterodb(dot)com>
To: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: assertion failure at cost_memoize_rescan()
Date: 2024-06-14 12:54:34
Message-ID: CAOP8fzZnTU+N64UYJYogb1hN-5hFP+PwTb3m_cnGAD7EsQwrKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello,

I met an assertion failure, and identified the root of the problem, but no
idea how to fix it.

The location of the problematic Assert() is at cost_memoize_rescan() to
check 'hit_ratio' is between 0.0 and 1.0.
The 'calls' is provided by the caller, and 'ndistinct' is the result
of estimate_num_groups().

#4 0x000000000084d583 in cost_memoize_rescan (root=0x2e95748,
mpath=0x30aece8, rescan_startup_cost=0x7ffd72141260,
rescan_total_cost=0x7ffd72141258) at costsize.c:2564
/home/kaigai/source/pgsql-16/src/backend/optimizer/path/costsize.c:2564:83932:beg:0x84d583
(gdb) l
2559 * how many of those scans we expect to get a cache hit.
2560 */
2561 hit_ratio = ((calls - ndistinct) / calls) *
2562 (est_cache_entries / Max(ndistinct,
est_cache_entries));
2563
2564 Assert(hit_ratio >= 0 && hit_ratio <= 1.0);
2565
2566 /*
2567 * Set the total_cost accounting for the expected cache hit
ratio. We
2568 * also add on a cpu_operator_cost to account for a cache
lookup. This

(gdb) bt
#0 0x00007f3a39aa154c in __pthread_kill_implementation () from
/lib64/libc.so.6
#1 0x00007f3a39a54d06 in raise () from /lib64/libc.so.6
#2 0x00007f3a39a287f3 in abort () from /lib64/libc.so.6
#3 0x0000000000b6ff2c in ExceptionalCondition (conditionName=0xd28c28
"hit_ratio >= 0 && hit_ratio <= 1.0", fileName=0xd289a4 "costsize.c",
lineNumber=2564) at assert.c:66
#4 0x000000000084d583 in cost_memoize_rescan (root=0x2e95748,
mpath=0x30aece8, rescan_startup_cost=0x7ffd72141260,
rescan_total_cost=0x7ffd72141258) at costsize.c:2564
#5 0x0000000000850831 in cost_rescan (root=0x2e95748, path=0x30aece8,
rescan_startup_cost=0x7ffd72141260, rescan_total_cost=0x7ffd72141258) at
costsize.c:4350
#6 0x000000000084e333 in initial_cost_nestloop (root=0x2e95748,
workspace=0x7ffd721412d0, jointype=JOIN_INNER, outer_path=0x3090058,
inner_path=0x30aece8, extra=0x7ffd72141500) at costsize.c:2978
#7 0x0000000000860f58 in try_partial_nestloop_path (root=0x2e95748,
joinrel=0x30ae158, outer_path=0x3090058, inner_path=0x30aece8,
pathkeys=0x0, jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:887
#8 0x0000000000862a64 in consider_parallel_nestloop (root=0x2e95748,
joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390,
jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:2083
#9 0x000000000086273d in match_unsorted_outer (root=0x2e95748,
joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390,
jointype=JOIN_INNER, extra=0x7ffd72141500) at joinpath.c:1940
#10 0x00000000008600f0 in add_paths_to_joinrel (root=0x2e95748,
joinrel=0x30ae158, outerrel=0x308f428, innerrel=0x2eac390,
jointype=JOIN_INNER, sjinfo=0x7ffd721415f0, restrictlist=0x30ae5a8) at
joinpath.c:296
#11 0x0000000000864d10 in populate_joinrel_with_paths (root=0x2e95748,
rel1=0x308f428, rel2=0x2eac390, joinrel=0x30ae158, sjinfo=0x7ffd721415f0,
restrictlist=0x30ae5a8) at joinrels.c:925
#12 0x00000000008649e1 in make_join_rel (root=0x2e95748, rel1=0x308f428,
rel2=0x2eac390) at joinrels.c:776
#13 0x0000000000863ec1 in make_rels_by_clause_joins (root=0x2e95748,
old_rel=0x308f428, other_rels_list=0x3088ed0, other_rels=0x3088ee8) at
joinrels.c:312
#14 0x000000000086399a in join_search_one_level (root=0x2e95748, level=3)
at joinrels.c:123
#15 0x00000000008463f8 in standard_join_search (root=0x2e95748,
levels_needed=4, initial_rels=0x3088ed0) at allpaths.c:3454
#16 0x000000000084636d in make_rel_from_joinlist (root=0x2e95748,
joinlist=0x306b4f8) at allpaths.c:3385
#17 0x0000000000841548 in make_one_rel (root=0x2e95748, joinlist=0x306b4f8)
at allpaths.c:229
#18 0x00000000008806a9 in query_planner (root=0x2e95748,
qp_callback=0x886bcb <standard_qp_callback>, qp_extra=0x7ffd72141960) at
planmain.c:278
#19 0x0000000000882f5f in grouping_planner (root=0x2e95748,
tuple_fraction=0) at planner.c:1495
#20 0x000000000088268c in subquery_planner (glob=0x2e95348,
parse=0x2e90e98, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at
planner.c:1064
#21 0x0000000000880cdb in standard_planner (parse=0x2e90e98,
query_string=0x2e3a0e8 "explain\nselect sum(lo_revenue), d_year,
p_brand1\n from lineorder, date1, part, supplier\n where lo_orderdate =
d_datekey\n and lo_partkey = p_partkey\n and lo_suppkey = s_suppkey\n
and p_brand1"..., cursorOptions=2048,
boundParams=0x0) at planner.c:413

I tracked the behavior of estimate_num_groups() using gdb line-by-line to
observe how 'input_rows' is changed
and how it affects the result value.
According to the call trace, the problematic estimate_num_groups()
invocation is called with "input_rows=3251872.916666667",
then it was rounded up to 3251873 by the clamp_row_est(). Eventually, its
result value was calculated larger than the upper
limit, so the return value was suppressed by 3251873, but it is a tiny bit
larger than the input value!

Back to the cost_memoize_rescan().
The hit_ratio is calculated as follows:

hit_ratio = ((calls - ndistinct) / calls) *
(est_cache_entries / Max(ndistinct, est_cache_entries));

The "calls" is the "input_rows" above, and "ndistinct" is the return value
of the estimate_num_groups().
What happen if "ndistinct" is a tiny bit larger than "calls"?
In the results, the "hit_ratio" is calculated as a very small negative
value, then it was terminated by Assert().

How do we fix the logic? Please some ideas.

Best regards,
--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei <kaigai(at)heterodb(dot)com>

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michail Nikolaev 2024-06-14 13:30:55 Re: Issues with ON CONFLICT UPDATE and REINDEX CONCURRENTLY
Previous Message Alexander Korotkov 2024-06-14 12:46:15 Re: [HACKERS] make async slave to wait for lsn to be replayed