From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>
Cc: Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Date: 2023-09-07 18:08:56
Message-ID: ZPoRuHWPZeu3qC/a@momjian.us
Lists: pgsql-hackers
Does anyone else have an opinion on this patch? It looks promising.
---------------------------------------------------------------------------
On Wed, Jun 28, 2023 at 04:53:06PM +0300, Alena Rybakina wrote:
> Hi!
>
> On 15.06.2023 11:30, Andrey Lepikhov wrote:
>
> Hi, all.
>
> Some of my clients use JOINs with three or four clauses. Quite
> frequently, I see complaints about an unreasonable switch of the JOIN
> algorithm to Merge Join instead of Hash Join. Quick research has shown
> one weak place: the estimation of the average bucket size in
> final_cost_hashjoin (see q2.sql in the attachment), which uses a very
> conservative strategy.
> Unlike the estimation of groups, here we take the smallest per-clause
> bucket size (i.e. the single largest ndistinct) instead of multiplying
> the per-clause estimates (or trying to do multivariate analysis).
> It works fine in the case of one clause. But if we have many clauses,
> each with a high ndistinct value, we will overestimate the average
> bucket size and, as a result, prefer Merge Join. As the example in the
> attachment shows, this leads to a worse plan than possible, sometimes
> drastically worse.
> I assume this is done out of fear of functional dependencies between
> the hash clause components, but in my opinion we should go the same
> way here as in the estimation of groups.
> The attached patch shows a sketch of the solution.
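>
> To illustrate the idea only, here is a minimal standalone sketch with
> a hypothetical helper name (this is not the actual
> final_cost_hashjoin() code). Each per-clause estimate is the fraction
> of inner tuples that fall into one hash bucket, roughly 1/ndistinct,
> and the question is how to combine the estimates of several clauses:
>
> #include <stdbool.h>
>
> /* Combine per-clause bucket-size fractions: either take the minimum
>  * (the single clause with the largest ndistinct, current behavior) or
>  * multiply them, assuming the clauses are independent (the sketch). */
> static double
> combine_bucketsizes(const double *thisbucketsize, int nclauses,
>                     bool multiply)
> {
>     double innerbucketsize = 1.0;
>
>     for (int i = 0; i < nclauses; i++)
>     {
>         if (multiply)
>             innerbucketsize *= thisbucketsize[i];
>         else if (thisbucketsize[i] < innerbucketsize)
>             innerbucketsize = thisbucketsize[i];
>     }
>     return innerbucketsize;
> }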
>
>
> This problem is very important.
>
> Honestly, I'm still learning your code and looking for cases in which
> your patch changes things for the worse or for the better, but I have
> already found something that seemed interesting to me: several cases
> where your patch helps choose a more correct plan, with a focus on
> memory consumption.
>
> To make it easier to evaluate, I added a GUC to your patch that makes
> it easy to switch between your way and the original way of estimating
> the bucket size (diff_estimate.diff).
>
> Here are other cases where your fix improves the query plan.
>
>
> First of all, I changed the way the tables are created, so that I
> could look at the behavior of the query plan in terms of planning and
> execution time:
>
> DROP TABLE IF EXISTS a,b CASCADE;
> CREATE TABLE a AS
> SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
> FROM generate_series(1,1e5) AS gs;
> CREATE TABLE b AS
> SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
> FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
>
> SET enable_cost_size = 'on';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
>
>                                  QUERY PLAN
> ---------------------------------------------------------------------------
>  Hash Join (actual time=200.872..200.879 rows=0 loops=1)
>    Hash Cond: ((b.x = a.x) AND (b.y = a.y) AND (b.z = a.z))
>    ->  Seq Scan on b (actual time=0.029..15.946 rows=100000 loops=1)
>    ->  Hash (actual time=97.645..97.649 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 5612kB
>          ->  Seq Scan on a (actual time=0.024..17.153 rows=100000 loops=1)
>  Planning Time: 2.910 ms
>  Execution Time: 201.949 ms
> (8 rows)
>
> SET
>                                  QUERY PLAN
> ---------------------------------------------------------------------------
>  Merge Join (actual time=687.415..687.416 rows=0 loops=1)
>    Merge Cond: ((b.y = a.y) AND (b.x = a.x) AND (b.z = a.z))
>    ->  Sort (actual time=462.022..536.716 rows=100000 loops=1)
>          Sort Key: b.y, b.x, b.z
>          Sort Method: external merge  Disk: 3328kB
>          ->  Seq Scan on b (actual time=0.017..12.326 rows=100000 loops=1)
>    ->  Sort (actual time=111.295..113.196 rows=16001 loops=1)
>          Sort Key: a.y, a.x, a.z
>          Sort Method: external sort  Disk: 2840kB
>          ->  Seq Scan on a (actual time=0.020..10.129 rows=100000 loops=1)
>  Planning Time: 0.752 ms
>  Execution Time: 688.829 ms
> (12 rows)
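>
> In table a each of x, y and z has 100 distinct values, so the
> minimum-based estimate keeps the bucket fraction at about 1/100, while
> multiplying the three clause estimates gives about 1/10^6, which lets
> the planner see how cheap the hash join really is. Here the hash join
> finishes in about 202 ms, while the merge join spends most of its
> 689 ms in the two external sorts, so the better plan is roughly 3.4
> times faster.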
>
> Secondly, I found another case, not related to the planner preferring
> Merge Join over Hash Join: here we can see that the plan has become
> better because it consumes less memory.
>
> With the same query, the number of buckets is reduced by 128 times
> (1024 vs. 131072), so memory consumption also decreased (34kB vs.
> 6474kB):
>
> DROP TABLE IF EXISTS a,b CASCADE;
>
> CREATE TABLE a AS
> SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
> FROM generate_series(1,600) AS gs;
> CREATE TABLE b AS
> SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
> FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
>
> SET enable_cost_size = 'on';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a,b
> WHERE a.x=b.x AND a.y=b.y AND a.z=b.z;
>
>                                                     QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=20.50..3157.58 rows=8 width=32) (actual time=95.648..95.651 rows=0 loops=1)
>    Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric))
>    ->  Seq Scan on b  (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.027..17.980 rows=100000 loops=1)
>    ->  Hash  (cost=10.00..10.00 rows=600 width=12) (actual time=2.046..2.047 rows=600 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 34kB
>          ->  Seq Scan on a  (cost=0.00..10.00 rows=600 width=12) (actual time=0.022..0.315 rows=600 loops=1)
>  Planning Time: 0.631 ms
>  Execution Time: 95.730 ms
> (8 rows)
>
> SET
>                                                        QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=3387.00..8621.58 rows=8 width=32) (actual time=102.873..102.877 rows=0 loops=1)
>    Hash Cond: (((a.x)::numeric = b.x) AND ((a.y)::numeric = b.y) AND ((a.z)::numeric = b.z))
>    ->  Seq Scan on a  (cost=0.00..10.00 rows=600 width=12) (actual time=0.014..0.131 rows=600 loops=1)
>    ->  Hash  (cost=1637.00..1637.00 rows=100000 width=20) (actual time=101.920..101.921 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 6474kB
>          ->  Seq Scan on b  (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.013..16.349 rows=100000 loops=1)
>  Planning Time: 0.153 ms
>  Execution Time: 103.518 ms
> (8 rows)
>
> I also want to show the improvement for a left (or right) outer join:
>
> DROP TABLE IF EXISTS a,b CASCADE;
>
> CREATE TABLE a AS
> SELECT ((3*gs) % 300) AS x, ((3*gs+1) % 300) AS y, ((3*gs+2) % 300) AS z
> FROM generate_series(1,600) AS gs;
> CREATE TABLE b AS
> SELECT gs % 90 AS x, gs % 49 AS y, gs % 100 AS z, 'abc' || gs AS payload
> FROM generate_series(1,1e5) AS gs;
> ANALYZE a,b;
>
>
> SET enable_cost_size = 'on';
>
> EXPLAIN ANALYZE
> SELECT * FROM a right join b
> on a.x=b.x AND a.y=b.y AND a.z=b.z;
>
> SET enable_cost_size = 'off';
> EXPLAIN ANALYZE
> SELECT * FROM a right join b
> on a.x=b.x AND a.y=b.y AND a.z=b.z;
>
>                                                     QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------
>  Hash Left Join  (cost=20.50..3157.58 rows=100000 width=32) (actual time=1.846..102.264 rows=100000 loops=1)
>    Hash Cond: ((b.x = (a.x)::numeric) AND (b.y = (a.y)::numeric) AND (b.z = (a.z)::numeric))
>    ->  Seq Scan on b  (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.041..15.328 rows=100000 loops=1)
>    ->  Hash  (cost=10.00..10.00 rows=600 width=12) (actual time=1.780..1.781 rows=600 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 34kB
>          ->  Seq Scan on a  (cost=0.00..10.00 rows=600 width=12) (actual time=0.031..0.252 rows=600 loops=1)
>  Planning Time: 0.492 ms
>  Execution Time: 107.609 ms
> (8 rows)
>
> SET
>                                                        QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------
>  Hash Right Join  (cost=3387.00..8500.08 rows=100000 width=32) (actual time=80.919..101.613 rows=100000 loops=1)
>    Hash Cond: (((a.x)::numeric = b.x) AND ((a.y)::numeric = b.y) AND ((a.z)::numeric = b.z))
>    ->  Seq Scan on a  (cost=0.00..10.00 rows=600 width=12) (actual time=0.017..0.084 rows=600 loops=1)
>    ->  Hash  (cost=1637.00..1637.00 rows=100000 width=20) (actual time=80.122..80.123 rows=100000 loops=1)
>          Buckets: 131072  Batches: 1  Memory Usage: 6474kB
>          ->  Seq Scan on b  (cost=0.00..1637.00 rows=100000 width=20) (actual time=0.015..11.819 rows=100000 loops=1)
>  Planning Time: 0.194 ms
>  Execution Time: 104.662 ms
> (8 rows)
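>
> In this outer-join case the two execution times are close (107.6 ms
> vs. 104.7 ms), but with the patched estimation the planner hashes the
> small table a (Memory Usage: 34kB) instead of b (6474kB), i.e. the
> hash table takes roughly 190 times less memory.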
>
> --
> Regards,
> Alena Rybakina
> Postgres Professional
>
> diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
> index ef475d95a18..31771dfba46 100644
> --- a/src/backend/optimizer/path/costsize.c
> +++ b/src/backend/optimizer/path/costsize.c
> @@ -153,6 +153,7 @@ bool enable_parallel_hash = true;
> bool enable_partition_pruning = true;
> bool enable_presorted_aggregate = true;
> bool enable_async_append = true;
> +bool enable_cost_size = true;
>
> typedef struct
> {
> @@ -4033,11 +4034,22 @@ final_cost_hashjoin(PlannerInfo *root, HashPath *path,
>  				thismcvfreq = restrictinfo->left_mcvfreq;
>  			}
>  
> +			if (enable_cost_size)
> +			{
> +				innerbucketsize *= thisbucketsize;
> +				innermcvfreq *= thismcvfreq;
> +			}
> +			else
> +			{
>  			if (innerbucketsize > thisbucketsize)
>  				innerbucketsize = thisbucketsize;
>  			if (innermcvfreq > thismcvfreq)
>  				innermcvfreq = thismcvfreq;
> +			}
>  		}
> +
> +		if (enable_cost_size && innerbucketsize < 1.0 / virtualbuckets)
> +			innerbucketsize = 1.0 / virtualbuckets;
>  	}
>
> /*
> diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
> index 71e27f8eb05..ded9ba3b7a9 100644
> --- a/src/backend/utils/misc/guc_tables.c
> +++ b/src/backend/utils/misc/guc_tables.c
> @@ -1007,6 +1007,19 @@ struct config_bool ConfigureNamesBool[] =
> true,
> NULL, NULL, NULL
> },
> +	{
> +		{"enable_cost_size", PGC_USERSET, QUERY_TUNING_OTHER,
> +			gettext_noop("Multiplies per-clause hash bucket size estimates "
> +						 "instead of taking only the most selective clause. "
> +						 "When off, the original, more conservative "
> +						 "estimation of the inner bucket size is used."),
> +			NULL,
> +			GUC_EXPLAIN
> +		},
> +		&enable_cost_size,
> +		true,
> +		NULL, NULL, NULL
> +	},
> {
> {"enable_async_append", PGC_USERSET, QUERY_TUNING_METHOD,
> gettext_noop("Enables the planner's use of async append plans."),
> diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
> index 6cf49705d3a..c79ec12e6d5 100644
> --- a/src/include/optimizer/cost.h
> +++ b/src/include/optimizer/cost.h
> @@ -71,6 +71,7 @@ extern PGDLLIMPORT bool enable_partition_pruning;
> extern PGDLLIMPORT bool enable_presorted_aggregate;
> extern PGDLLIMPORT bool enable_async_append;
> extern PGDLLIMPORT int constraint_exclusion;
> +extern PGDLLIMPORT bool enable_cost_size;
>
> extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
> double index_pages, PlannerInfo *root);
--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EDB https://enterprisedb.com
Only you can decide what is important to you.