Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题

From: "zhang(dot)wensheng" <zhang(dot)wensheng(at)foxmail(dot)com>
To: 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>, "pgsql-zh-general(at)postgresql(dot)org" <pgsql-zh-general(at)postgresql(dot)org>
Cc: 德哥 <digoal(at)126(dot)com>, 李海龙 <hailong(dot)li(at)qunar(dot)com>, "held911(at)163(dot)com" <held911(at)163(dot)com>, zhang(dot)wensheng(at)fomail(dot)com
Subject: Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
Date: 2017-03-24 10:39:13
Message-ID: 8922114d-f7b0-a4f5-3546-4a75d54c6eb0@foxmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-zh-general

非常感谢!

里面有两个地方比较棘手:
1、ORDER BY latest_location::geometry <->
'0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC
如果去掉排序,上一个查询计划中的 loops=40263 就可以解决,最后得到:Execution
time: 13.872 ms

2、NOT EXISTS (SELECT r.other_user_id FROM rel_8192_60.relationships r
WHERE r.user_id = 36012092 AND r.other_user_id = u.id AND ( r.state !=
'default' OR r.other_state = 'disliked'))
如果去掉这个条件,同上。

EXPLAIN (costs on,verbose on,analyze on,buffers off)
SELECT id FROM user_search.users u
WHERE
NOT EXISTS (SELECT r.other_user_id FROM rel_8192_60.relationships r
WHERE r.user_id = 36012092 AND r.other_user_id = u.id AND ( r.state !=
'default' OR r.other_state = 'disliked'))
AND status = 'default'
AND id NOT IN (SELECT UNNEST(q.*) FROM
user_search.select_contact_user_ids(36012092) AS q)
AND last_activity > current_timestamp at time zone 'UTC' - INTERVAL '5 day'
AND (looking_for_gender = 'both' OR looking_for_gender =
'female'::user_search.gender) AND ('male'::user_search.gender = 'both'
OR gender = 'male'::user_search.gender)
AND latest_location IS NOT NULL

AND
birthdate BETWEEN DATE(current_timestamp at time zone 'UTC'
- (26 + 1|| 'years')::interval + '1 day'::interval) AND
DATE(current_timestamp at time zone 'UTC' - (20 || 'years')::interval)
AND
int4range(search_min_age,search_max_age) @>
date_part('year', age('1995-03-16'::date))::int

AND search_radius >
ST_Distance('0101000000DFE00B93A9425E409BE61DA7E8283E40'::geography,
latest_location::geography)
ORDER BY latest_location::geometry <->
'0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC
LIMIT 1000;

---- =============================================
QUERY PLAN
---- =============================================
-------------------------------------------------------------------------------------------------

Limit (cost=2.04..2349.31 rows=80 width=12) (actual
time=227.526..865.751 rows=1000 loops=1)
Output: u.id, ((u.latest_location <->
'0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry))
-> Nested Loop Anti Join (cost=2.04..2349.31 rows=80 width=12)
(actual time=227.524..865.585 rows=1000 loops=1)
Output: u.id, (u.latest_location <->
'0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
-> Index Scan using dba_suggested_users_basic_idx on
user_search.users u (cost=1.48..2123.68 rows=80 width=36) (actual
time=175.352..734.580 rows=40263 loops=1)
Output: u.id, u.latest_location
Index Cond: ((int4range(u.search_min_age,
u.search_max_age) @> (date_part('year'::text,
age((('now'::cstring)::date)::timestamp with time zone,
('1995-03-16'::date)::timestamp with time zone)))::in
teger) AND (u.birthdate >= date(((timezone('UTC'::text, now()) -
('27years'::cstring)::interval) + '1 day'::interval))) AND (u.birthdate
<= date((timezone('UTC'::text, now()) - ('20years'::cstring)::interval))) A
ND (u.last_activity > (timezone('UTC'::text, now()) - '5 days'::interval)))
Order By: (u.latest_location <->
'0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
Filter: ((NOT (hashed SubPlan 1)) AND (u.gender =
'male'::user_search.gender) AND ((u.looking_for_gender =
'both'::user_search.gender) OR (u.looking_for_gender =
'female'::user_search.gender)) AND
((u.search_radius)::double precision >
_st_distance('0101000020E6100000DFE00B93A9425E409BE61DA7E8283E40'::geography,
(u.latest_location)::geography, '0'::double precision, true)))
Rows Removed by Filter: 55131
SubPlan 1
-> Function Scan on
user_search.select_contact_user_ids q (cost=0.25..0.76 rows=100
width=4) (actual time=0.235..0.237 rows=3 loops=1)
Output: unnest(q.q)
Function Call:
user_search.select_contact_user_ids(36012092)
-> Index Scan using relationships_user_id_other_user_id_idx
on rel_8192_60.relationships r (cost=0.56..2.79 rows=1 width=4) (actual
time=0.003..0.003 rows=1 loops=40263)
Output: r.other_user_id
Index Cond: ((r.user_id = 36012092) AND (r.other_user_id
= u.id))
Filter: ((r.state <>
'default'::rel_8192_60.relationship_state) OR (r.other_state =
'disliked'::rel_8192_60.relationship_state))
Planning time: 0.956 ms
Execution time: 866.176 ms
(20 rows)

On 03/24/2017 05:54 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
> 有完整的SQL语句吗?按照你的描述,不应该走nested loops,而应该走anti hash比较高效。
>
> -----邮件原件-----
> 发件人: zhang.wensheng [mailto:zhang(dot)wensheng(at)foxmail(dot)com]
> 发送时间: 2017年3月24日 17:42
> 收件人: pgsql-zh-general(at)postgresql(dot)org
> 抄送: 德哥 <digoal(at)126(dot)com>; 李海龙 <hailong(dot)li(at)qunar(dot)com>; 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>; held911(at)163(dot)com
> 主题: 请教集合A中快速排除集合B,两个大表JOIN的问题
>
> hi~各位老师们:
>
> 我们线上有一个这样的查询,A表数据量约7300万,B表约近千万:
>
> SELECT A.p_key
> FROM A
> WHERE NOT EXISTS
> (SELECT 1
> FROM B
> WHERE B.f_key = A.pkey)
> LIMIT 1000;
>
> A和B经过条件过滤之后结果集都很大,A在条件过滤后在几十万到上百万不等,B经过条件过滤后会在0到几十万,现在想快速从A中把所有B的结果排除掉
>
> 这是我一个测试环境的查询计划:
>
> Limit (cost=2.04..18.76 rows=1 width=12) (actual time=0.987..512.000
> rows=1000 loops=1)
> -> Nested Loop Anti Join (cost=2.04..18.76 rows=1 width=12) (actual time=0.987..511.811 rows=1000 loops=1)
> -> Index Scan using dba_users_male_idx_1 on users u
> (cost=1.48..13.74 rows=1 width=36) (actual time=0.959..396.402
> rows=38109 loops=1)
> SubPlan 1
> -> Function Scan on select_contact_user_ids q
> (cost=0.25..0.76 rows=100 width=4) (actual time=0.177..0.178 rows=3 loops=1)
> -> Index Scan using relationships_user_id_other_user_id_idx
> on relationships r (cost=0.56..2.79 rows=1 width=4) (actual
> time=0.003..0.003 rows=1 loops=38109)
> Planning time: 2.217 ms
> Execution time: 512.458 ms
>
> 在9.6中支持了bloom filter,我创建了这样的索引,但是性能远不如UNIQUE
> BTREE索引:
> CREATE INDEX ON B USING bloom(id,f_key) WHERE state != 'default' OR other_state = 'disliked';
>
> 目前500多ms的性能基本不能接受。不知道各位有什么好方法解决这个问题?
>
> ------------------
> 张文升
>
>
> ********************************************************************************************************************************
> The information in this email is confidential and may be legally privileged. If you have received this email in error or are not the intended recipient, please immediately notify the sender and delete this message from your computer. Any use, distribution, or copying of this email other than by the intended recipient is strictly prohibited. All messages sent to and from us may be monitored to ensure compliance with internal policies and to protect our business.
> Emails are not secure and cannot be guaranteed to be error free as they can be intercepted, amended, lost or destroyed, or contain viruses. Anyone who communicates with us by email is taken to accept these risks.
>
> 收发邮件者请注意:
> 本邮件含涉密信息,请保守秘密,若误收本邮件,请务必通知发送人并直接删去,不得使用、传播或复制本邮件。
> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
> ********************************************************************************************************************************

In response to

Browse pgsql-zh-general by date

  From Date Subject
Next Message zhang.wensheng 2017-03-24 11:43:02 Re: 答复: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
Previous Message zhang.wensheng 2017-03-24 10:37:21 Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题