Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题

From: "zhang(dot)wensheng" <zhang(dot)wensheng(at)foxmail(dot)com>
To: 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>, "pgsql-zh-general(at)postgresql(dot)org" <pgsql-zh-general(at)postgresql(dot)org>
Cc: 德哥 <digoal(at)126(dot)com>, 李海龙 <hailong(dot)li(at)qunar(dot)com>, "held911(at)163(dot)com" <held911(at)163(dot)com>, zhang(dot)wensheng(at)fomail(dot)com
Subject: Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
Date: 2017-03-24 10:37:21
Message-ID: 5dfdcc81-17d4-8d6f-b56b-71c90f338288@foxmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-zh-general

里面有两个地方比较棘手:
1、ORDER BY latest_location::geometry <->
'0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC
如果去掉排序,上一个查询计划中的 loops=40263 就可以解决,最后得到:Execution
time: 13.872 ms

2、NOT EXISTS (SELECT r.other_user_id FROM rel_8192_60.relationships r
WHERE r.user_id = 36012092 AND r.other_user_id = u.id AND ( r.state !=
'default' OR r.other_state = 'disliked'))
如果去掉这个条件,很明显就不会有loop了。

但是这两个条件都不能去掉。

我在users表上的索引:
CREATE INDEX dba_suggested_users_basic_idx ON user_search.users USING
GIST(int4range(search_min_age,search_max_age),birthdate,latest_location,last_activity)
WHERE status = 'default'::user_search.status AND latest_location IS NOT
NULL;
因为有一个latest_location是postgis的geometry类型,所以必须用gist类型的索引,但是如果只单一使用一个字段的索引,查询效率非常低,所以我使用了btree_gist,并对表中的两个字段做了range类型,建了上述的联合索引。

On 03/24/2017 06:25 PM, zhang.wensheng wrote:
> 非常感谢!附件中也有一份。
>
> EXPLAIN (costs on,verbose on,analyze on,buffers off)
> SELECT id FROM user_search.users u
> WHERE
> NOT EXISTS (SELECT r.other_user_id FROM rel_8192_60.relationships r
> WHERE r.user_id = 36012092 AND r.other_user_id = u.id AND ( r.state !=
> 'default' OR r.other_state = 'disliked'))
> AND status = 'default'
> AND id NOT IN (SELECT UNNEST(q.*) FROM
> user_search.select_contact_user_ids(36012092) AS q)
> AND last_activity > current_timestamp at time zone 'UTC' - INTERVAL '5
> day'
> AND (looking_for_gender = 'both' OR looking_for_gender =
> 'female'::user_search.gender) AND ('male'::user_search.gender = 'both'
> OR gender = 'male'::user_search.gender)
> AND latest_location IS NOT NULL
>
> AND
> birthdate BETWEEN DATE(current_timestamp at time zone
> 'UTC' - (26 + 1|| 'years')::interval + '1 day'::interval) AND
> DATE(current_timestamp at time zone 'UTC' - (20 || 'years')::interval)
> AND
> int4range(search_min_age,search_max_age) @>
> date_part('year', age('1995-03-16'::date))::int
>
> AND search_radius >
> ST_Distance('0101000000DFE00B93A9425E409BE61DA7E8283E40'::geography,
> latest_location::geography)
> ORDER BY latest_location::geometry <->
> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC
> LIMIT 1000;
>
> ---- =============================================
> QUERY PLAN
> ---- =============================================
> -------------------------------------------------------------------------------------------------
>
> Limit (cost=2.04..2349.31 rows=80 width=12) (actual
> time=227.526..865.751 rows=1000 loops=1)
> Output: u.id, ((u.latest_location <->
> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry))
> -> Nested Loop Anti Join (cost=2.04..2349.31 rows=80 width=12)
> (actual time=227.524..865.585 rows=1000 loops=1)
> Output: u.id, (u.latest_location <->
> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
> -> Index Scan using dba_suggested_users_basic_idx on
> user_search.users u (cost=1.48..2123.68 rows=80 width=36) (actual
> time=175.352..734.580 rows=40263 loops=1)
> Output: u.id, u.latest_location
> Index Cond: ((int4range(u.search_min_age,
> u.search_max_age) @> (date_part('year'::text,
> age((('now'::cstring)::date)::timestamp with time zone,
> ('1995-03-16'::date)::timestamp with time zone)))::in
> teger) AND (u.birthdate >= date(((timezone('UTC'::text, now()) -
> ('27years'::cstring)::interval) + '1 day'::interval))) AND
> (u.birthdate <= date((timezone('UTC'::text, now()) -
> ('20years'::cstring)::interval))) A
> ND (u.last_activity > (timezone('UTC'::text, now()) - '5
> days'::interval)))
> Order By: (u.latest_location <->
> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
> Filter: ((NOT (hashed SubPlan 1)) AND (u.gender =
> 'male'::user_search.gender) AND ((u.looking_for_gender =
> 'both'::user_search.gender) OR (u.looking_for_gender =
> 'female'::user_search.gender)) AND
> ((u.search_radius)::double precision >
> _st_distance('0101000020E6100000DFE00B93A9425E409BE61DA7E8283E40'::geography,
> (u.latest_location)::geography, '0'::double precision, true)))
> Rows Removed by Filter: 55131
> SubPlan 1
> -> Function Scan on
> user_search.select_contact_user_ids q (cost=0.25..0.76 rows=100
> width=4) (actual time=0.235..0.237 rows=3 loops=1)
> Output: unnest(q.q)
> Function Call:
> user_search.select_contact_user_ids(36012092)
> -> Index Scan using relationships_user_id_other_user_id_idx
> on rel_8192_60.relationships r (cost=0.56..2.79 rows=1 width=4)
> (actual time=0.003..0.003 rows=1 loops=40263)
> Output: r.other_user_id
> Index Cond: ((r.user_id = 36012092) AND
> (r.other_user_id = u.id))
> Filter: ((r.state <>
> 'default'::rel_8192_60.relationship_state) OR (r.other_state =
> 'disliked'::rel_8192_60.relationship_state))
> Planning time: 0.956 ms
> Execution time: 866.176 ms
> (20 rows)
>
>
> On 03/24/2017 05:54 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
>> 有完整的SQL语句吗?按照你的描述,不应该走nested loops,而应该走anti
>> hash比较高效。
>>
>> -----邮件原件-----
>> 发件人: zhang.wensheng [mailto:zhang(dot)wensheng(at)foxmail(dot)com]
>> 发送时间: 2017年3月24日 17:42
>> 收件人: pgsql-zh-general(at)postgresql(dot)org
>> 抄送: 德哥 <digoal(at)126(dot)com>; 李海龙 <hailong(dot)li(at)qunar(dot)com>;
>> 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>;
>> held911(at)163(dot)com
>> 主题: 请教集合A中快速排除集合B,两个大表JOIN的问题
>>
>> hi~各位老师们:
>>
>> 我们线上有一个这样的查询,A表数据量约7300万,B表约近千万:
>>
>> SELECT A.p_key
>> FROM A
>> WHERE NOT EXISTS
>> (SELECT 1
>> FROM B
>> WHERE B.f_key = A.pkey)
>> LIMIT 1000;
>>
>> A和B经过条件过滤之后结果集都很大,A在条件过滤后在几十万到上百万不等,B经过条件过滤后会在0到几十万,现在想快速从A中把所有B的结果排除掉
>>
>>
>> 这是我一个测试环境的查询计划:
>>
>> Limit (cost=2.04..18.76 rows=1 width=12) (actual time=0.987..512.000
>> rows=1000 loops=1)
>> -> Nested Loop Anti Join (cost=2.04..18.76 rows=1 width=12)
>> (actual time=0.987..511.811 rows=1000 loops=1)
>> -> Index Scan using dba_users_male_idx_1 on users u
>> (cost=1.48..13.74 rows=1 width=36) (actual time=0.959..396.402
>> rows=38109 loops=1)
>> SubPlan 1
>> -> Function Scan on select_contact_user_ids q
>> (cost=0.25..0.76 rows=100 width=4) (actual time=0.177..0.178 rows=3
>> loops=1)
>> -> Index Scan using relationships_user_id_other_user_id_idx
>> on relationships r (cost=0.56..2.79 rows=1 width=4) (actual
>> time=0.003..0.003 rows=1 loops=38109)
>> Planning time: 2.217 ms
>> Execution time: 512.458 ms
>>
>> 在9.6中支持了bloom filter,我创建了这样的索引,但是性能远不如UNIQUE
>> BTREE索引:
>> CREATE INDEX ON B USING bloom(id,f_key) WHERE state != 'default' OR
>> other_state = 'disliked';
>>
>> 目前500多ms的性能基本不能接受。不知道各位有什么好方法解决这个问题?
>>
>> ------------------
>> 张文升
>>
>>
>> ********************************************************************************************************************************
>>
>> The information in this email is confidential and may be legally
>> privileged. If you have received this email in error or are not the
>> intended recipient, please immediately notify the sender and delete
>> this message from your computer. Any use, distribution, or copying of
>> this email other than by the intended recipient is strictly
>> prohibited. All messages sent to and from us may be monitored to
>> ensure compliance with internal policies and to protect our business.
>> Emails are not secure and cannot be guaranteed to be error free as
>> they can be intercepted, amended, lost or destroyed, or contain
>> viruses. Anyone who communicates with us by email is taken to accept
>> these risks.
>>
>> 收发邮件者请注意:
>> 本邮件含涉密信息,请保守秘密,若误收本邮件,请务必通知发送人并直接删去,不得使用、传播或复制本邮件。
>>
>> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
>>
>> ********************************************************************************************************************************
>>
>

In response to

Responses

Browse pgsql-zh-general by date

  From Date Subject
Next Message zhang.wensheng 2017-03-24 10:39:13 Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
Previous Message zhang.wensheng 2017-03-24 10:25:02 Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题