Re: 答复: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题

From: "zhang(dot)wensheng" <zhang(dot)wensheng(at)foxmail(dot)com>
To: 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>, "pgsql-zh-general(at)postgresql(dot)org" <pgsql-zh-general(at)postgresql(dot)org>
Cc: 德哥 <digoal(at)126(dot)com>, 李海龙 <hailong(dot)li(at)qunar(dot)com>, "zhang(dot)wensheng(at)fomail(dot)com" <zhang(dot)wensheng(at)fomail(dot)com>
Subject: Re: 答复: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
Date: 2017-03-24 11:43:02
Message-ID: 88ab1cff-e397-5398-3657-8c3891214327@foxmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-zh-general

邮件内容是空的,被公司截断了吧 :-)

On 03/24/2017 07:15 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
>
> -----邮件原件-----
> 发件人: zhang.wensheng [mailto:zhang(dot)wensheng(at)foxmail(dot)com]
> 发送时间: 2017年3月24日 18:37
> 收件人: 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>; pgsql-zh-general(at)postgresql(dot)org
> 抄送: 德哥 <digoal(at)126(dot)com>; 李海龙 <hailong(dot)li(at)qunar(dot)com>; held911(at)163(dot)com; zhang(dot)wensheng(at)fomail(dot)com
> 主题: Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
>
> 里面有两个地方比较棘手:
> 1、ORDER BY latest_location::geometry <-> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC
> 如果去掉排序,上一个查询计划中的 loops=40263 就可以解决,最后得到:Execution
> time: 13.872 ms
> [汪洋] 这部分确实消耗时间,可以看出大部分时间消耗在排序上。主要是所有排完序LIMIT 1000才能应用。如果只是想在头1000行之内排序,可以先把排序拿掉,使用WITH子句。最后再排序。
> 如果必须所有排完序再拿头1000,看来这时间省不了。
>
> 2、NOT EXISTS (SELECT r.other_user_id FROM rel_8192_60.relationships r WHERE r.user_id = 36012092 AND r.other_user_id = u.id AND ( r.state != 'default' OR r.other_state = 'disliked'))
> 如果去掉这个条件,很明显就不会有loop了。
> [汪洋] 其实这个索引效率还是非常高的,只不过loop次数太多。虽然限制了1000,但还是loop了40263次。和上面一样,可以先使用WITH把结果集限制在1000后再进行这个子查询,减少loop次数。
> 关键还是排序是大问题。
>
> 但是这两个条件都不能去掉。
>
> 我在users表上的索引:
> CREATE INDEX dba_suggested_users_basic_idx ON user_search.users USING
> GIST(int4range(search_min_age,search_max_age),birthdate,latest_location,last_activity)
> WHERE status = 'default'::user_search.status AND latest_location IS NOT NULL;
> 因为有一个latest_location是postgis的geometry类型,所以必须用gist类型的索引,但是如果只单一使用一个字段的索引,查询效率非常低,所以我使用了btree_gist,并对表中的两个字段做了range类型,建了上述的联合索引。
>
>
> On 03/24/2017 06:25 PM, zhang.wensheng wrote:
>> 非常感谢!附件中也有一份。
>>
>> EXPLAIN (costs on,verbose on,analyze on,buffers off) SELECT id FROM
>> user_search.users u WHERE NOT EXISTS (SELECT r.other_user_id FROM
>> rel_8192_60.relationships r WHERE r.user_id = 36012092 AND
>> r.other_user_id = u.id AND ( r.state != 'default' OR r.other_state =
>> 'disliked')) AND status = 'default'
>> AND id NOT IN (SELECT UNNEST(q.*) FROM
>> user_search.select_contact_user_ids(36012092) AS q) AND last_activity
>>> current_timestamp at time zone 'UTC' - INTERVAL '5 day'
>> AND (looking_for_gender = 'both' OR looking_for_gender =
>> 'female'::user_search.gender) AND ('male'::user_search.gender = 'both'
>> OR gender = 'male'::user_search.gender) AND latest_location IS NOT
>> NULL
>>
>> AND
>> birthdate BETWEEN DATE(current_timestamp at time zone
>> 'UTC' - (26 + 1|| 'years')::interval + '1 day'::interval) AND
>> DATE(current_timestamp at time zone 'UTC' - (20 || 'years')::interval)
>> AND
>> int4range(search_min_age,search_max_age) @>
>> date_part('year', age('1995-03-16'::date))::int
>>
>> AND search_radius >
>> ST_Distance('0101000000DFE00B93A9425E409BE61DA7E8283E40'::geography,
>> latest_location::geography)
>> ORDER BY latest_location::geometry <->
>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC LIMIT 1000;
>>
>> ---- =============================================
>> QUERY PLAN
>> ---- =============================================
>> ----------------------------------------------------------------------
>> ---------------------------
>>
>> Limit (cost=2.04..2349.31 rows=80 width=12) (actual
>> time=227.526..865.751 rows=1000 loops=1)
>> Output: u.id, ((u.latest_location <->
>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry))
>> -> Nested Loop Anti Join (cost=2.04..2349.31 rows=80 width=12)
>> (actual time=227.524..865.585 rows=1000 loops=1)
>> Output: u.id, (u.latest_location <->
>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
>> -> Index Scan using dba_suggested_users_basic_idx on
>> user_search.users u (cost=1.48..2123.68 rows=80 width=36) (actual
>> time=175.352..734.580 rows=40263 loops=1)
>> Output: u.id, u.latest_location
>> Index Cond: ((int4range(u.search_min_age,
>> u.search_max_age) @> (date_part('year'::text,
>> age((('now'::cstring)::date)::timestamp with time zone,
>> ('1995-03-16'::date)::timestamp with time zone)))::in
>> teger) AND (u.birthdate >= date(((timezone('UTC'::text, now()) -
>> ('27years'::cstring)::interval) + '1 day'::interval))) AND
>> (u.birthdate <= date((timezone('UTC'::text, now()) -
>> ('20years'::cstring)::interval))) A
>> ND (u.last_activity > (timezone('UTC'::text, now()) - '5
>> days'::interval)))
>> Order By: (u.latest_location <->
>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
>> Filter: ((NOT (hashed SubPlan 1)) AND (u.gender =
>> 'male'::user_search.gender) AND ((u.looking_for_gender =
>> 'both'::user_search.gender) OR (u.looking_for_gender =
>> 'female'::user_search.gender)) AND
>> ((u.search_radius)::double precision >
>> _st_distance('0101000020E6100000DFE00B93A9425E409BE61DA7E8283E40'::geo
>> graphy, (u.latest_location)::geography, '0'::double precision, true)))
>> Rows Removed by Filter: 55131
>> SubPlan 1
>> -> Function Scan on
>> user_search.select_contact_user_ids q (cost=0.25..0.76 rows=100
>> width=4) (actual time=0.235..0.237 rows=3 loops=1)
>> Output: unnest(q.q)
>> Function Call:
>> user_search.select_contact_user_ids(36012092)
>> -> Index Scan using relationships_user_id_other_user_id_idx
>> on rel_8192_60.relationships r (cost=0.56..2.79 rows=1 width=4)
>> (actual time=0.003..0.003 rows=1 loops=40263)
>> Output: r.other_user_id
>> Index Cond: ((r.user_id = 36012092) AND
>> (r.other_user_id = u.id))
>> Filter: ((r.state <>
>> 'default'::rel_8192_60.relationship_state) OR (r.other_state =
>> 'disliked'::rel_8192_60.relationship_state))
>> Planning time: 0.956 ms
>> Execution time: 866.176 ms
>> (20 rows)
>>
>>
>> On 03/24/2017 05:54 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
>>> 有完整的SQL语句吗?按照你的描述,不应该走nested loops,而应该走anti
>>> hash比较高效。
>>>
>>> -----邮件原件-----
>>> 发件人: zhang.wensheng [mailto:zhang(dot)wensheng(at)foxmail(dot)com]
>>> 发送时间: 2017年3月24日 17:42
>>> 收件人: pgsql-zh-general(at)postgresql(dot)org
>>> 抄送: 德哥 <digoal(at)126(dot)com>; 李海龙 <hailong(dot)li(at)qunar(dot)com>;
>>> 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>; held911(at)163(dot)com
>>> 主题: 请教集合A中快速排除集合B,两个大表JOIN的问题
>>>
>>> hi~各位老师们:
>>>
>>> 我们线上有一个这样的查询,A表数据量约7300万,B表约近千万:
>>>
>>> SELECT A.p_key
>>> FROM A
>>> WHERE NOT EXISTS
>>> (SELECT 1
>>> FROM B
>>> WHERE B.f_key = A.pkey)
>>> LIMIT 1000;
>>>
>>> A和B经过条件过滤之后结果集都很大,A在条件过滤后在几十万到上百万不等,B经过条件过滤后会在0到几十万,现在想快速从A中把所有B的结果排除
>>> 掉
>>>
>>>
>>> 这是我一个测试环境的查询计划:
>>>
>>> Limit (cost=2.04..18.76 rows=1 width=12) (actual
>>> time=0.987..512.000
>>> rows=1000 loops=1)
>>> -> Nested Loop Anti Join (cost=2.04..18.76 rows=1 width=12)
>>> (actual time=0.987..511.811 rows=1000 loops=1)
>>> -> Index Scan using dba_users_male_idx_1 on users u
>>> (cost=1.48..13.74 rows=1 width=36) (actual time=0.959..396.402
>>> rows=38109 loops=1)
>>> SubPlan 1
>>> -> Function Scan on select_contact_user_ids q
>>> (cost=0.25..0.76 rows=100 width=4) (actual time=0.177..0.178 rows=3
>>> loops=1)
>>> -> Index Scan using
>>> relationships_user_id_other_user_id_idx
>>> on relationships r (cost=0.56..2.79 rows=1 width=4) (actual
>>> time=0.003..0.003 rows=1 loops=38109)
>>> Planning time: 2.217 ms
>>> Execution time: 512.458 ms
>>>
>>> 在9.6中支持了bloom filter,我创建了这样的索引,但是性能远不如UNIQUE
>>> BTREE索引:
>>> CREATE INDEX ON B USING bloom(id,f_key) WHERE state != 'default' OR
>>> other_state = 'disliked';
>>>
>>> 目前500多ms的性能基本不能接受。不知道各位有什么好方法解决这个问题?
>>>
>>> ------------------
>>> 张文升
>>>
>>>
>>> *********************************************************************
>>> ***********************************************************
>>>
>>> The information in this email is confidential and may be legally
>>> privileged. If you have received this email in error or are not the
>>> intended recipient, please immediately notify the sender and delete
>>> this message from your computer. Any use, distribution, or copying of
>>> this email other than by the intended recipient is strictly
>>> prohibited. All messages sent to and from us may be monitored to
>>> ensure compliance with internal policies and to protect our business.
>>> Emails are not secure and cannot be guaranteed to be error free as
>>> they can be intercepted, amended, lost or destroyed, or contain
>>> viruses. Anyone who communicates with us by email is taken to accept
>>> these risks.
>>>
>>> 收发邮件者请注意:
>>> 本邮件含涉密信息,请保守秘密,若误收本邮件,请务必通知发送人并直接删去,不得使用、传播或复制本邮件。
>>>
>>> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
>>>
>>> *********************************************************************
>>> ***********************************************************
>>>
>
> ********************************************************************************************************************************
> The information in this email is confidential and may be legally privileged. If you have received this email in error or are not the intended recipient, please immediately notify the sender and delete this message from your computer. Any use, distribution, or copying of this email other than by the intended recipient is strictly prohibited. All messages sent to and from us may be monitored to ensure compliance with internal policies and to protect our business.
> Emails are not secure and cannot be guaranteed to be error free as they can be intercepted, amended, lost or destroyed, or contain viruses. Anyone who communicates with us by email is taken to accept these risks.
>
> 收发邮件者请注意:
> 本邮件含涉密信息,请保守秘密,若误收本邮件,请务必通知发送人并直接删去,不得使用、传播或复制本邮件。
> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
> ********************************************************************************************************************************

In response to

Responses

Browse pgsql-zh-general by date

  From Date Subject
Next Message zhang.wensheng 2017-03-24 12:12:26 Re: [pgsql-zh-general] Re: 答复: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
Previous Message zhang.wensheng 2017-03-24 10:39:13 Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题