Quick Links

Re: [pgsql-zh-general] Re: 答复: 答复: 请教集合A中快速排除集合B，两个大表JOIN的问题

From:	"zhang(dot)wensheng" <zhang(dot)wensheng(at)foxmail(dot)com>
To:	汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>, "pgsql-zh-general(at)postgresql(dot)org" <pgsql-zh-general(at)postgresql(dot)org>
Cc:	德哥 <digoal(at)126(dot)com>, 李海龙 <hailong(dot)li(at)qunar(dot)com>, "zhang(dot)wensheng(at)fomail(dot)com" <zhang(dot)wensheng(at)fomail(dot)com>
Subject:	Re: [pgsql-zh-general] Re: 答复: 答复: 请教集合A中快速排除集合B，两个大表JOIN的问题
Date:	2017-03-24 12:12:26
Message-ID:	af39cc83-52de-79ca-6bf7-1fcdb879d38f@foxmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-zh-general

sorry，看到inline回复了。
是这样的，这个排序是必须的，需要取最近距离的数据出来，所以不能先limit
1000再排序。
我按照德哥一篇博客中的方法，在表中增加了一列用来把geometry类型冗余了一份point类型的，排序的性能会稍微好一些，但主要的问题还是loop上了。

On 03/24/2017 07:43 PM, zhang.wensheng wrote:
> 邮件内容是空的，被公司截断了吧 :-)
>
>
> On 03/24/2017 07:15 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
>>
>> -----邮件原件-----
>> 发件人: zhang.wensheng [mailto:zhang(dot)wensheng(at)foxmail(dot)com]
>> 发送时间: 2017年3月24日 18:37
>> 收件人: 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>;
>> pgsql-zh-general(at)postgresql(dot)org
>> 抄送: 德哥 <digoal(at)126(dot)com>; 李海龙 <hailong(dot)li(at)qunar(dot)com>;
>> held911(at)163(dot)com; zhang(dot)wensheng(at)fomail(dot)com
>> 主题: Re: 答复: 请教集合A中快速排除集合B，两个大表JOIN的问题
>>
>> 里面有两个地方比较棘手：
>> 1、ORDER BY latest_location::geometry <->
>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC
>> 如果去掉排序，上一个查询计划中的 loops=40263
>> 就可以解决，最后得到：Execution
>> time: 13.872 ms
>> [汪洋]
>> 这部分确实消耗时间，可以看出大部分时间消耗在排序上。主要是所有排完序LIMIT
>> 1000才能应用。如果只是想在头1000行之内排序，可以先把排序拿掉，使用WITH子句。最后再排序。
>>
>> 如果必须所有排完序再拿头1000，看来这时间省不了。
>>
>>
>> 2、NOT EXISTS (SELECT r.other_user_id FROM rel_8192_60.relationships r
>> WHERE r.user_id = 36012092 AND r.other_user_id = u.id AND ( r.state
>> != 'default' OR r.other_state = 'disliked'))
>> 如果去掉这个条件，很明显就不会有loop了。
>> [汪洋]
>> 其实这个索引效率还是非常高的，只不过loop次数太多。虽然限制了1000，但还是loop了40263次。和上面一样，可以先使用WITH把结果集限制在1000后再进行这个子查询，减少loop次数。
>> 关键还是排序是大问题。
>>
>> 但是这两个条件都不能去掉。
>>
>> 我在users表上的索引：
>> CREATE INDEX dba_suggested_users_basic_idx ON user_search.users USING
>> GIST(int4range(search_min_age,search_max_age),birthdate,latest_location,last_activity)
>>
>> WHERE status = 'default'::user_search.status AND latest_location IS
>> NOT NULL;
>> 因为有一个latest_location是postgis的geometry类型，所以必须用gist类型的索引，但是如果只单一使用一个字段的索引，查询效率非常低，所以我使用了btree_gist，并对表中的两个字段做了range类型，建了上述的联合索引。
>>
>>
>>
>> On 03/24/2017 06:25 PM, zhang.wensheng wrote:
>>> 非常感谢！附件中也有一份。
>>>
>>> EXPLAIN (costs on,verbose on,analyze on,buffers off) SELECT id FROM
>>> user_search.users u WHERE NOT EXISTS (SELECT r.other_user_id FROM
>>> rel_8192_60.relationships r WHERE r.user_id = 36012092 AND
>>> r.other_user_id = u.id AND ( r.state != 'default' OR r.other_state =
>>> 'disliked')) AND status = 'default'
>>> AND id NOT IN (SELECT UNNEST(q.*) FROM
>>> user_search.select_contact_user_ids(36012092) AS q) AND last_activity
>>>> current_timestamp at time zone 'UTC' - INTERVAL '5 day'
>>> AND (looking_for_gender = 'both' OR looking_for_gender =
>>> 'female'::user_search.gender) AND ('male'::user_search.gender = 'both'
>>> OR gender = 'male'::user_search.gender) AND latest_location IS NOT
>>> NULL
>>>
>>> AND
>>> birthdate BETWEEN DATE(current_timestamp at time zone
>>> 'UTC' - (26 + 1|| 'years')::interval + '1 day'::interval) AND
>>> DATE(current_timestamp at time zone 'UTC' - (20 || 'years')::interval)
>>> AND
>>> int4range(search_min_age,search_max_age) @>
>>> date_part('year', age('1995-03-16'::date))::int
>>>
>>> AND search_radius >
>>> ST_Distance('0101000000DFE00B93A9425E409BE61DA7E8283E40'::geography,
>>> latest_location::geography)
>>> ORDER BY latest_location::geometry <->
>>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC LIMIT 1000;
>>>
>>> ---- =============================================
>>> QUERY PLAN
>>> ---- =============================================
>>> ----------------------------------------------------------------------
>>> ---------------------------
>>>
>>> Limit (cost=2.04..2349.31 rows=80 width=12) (actual
>>> time=227.526..865.751 rows=1000 loops=1)
>>> Output: u.id, ((u.latest_location <->
>>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry))
>>> -> Nested Loop Anti Join (cost=2.04..2349.31 rows=80 width=12)
>>> (actual time=227.524..865.585 rows=1000 loops=1)
>>> Output: u.id, (u.latest_location <->
>>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
>>> -> Index Scan using dba_suggested_users_basic_idx on
>>> user_search.users u (cost=1.48..2123.68 rows=80 width=36) (actual
>>> time=175.352..734.580 rows=40263 loops=1)
>>> Output: u.id, u.latest_location
>>> Index Cond: ((int4range(u.search_min_age,
>>> u.search_max_age) @> (date_part('year'::text,
>>> age((('now'::cstring)::date)::timestamp with time zone,
>>> ('1995-03-16'::date)::timestamp with time zone)))::in
>>> teger) AND (u.birthdate >= date(((timezone('UTC'::text, now()) -
>>> ('27years'::cstring)::interval) + '1 day'::interval))) AND
>>> (u.birthdate <= date((timezone('UTC'::text, now()) -
>>> ('20years'::cstring)::interval))) A
>>> ND (u.last_activity > (timezone('UTC'::text, now()) - '5
>>> days'::interval)))
>>> Order By: (u.latest_location <->
>>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
>>> Filter: ((NOT (hashed SubPlan 1)) AND (u.gender =
>>> 'male'::user_search.gender) AND ((u.looking_for_gender =
>>> 'both'::user_search.gender) OR (u.looking_for_gender =
>>> 'female'::user_search.gender)) AND
>>> ((u.search_radius)::double precision >
>>> _st_distance('0101000020E6100000DFE00B93A9425E409BE61DA7E8283E40'::geo
>>> graphy, (u.latest_location)::geography, '0'::double precision, true)))
>>> Rows Removed by Filter: 55131
>>> SubPlan 1
>>> -> Function Scan on
>>> user_search.select_contact_user_ids q (cost=0.25..0.76 rows=100
>>> width=4) (actual time=0.235..0.237 rows=3 loops=1)
>>> Output: unnest(q.q)
>>> Function Call:
>>> user_search.select_contact_user_ids(36012092)
>>> -> Index Scan using relationships_user_id_other_user_id_idx
>>> on rel_8192_60.relationships r (cost=0.56..2.79 rows=1 width=4)
>>> (actual time=0.003..0.003 rows=1 loops=40263)
>>> Output: r.other_user_id
>>> Index Cond: ((r.user_id = 36012092) AND
>>> (r.other_user_id = u.id))
>>> Filter: ((r.state <>
>>> 'default'::rel_8192_60.relationship_state) OR (r.other_state =
>>> 'disliked'::rel_8192_60.relationship_state))
>>> Planning time: 0.956 ms
>>> Execution time: 866.176 ms
>>> (20 rows)
>>>
>>>
>>> On 03/24/2017 05:54 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
>>>> 有完整的SQL语句吗？按照你的描述，不应该走nested loops，而应该走anti
>>>> hash比较高效。
>>>>
>>>> -----邮件原件-----
>>>> 发件人: zhang.wensheng [mailto:zhang(dot)wensheng(at)foxmail(dot)com]
>>>> 发送时间: 2017年3月24日 17:42
>>>> 收件人: pgsql-zh-general(at)postgresql(dot)org
>>>> 抄送: 德哥 <digoal(at)126(dot)com>; 李海龙 <hailong(dot)li(at)qunar(dot)com>;
>>>> 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>;
>>>> held911(at)163(dot)com
>>>> 主题: 请教集合A中快速排除集合B，两个大表JOIN的问题
>>>>
>>>> hi~各位老师们：
>>>>
>>>> 我们线上有一个这样的查询,A表数据量约7300万，B表约近千万：
>>>>
>>>> SELECT A.p_key
>>>> FROM A
>>>> WHERE NOT EXISTS
>>>> (SELECT 1
>>>> FROM B
>>>> WHERE B.f_key = A.pkey)
>>>> LIMIT 1000;
>>>>
>>>> A和B经过条件过滤之后结果集都很大，A在条件过滤后在几十万到上百万不等，B经过条件过滤后会在0到几十万，现在想快速从A中把所有B的结果排除
>>>>
>>>> 掉
>>>>
>>>>
>>>> 这是我一个测试环境的查询计划：
>>>>
>>>> Limit (cost=2.04..18.76 rows=1 width=12) (actual
>>>> time=0.987..512.000
>>>> rows=1000 loops=1)
>>>> -> Nested Loop Anti Join (cost=2.04..18.76 rows=1 width=12)
>>>> (actual time=0.987..511.811 rows=1000 loops=1)
>>>> -> Index Scan using dba_users_male_idx_1 on users u
>>>> (cost=1.48..13.74 rows=1 width=36) (actual time=0.959..396.402
>>>> rows=38109 loops=1)
>>>> SubPlan 1
>>>> -> Function Scan on select_contact_user_ids q
>>>> (cost=0.25..0.76 rows=100 width=4) (actual time=0.177..0.178 rows=3
>>>> loops=1)
>>>> -> Index Scan using
>>>> relationships_user_id_other_user_id_idx
>>>> on relationships r (cost=0.56..2.79 rows=1 width=4) (actual
>>>> time=0.003..0.003 rows=1 loops=38109)
>>>> Planning time: 2.217 ms
>>>> Execution time: 512.458 ms
>>>>
>>>> 在9.6中支持了bloom filter，我创建了这样的索引，但是性能远不如UNIQUE
>>>> BTREE索引：
>>>> CREATE INDEX ON B USING bloom(id,f_key) WHERE state != 'default' OR
>>>> other_state = 'disliked';
>>>>
>>>> 目前500多ms的性能基本不能接受。不知道各位有什么好方法解决这个问题？
>>>>
>>>> ------------------
>>>> 张文升
>>>>
>>>>
>>>> *********************************************************************
>>>> ***********************************************************
>>>>
>>>> The information in this email is confidential and may be legally
>>>> privileged. If you have received this email in error or are not the
>>>> intended recipient, please immediately notify the sender and delete
>>>> this message from your computer. Any use, distribution, or copying of
>>>> this email other than by the intended recipient is strictly
>>>> prohibited. All messages sent to and from us may be monitored to
>>>> ensure compliance with internal policies and to protect our business.
>>>> Emails are not secure and cannot be guaranteed to be error free as
>>>> they can be intercepted, amended, lost or destroyed, or contain
>>>> viruses. Anyone who communicates with us by email is taken to accept
>>>> these risks.
>>>>
>>>> 收发邮件者请注意：
>>>> 本邮件含涉密信息，请保守秘密，若误收本邮件，请务必通知发送人并直接删去，不得使用、传播或复制本邮件。
>>>>
>>>>
>>>> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
>>>>
>>>>
>>>> *********************************************************************
>>>> ***********************************************************
>>>>
>>
>> ********************************************************************************************************************************
>>
>> The information in this email is confidential and may be legally
>> privileged. If you have received this email in error or are not the
>> intended recipient, please immediately notify the sender and delete
>> this message from your computer. Any use, distribution, or copying of
>> this email other than by the intended recipient is strictly
>> prohibited. All messages sent to and from us may be monitored to
>> ensure compliance with internal policies and to protect our business.
>> Emails are not secure and cannot be guaranteed to be error free as
>> they can be intercepted, amended, lost or destroyed, or contain
>> viruses. Anyone who communicates with us by email is taken to accept
>> these risks.
>>
>> 收发邮件者请注意：
>> 本邮件含涉密信息，请保守秘密，若误收本邮件，请务必通知发送人并直接删去，不得使用、传播或复制本邮件。
>>
>> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
>>
>> ********************************************************************************************************************************
>>
>
>
>
>
>

In response to

Re: 答复: 答复: 请教集合A中快速排除集合B，两个大表JOIN的问题 at 2017-03-24 11:43:02 from zhang.wensheng

Responses

Re: [pgsql-zh-general] Re: 答复: 答复: 请教集合A中快速排除集合B，两个大表JOIN的问题 at 2017-03-24 13:04:54 from zhang.wensheng

Browse pgsql-zh-general by date

	From	Date	Subject
Next Message	zhang.wensheng	2017-03-24 13:04:54	Re: [pgsql-zh-general] Re: 答复: 答复: 请教集合A中快速排除集合B，两个大表JOIN的问题
Previous Message	zhang.wensheng	2017-03-24 11:43:02	Re: 答复: 答复: 请教集合A中快速排除集合B，两个大表JOIN的问题