Re: [pgsql-zh-general] Re: 答复: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题

From: "zhang(dot)wensheng" <zhang(dot)wensheng(at)foxmail(dot)com>
To: 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>
Cc: "pgsql-zh-general(at)postgresql(dot)org" <pgsql-zh-general(at)postgresql(dot)org>, 德哥 <digoal(at)126(dot)com>, 李海龙 <hailong(dot)li(at)qunar(dot)com>
Subject: Re: [pgsql-zh-general] Re: 答复: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
Date: 2017-03-24 13:04:54
Message-ID: 3d2c6f45-fbef-7d59-6340-353f295db3e4@foxmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-zh-general

报告一下最新进展,刚才有一个地方没有注意到:
search_radius >
ST_Distance()这里有一个对距离作运算的地方,我也改为用point类型,这样才真正达到使用lastest_location_point上的索引的目的,性能可以提升4倍左右。
Planning time: 2.031 ms
Execution time: 228.414 ms

with我想过了,可以先1000,1000取,递归。这个我明天改改尝试一下。

非常感谢!:-)

On 03/24/2017 08:55 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
> 如果排序无法避免,那么有两种方案减少loop
> 1、with语句先取出1000,再做not exists。这样最后结果无法保证是1000行,但能保证loop 1000次
> 2、with给一定buffer,例如2000,然后再做not exists 。这样最后结果可能多于1000,多了也没关系,可以再截取一层
>
> Sent from my iPhone
> -------------------------
> AODI (All Ops Do It)
> Agile Op, Dynamic Infra
>
>> On 24 Mar 2017, at 8:12 PM, zhang.wensheng <zhang(dot)wensheng(at)foxmail(dot)com> wrote:
>>
>> sorry,看到inline回复了。
>> 是这样的,这个排序是必须的,需要取最近距离的数据出来,所以不能先limit 1000再排序。
>> 我按照德哥一篇博客中的方法,在表中增加了一列用来把geometry类型冗余了一份point类型的,排序的性能会稍微好一些,但主要的问题还是loop上了。
>>
>>
>>> On 03/24/2017 07:43 PM, zhang.wensheng wrote:
>>> 邮件内容是空的,被公司截断了吧 :-)
>>>
>>>
>>>> On 03/24/2017 07:15 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
>>>>
>>>> -----邮件原件-----
>>>> 发件人: zhang.wensheng [mailto:zhang(dot)wensheng(at)foxmail(dot)com]
>>>> 发送时间: 2017年3月24日 18:37
>>>> 收件人: 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>; pgsql-zh-general(at)postgresql(dot)org
>>>> 抄送: 德哥 <digoal(at)126(dot)com>; 李海龙 <hailong(dot)li(at)qunar(dot)com>; held911(at)163(dot)com; zhang(dot)wensheng(at)fomail(dot)com
>>>> 主题: Re: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题
>>>>
>>>> 里面有两个地方比较棘手:
>>>> 1、ORDER BY latest_location::geometry <-> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC
>>>> 如果去掉排序,上一个查询计划中的 loops=40263 就可以解决,最后得到:Execution
>>>> time: 13.872 ms
>>>> [汪洋] 这部分确实消耗时间,可以看出大部分时间消耗在排序上。主要是所有排完序LIMIT 1000才能应用。如果只是想在头1000行之内排序,可以先把排序拿掉,使用WITH子句。最后再排序。
>>>> 如果必须所有排完序再拿头1000,看来这时间省不了。
>>>>
>>>>
>>>> 2、NOT EXISTS (SELECT r.other_user_id FROM rel_8192_60.relationships r WHERE r.user_id = 36012092 AND r.other_user_id = u.id AND ( r.state != 'default' OR r.other_state = 'disliked'))
>>>> 如果去掉这个条件,很明显就不会有loop了。
>>>> [汪洋] 其实这个索引效率还是非常高的,只不过loop次数太多。虽然限制了1000,但还是loop了40263次。和上面一样,可以先使用WITH把结果集限制在1000后再进行这个子查询,减少loop次数。
>>>> 关键还是排序是大问题。
>>>>
>>>> 但是这两个条件都不能去掉。
>>>>
>>>> 我在users表上的索引:
>>>> CREATE INDEX dba_suggested_users_basic_idx ON user_search.users USING
>>>> GIST(int4range(search_min_age,search_max_age),birthdate,latest_location,last_activity)
>>>> WHERE status = 'default'::user_search.status AND latest_location IS NOT NULL;
>>>> 因为有一个latest_location是postgis的geometry类型,所以必须用gist类型的索引,但是如果只单一使用一个字段的索引,查询效率非常低,所以我使用了btree_gist,并对表中的两个字段做了range类型,建了上述的联合索引。
>>>>
>>>>
>>>>> On 03/24/2017 06:25 PM, zhang.wensheng wrote:
>>>>> 非常感谢!附件中也有一份。
>>>>>
>>>>> EXPLAIN (costs on,verbose on,analyze on,buffers off) SELECT id FROM
>>>>> user_search.users u WHERE NOT EXISTS (SELECT r.other_user_id FROM
>>>>> rel_8192_60.relationships r WHERE r.user_id = 36012092 AND
>>>>> r.other_user_id = u.id AND ( r.state != 'default' OR r.other_state =
>>>>> 'disliked')) AND status = 'default'
>>>>> AND id NOT IN (SELECT UNNEST(q.*) FROM
>>>>> user_search.select_contact_user_ids(36012092) AS q) AND last_activity
>>>>>> current_timestamp at time zone 'UTC' - INTERVAL '5 day'
>>>>> AND (looking_for_gender = 'both' OR looking_for_gender =
>>>>> 'female'::user_search.gender) AND ('male'::user_search.gender = 'both'
>>>>> OR gender = 'male'::user_search.gender) AND latest_location IS NOT
>>>>> NULL
>>>>>
>>>>> AND
>>>>> birthdate BETWEEN DATE(current_timestamp at time zone
>>>>> 'UTC' - (26 + 1|| 'years')::interval + '1 day'::interval) AND
>>>>> DATE(current_timestamp at time zone 'UTC' - (20 || 'years')::interval)
>>>>> AND
>>>>> int4range(search_min_age,search_max_age) @>
>>>>> date_part('year', age('1995-03-16'::date))::int
>>>>>
>>>>> AND search_radius >
>>>>> ST_Distance('0101000000DFE00B93A9425E409BE61DA7E8283E40'::geography,
>>>>> latest_location::geography)
>>>>> ORDER BY latest_location::geometry <->
>>>>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry ASC LIMIT 1000;
>>>>>
>>>>> ---- =============================================
>>>>> QUERY PLAN
>>>>> ---- =============================================
>>>>> ----------------------------------------------------------------------
>>>>> ---------------------------
>>>>>
>>>>> Limit (cost=2.04..2349.31 rows=80 width=12) (actual
>>>>> time=227.526..865.751 rows=1000 loops=1)
>>>>> Output: u.id, ((u.latest_location <->
>>>>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry))
>>>>> -> Nested Loop Anti Join (cost=2.04..2349.31 rows=80 width=12)
>>>>> (actual time=227.524..865.585 rows=1000 loops=1)
>>>>> Output: u.id, (u.latest_location <->
>>>>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
>>>>> -> Index Scan using dba_suggested_users_basic_idx on
>>>>> user_search.users u (cost=1.48..2123.68 rows=80 width=36) (actual
>>>>> time=175.352..734.580 rows=40263 loops=1)
>>>>> Output: u.id, u.latest_location
>>>>> Index Cond: ((int4range(u.search_min_age,
>>>>> u.search_max_age) @> (date_part('year'::text,
>>>>> age((('now'::cstring)::date)::timestamp with time zone,
>>>>> ('1995-03-16'::date)::timestamp with time zone)))::in
>>>>> teger) AND (u.birthdate >= date(((timezone('UTC'::text, now()) -
>>>>> ('27years'::cstring)::interval) + '1 day'::interval))) AND
>>>>> (u.birthdate <= date((timezone('UTC'::text, now()) -
>>>>> ('20years'::cstring)::interval))) A
>>>>> ND (u.last_activity > (timezone('UTC'::text, now()) - '5
>>>>> days'::interval)))
>>>>> Order By: (u.latest_location <->
>>>>> '0101000000DFE00B93A9425E409BE61DA7E8283E40'::geometry)
>>>>> Filter: ((NOT (hashed SubPlan 1)) AND (u.gender =
>>>>> 'male'::user_search.gender) AND ((u.looking_for_gender =
>>>>> 'both'::user_search.gender) OR (u.looking_for_gender =
>>>>> 'female'::user_search.gender)) AND
>>>>> ((u.search_radius)::double precision >
>>>>> _st_distance('0101000020E6100000DFE00B93A9425E409BE61DA7E8283E40'::geo
>>>>> graphy, (u.latest_location)::geography, '0'::double precision, true)))
>>>>> Rows Removed by Filter: 55131
>>>>> SubPlan 1
>>>>> -> Function Scan on
>>>>> user_search.select_contact_user_ids q (cost=0.25..0.76 rows=100
>>>>> width=4) (actual time=0.235..0.237 rows=3 loops=1)
>>>>> Output: unnest(q.q)
>>>>> Function Call:
>>>>> user_search.select_contact_user_ids(36012092)
>>>>> -> Index Scan using relationships_user_id_other_user_id_idx
>>>>> on rel_8192_60.relationships r (cost=0.56..2.79 rows=1 width=4)
>>>>> (actual time=0.003..0.003 rows=1 loops=40263)
>>>>> Output: r.other_user_id
>>>>> Index Cond: ((r.user_id = 36012092) AND
>>>>> (r.other_user_id = u.id))
>>>>> Filter: ((r.state <>
>>>>> 'default'::rel_8192_60.relationship_state) OR (r.other_state =
>>>>> 'disliked'::rel_8192_60.relationship_state))
>>>>> Planning time: 0.956 ms
>>>>> Execution time: 866.176 ms
>>>>> (20 rows)
>>>>>
>>>>>
>>>>>> On 03/24/2017 05:54 PM, 汪洋(平安科技数据库技术支持部经理室) wrote:
>>>>>> 有完整的SQL语句吗?按照你的描述,不应该走nested loops,而应该走anti
>>>>>> hash比较高效。
>>>>>>
>>>>>> -----邮件原件-----
>>>>>> 发件人: zhang.wensheng [mailto:zhang(dot)wensheng(at)foxmail(dot)com]
>>>>>> 发送时间: 2017年3月24日 17:42
>>>>>> 收件人: pgsql-zh-general(at)postgresql(dot)org
>>>>>> 抄送: 德哥 <digoal(at)126(dot)com>; 李海龙 <hailong(dot)li(at)qunar(dot)com>;
>>>>>> 汪洋(平安科技数据库技术支持部经理室) <WANGYANG102(at)pingan(dot)com(dot)cn>; held911(at)163(dot)com
>>>>>> 主题: 请教集合A中快速排除集合B,两个大表JOIN的问题
>>>>>>
>>>>>> hi~各位老师们:
>>>>>>
>>>>>> 我们线上有一个这样的查询,A表数据量约7300万,B表约近千万:
>>>>>>
>>>>>> SELECT A.p_key
>>>>>> FROM A
>>>>>> WHERE NOT EXISTS
>>>>>> (SELECT 1
>>>>>> FROM B
>>>>>> WHERE B.f_key = A.pkey)
>>>>>> LIMIT 1000;
>>>>>>
>>>>>> A和B经过条件过滤之后结果集都很大,A在条件过滤后在几十万到上百万不等,B经过条件过滤后会在0到几十万,现在想快速从A中把所有B的结果排除
>>>>>> 掉
>>>>>>
>>>>>>
>>>>>> 这是我一个测试环境的查询计划:
>>>>>>
>>>>>> Limit (cost=2.04..18.76 rows=1 width=12) (actual
>>>>>> time=0.987..512.000
>>>>>> rows=1000 loops=1)
>>>>>> -> Nested Loop Anti Join (cost=2.04..18.76 rows=1 width=12)
>>>>>> (actual time=0.987..511.811 rows=1000 loops=1)
>>>>>> -> Index Scan using dba_users_male_idx_1 on users u
>>>>>> (cost=1.48..13.74 rows=1 width=36) (actual time=0.959..396.402
>>>>>> rows=38109 loops=1)
>>>>>> SubPlan 1
>>>>>> -> Function Scan on select_contact_user_ids q
>>>>>> (cost=0.25..0.76 rows=100 width=4) (actual time=0.177..0.178 rows=3
>>>>>> loops=1)
>>>>>> -> Index Scan using
>>>>>> relationships_user_id_other_user_id_idx
>>>>>> on relationships r (cost=0.56..2.79 rows=1 width=4) (actual
>>>>>> time=0.003..0.003 rows=1 loops=38109)
>>>>>> Planning time: 2.217 ms
>>>>>> Execution time: 512.458 ms
>>>>>>
>>>>>> 在9.6中支持了bloom filter,我创建了这样的索引,但是性能远不如UNIQUE
>>>>>> BTREE索引:
>>>>>> CREATE INDEX ON B USING bloom(id,f_key) WHERE state != 'default' OR
>>>>>> other_state = 'disliked';
>>>>>>
>>>>>> 目前500多ms的性能基本不能接受。不知道各位有什么好方法解决这个问题?
>>>>>>
>>>>>> ------------------
>>>>>> 张文升
>>>>>>
>>>>>>
>>>>>> *********************************************************************
>>>>>> ***********************************************************
>>>>>>
>>>>>> The information in this email is confidential and may be legally
>>>>>> privileged. If you have received this email in error or are not the
>>>>>> intended recipient, please immediately notify the sender and delete
>>>>>> this message from your computer. Any use, distribution, or copying of
>>>>>> this email other than by the intended recipient is strictly
>>>>>> prohibited. All messages sent to and from us may be monitored to
>>>>>> ensure compliance with internal policies and to protect our business.
>>>>>> Emails are not secure and cannot be guaranteed to be error free as
>>>>>> they can be intercepted, amended, lost or destroyed, or contain
>>>>>> viruses. Anyone who communicates with us by email is taken to accept
>>>>>> these risks.
>>>>>>
>>>>>> 收发邮件者请注意:
>>>>>> 本邮件含涉密信息,请保守秘密,若误收本邮件,请务必通知发送人并直接删去,不得使用、传播或复制本邮件。
>>>>>>
>>>>>> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
>>>>>>
>>>>>> *********************************************************************
>>>>>> ***********************************************************
>>>>>>
>>>> ********************************************************************************************************************************
>>>> The information in this email is confidential and may be legally privileged. If you have received this email in error or are not the intended recipient, please immediately notify the sender and delete this message from your computer. Any use, distribution, or copying of this email other than by the intended recipient is strictly prohibited. All messages sent to and from us may be monitored to ensure compliance with internal policies and to protect our business.
>>>> Emails are not secure and cannot be guaranteed to be error free as they can be intercepted, amended, lost or destroyed, or contain viruses. Anyone who communicates with us by email is taken to accept these risks.
>>>>
>>>> 收发邮件者请注意:
>>>> 本邮件含涉密信息,请保守秘密,若误收本邮件,请务必通知发送人并直接删去,不得使用、传播或复制本邮件。
>>>> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
>>>> ********************************************************************************************************************************
>>>
>>>
>>>
>>>
> ********************************************************************************************************************************
> The information in this email is confidential and may be legally privileged. If you have received this email in error or are not the intended recipient, please immediately notify the sender and delete this message from your computer. Any use, distribution, or copying of this email other than by the intended recipient is strictly prohibited. All messages sent to and from us may be monitored to ensure compliance with internal policies and to protect our business.
> Emails are not secure and cannot be guaranteed to be error free as they can be intercepted, amended, lost or destroyed, or contain viruses. Anyone who communicates with us by email is taken to accept these risks.
>
> 收发邮件者请注意:
> 本邮件含涉密信息,请保守秘密,若误收本邮件,请务必通知发送人并直接删去,不得使用、传播或复制本邮件。
> 进出邮件均受到本公司合规监控。邮件可能发生被截留、被修改、丢失、被破坏或包含计算机病毒等不安全情况。
> ********************************************************************************************************************************

In response to

Browse pgsql-zh-general by date

  From Date Subject
Next Message tao tony 2017-05-09 01:37:12 slow query on multiple table join
Previous Message zhang.wensheng 2017-03-24 12:12:26 Re: [pgsql-zh-general] Re: 答复: 答复: 请教集合A中快速排除集合B,两个大表JOIN的问题