Re: Specific query taking time to process

From: Fahiz Mohamed <fahiz(at)netwidz(dot)com>
To: Michael Lewis <mlewis(at)entrata(dot)com>
Cc: Pgsql Performance <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: Specific query taking time to process
Date: 2019-12-11 19:53:57
Message-ID: 7cb455c5-dd6a-4f33-bdbe-92033f7986e2@Spark

There is a slight difference in the data between the two instances. Instance 1 contains the latest data, and instance 2 consists of data that is about 3 weeks older than instance 1.

Following are the row counts for each table in both instances:

Instance 1
alf_node : 96493129 rows
alf_node_properties : 455599288 rows
alf_node_aspects : 150153006 rows

Instance 2
alf_node : 90831396 rows
alf_node_properties : 440792247 rows
alf_node_aspects : 146648241 rows

I am not sure whether the above difference in data could account for such a drastic difference in performance. Please correct me if I am wrong.

I checked "pg_stat_user_tables" and autovacuum never run against those tables.

I executed VACUUM manually and noticed the following in the output:

"INFO: vacuuming "public.alf_node_aspects"
INFO: "alf_node_aspects": found 0 removable, 150264654 nonremovable row versions in 812242 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU 13.53s/33.35u sec elapsed 77.88 sec.

INFO: analyzing "public.alf_node_aspects"
INFO: "alf_node_aspects": scanned 30000 of 812242 pages, containing 5550000 live rows and 0 dead rows; 30000 rows in sample, 150264770 estimated total rows"

Did the vacuum work fine?
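
(For reference, that output is what a verbose vacuum with analyze produces, i.e. something along these lines:)

VACUUM VERBOSE ANALYZE alf_node_aspects;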

I did "explain( analyze, buffers )" below are the results:

Instance 1:

 Nested Loop Semi Join (cost=441133.07..1352496.76 rows=1 width=8) (actual time=9054.324..10029.748 rows=22 loops=1)
   Buffers: shared hit=1617812
   ->  Merge Semi Join (cost=441132.50..1352359.19 rows=41 width=16) (actual time=9054.296..10029.547 rows=22 loops=1)
         Merge Cond: (node.id = prop.node_id)
         Buffers: shared hit=1617701
         ->  Index Only Scan using idx_alf_node_tqn on alf_node node (cost=441041.93..1340831.48 rows=4593371 width=8) (actual time=4.418..7594.637 rows=40998148 loops=1)
               Index Cond: ((type_qname_id = 240) AND (store_id = 6))
               Filter: (NOT (hashed SubPlan 1))
               Rows Removed by Filter: 130
               Heap Fetches: 0
               Buffers: shared hit=1617681
               SubPlan 1
                 ->  Bitmap Heap Scan on alf_node_aspects aspect_1 (cost=3418.63..440585.22 rows=182459 width=8) (actual time=0.583..2.992 rows=4697 loops=1)
                       Recheck Cond: (qname_id = 251)
                       Heap Blocks: exact=1774
                       Buffers: shared hit=1791
                       ->  Bitmap Index Scan on fk_alf_nasp_qn (cost=0.00..3373.01 rows=182459 width=0) (actual time=0.384..0.384 rows=4697 loops=1)
                             Index Cond: (qname_id = 251)
                             Buffers: shared hit=17
         ->  Index Only Scan using idx_alf_nprop_s on alf_node_properties prop (cost=0.70..41.74 rows=852 width=8) (actual time=0.022..0.037 rows=23 loops=1)
               Index Cond: ((qname_id = '242'::bigint) AND (string_value = 'E292432'::text))
               Heap Fetches: 0
               Buffers: shared hit=20
   ->  Index Only Scan using alf_node_aspects_pkey on alf_node_aspects aspect (cost=0.57..38.80 rows=14 width=8) (actual time=0.007..0.007 rows=1 loops=22)
         Index Cond: ((node_id = node.id) AND (qname_id = 245))
         Heap Fetches: 22
         Buffers: shared hit=111
 Planning time: 0.621 ms
 Execution time: 10029.799 ms
(29 rows)

Instance 2:

 Nested Loop Semi Join (cost=6471.94..173560891.23 rows=2 width=8) (actual time=0.162..0.470 rows=17 loops=1)
   Buffers: shared hit=257
   ->  Nested Loop (cost=6471.37..173560734.51 rows=45 width=16) (actual time=0.154..0.392 rows=17 loops=1)
         Buffers: shared hit=172
         ->  HashAggregate (cost=3508.15..3516.80 rows=865 width=8) (actual time=0.039..0.043 rows=18 loops=1)
               Group Key: prop.node_id
               Buffers: shared hit=23
               ->  Index Only Scan using idx_alf_nprop_s on alf_node_properties prop (cost=0.70..3505.99 rows=866 width=8) (actual time=0.019..0.033 rows=18 loops=1)
                     Index Cond: ((qname_id = '242'::bigint) AND (string_value = 'E292432'::text))
                     Heap Fetches: 18
                     Buffers: shared hit=23
         ->  Index Scan using alf_node_pkey on alf_node node (cost=2963.22..200644.17 rows=1 width=8) (actual time=0.019..0.019 rows=1 loops=18)
               Index Cond: (id = prop.node_id)
               Filter: ((type_qname_id <> 145) AND (store_id = 6) AND (type_qname_id = 240) AND (NOT (SubPlan 1)))
               Rows Removed by Filter: 0
               Buffers: shared hit=149
               SubPlan 1
                 ->  Materialize (cost=2962.65..397913.00 rows=158204 width=8) (actual time=0.002..0.009 rows=85 loops=17)
                       Buffers: shared hit=59
                       ->  Bitmap Heap Scan on alf_node_aspects aspect_1 (cost=2962.65..396503.98 rows=158204 width=8) (actual time=0.025..0.085 rows=85 loops=1)
                             Recheck Cond: (qname_id = 251)
                             Heap Blocks: exact=55
                             Buffers: shared hit=59
                             ->  Bitmap Index Scan on fk_alf_nasp_qn (cost=0.00..2923.10 rows=158204 width=0) (actual time=0.019..0.019 rows=87 loops=1)
                                   Index Cond: (qname_id = 251)
                                   Buffers: shared hit=4
   ->  Index Only Scan using alf_node_aspects_pkey on alf_node_aspects aspect (cost=0.57..34.32 rows=12 width=8) (actual time=0.004..0.004 rows=1 loops=17)
         Index Cond: ((node_id = node.id) AND (qname_id = 245))
         Heap Fetches: 17
         Buffers: shared hit=85
 Planning time: 0.642 ms
 Execution time: 0.546 ms
(32 rows)

With the vacuum and reindex on those tables I managed to reduce the time from 30s to 10s, but I still don't get the same performance as instance 2.
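
The maintenance I ran was along these lines (a sketch; the exact invocation may have differed slightly):

-- vacuum and analyze the three tables involved in the query
VACUUM ANALYZE alf_node;
VACUUM ANALYZE alf_node_properties;
VACUUM ANALYZE alf_node_aspects;

-- rebuild their indexes (note: REINDEX blocks writes to the table while it runs)
REINDEX TABLE alf_node;
REINDEX TABLE alf_node_properties;
REINDEX TABLE alf_node_aspects;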

work_mem was set to 4MB previously; on instance 1 it is now set to 160MB, while instance 2 is still using 4MB.

shared_buffers set to 8GB in both.
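
For reference, the effective values on each instance can be confirmed with:

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('work_mem', 'shared_buffers');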

Both are on PostgreSQL version 9.6.9.

Thanks in advance.

Fahiz
On 10 Dec 2019, 20:15 +0000, Michael Lewis <mlewis(at)entrata(dot)com>, wrote:
> On Mon, Dec 9, 2019 at 3:39 PM Fahiz Mohamed <fahiz(at)netwidz(dot)com> wrote:
> > > I ran "explain analyse" on this query and I got following result. (We have 2 identical DB instances and they consist of same data. Instane 1 took 20+ second to process and instance 2 took less than a second)
> > >
> > > Instance 1: (This is used by regular User - More than 600,000 request a day) - The result is same even when there is no user in the server.
> > > -> Bitmap Heap Scan on alf_node node (cost=995009.97..3303978.85 rows=4565737 width=8) (actual time=3304.419..20465.551 rows=41109751 loops=1)
> > > Recheck Cond: ((store_id = 6) AND (type_qname_id = 240))
> > > Rows Removed by Index Recheck: 54239131
> > > Filter: (NOT (hashed SubPlan 1))
> > > Rows Removed by Filter: 2816
> > > Heap Blocks: exact=24301 lossy=1875383
> > >
> > > Planning time: 0.639 ms
> > > Execution time: 22946.036 ms
> >
> > https://paquier.xyz/postgresql-2/postgres-9-4-feature-highlight-lossyexact-pages-for-bitmap-heap-scan/
> >
> > That seems like a lot of lossy blocks. As I understand it, that means the system didn't have enough work_mem to fit all the references to the individual rows which perhaps isn't surprising when it estimates it needs 4.5 million rows and ends up with 41 million.
> >
> > Do both DB instances have the same data? I ask because the two plans are rather different which makes me think that statistics about the data are not very similar. Are both configured the same, particularly for shared_buffers and work_mem, as well as the various planning cost parameters like random_page cost? If you can provide these plans again with explain( analyze, buffers ) this time? Did you check on the last time autovacuum ran in pg_stat_user_tables?
> >
> >
