Drastic select count performance hit when jsonb GIN indices are present

From: Anton Melser <melser(dot)anton(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Drastic select count performance hit when jsonb GIN indices are present
Date: 2014-12-26 12:36:07
Message-ID: CAKywjPpKQ6xkjNij6GXJx=nKm2V55uVn1hfNp801OCDYEyiYtg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I've been playing with jsonb for storing and querying data from the Mailgun
Events API (http://documentation.mailgun.com/api-events.html#examples) I
already have a system that parses the JSON to csv and loads into standard
tables but what better way to spend the holidays than nerding out on new
postgres tech :-)? I am using the official postgres-supplied 9.4 Ubuntu
repo on 14.04.

I have been testing both just inserting the raw API call JSON
({"items":[{"item1"....],"paging":...}) and extracting the actual events
from the items array ({"item1":...}, {"item2":...}) and inserting per-event
rather than per page of 100 events in an "items" array. I did this to try
and benefit from GIN indices but much to my surprise, adding indices when
there is a line per-event actually drastically *reduces* performance!

mgevents=# create table myevents (event jsonb);
mgevents=# create index idx_myevents on myevents using gin (event
jsonb_path_ops);
mgevents=# create index idx_myevents_no_path on myevents using gin (event);

mgevents=# select count(*) from myevents;
count
---------
3715600
(1 row)

mgevents=# explain (analyze, buffers) select count(*) from myevents where
event ->> 'event' = 'delivered';
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=279884.45..279884.46 rows=1 width=0) (actual
time=14626.213..14626.216 rows=1 loops=1)
Buffers: shared read=448208
-> Seq Scan on myevents (cost=0.00..279838.00 rows=18578 width=0)
(actual time=0.719..11432.283 rows=1417152 loops=1)
Filter: ((event ->> 'event'::text) = 'delivered'::text)
Rows Removed by Filter: 2298448
Buffers: shared read=448208
Planning time: 0.074 ms
Execution time: 14626.955 ms
(8 rows)

mgevents=# explain (analyze, buffers) select count(*) from myevents where
event @> '{"event": "delivered"}';
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3624.38..3624.39 rows=1 width=0) (actual
time=80984.095..80984.098 rows=1 loops=1)
Buffers: shared hit=1 read=298589
-> Bitmap Heap Scan on myevents (cost=35.80..3615.09 rows=3716
width=0) (actual time=351.510..77669.907 rows=1417152 loops=1)
Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
Heap Blocks: exact=298362
Buffers: shared hit=1 read=298589
-> Bitmap Index Scan on idx_myevents (cost=0.00..34.87 rows=3716
width=0) (actual time=257.219..257.219 rows=1417152 loops=1)
Index Cond: (event @> '{"event": "delivered"}'::jsonb)
Buffers: shared hit=1 read=227
Planning time: 3.197 ms
Execution time: 80986.340 ms
(11 rows)

mgevents=# drop index idx_myevents;

mgevents=# explain (analyze, buffers) select count(*) from myevents where
event @> '{"event": "delivered"}';
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3631.38..3631.39 rows=1 width=0) (actual
time=81898.250..81898.252 rows=1 loops=1)
Buffers: shared hit=383 read=299133
-> Bitmap Heap Scan on myevents (cost=42.80..3622.09 rows=3716
width=0) (actual time=534.816..78526.944 rows=1417152 loops=1)
Recheck Cond: (event @> '{"event": "delivered"}'::jsonb)
Heap Blocks: exact=298362
Buffers: shared hit=383 read=299133
-> Bitmap Index Scan on idx_myevents_no_path (cost=0.00..41.87
rows=3716 width=0) (actual time=453.412..453.412 rows=1417152 loops=1)
Index Cond: (event @> '{"event": "delivered"}'::jsonb)
Buffers: shared hit=383 read=771
Planning time: 2.322 ms
Execution time: 81898.578 ms
(11 rows)

mgevents=# drop index idx_myevents_no_path;

mgevents=# explain (analyze, buffers) select count(*) from myevents where
event @> '{"event": "delivered"}';
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=270558.29..270558.30 rows=1 width=0) (actual
time=19834.530..19834.532 rows=1 loops=1)
Buffers: shared hit=130935 read=317273
-> Seq Scan on myevents (cost=0.00..270549.00 rows=3716 width=0)
(actual time=4.650..16764.726 rows=1417152 loops=1)
Filter: (event @> '{"event": "delivered"}'::jsonb)
Rows Removed by Filter: 2298448
Buffers: shared hit=130935 read=317273
Planning time: 0.238 ms
Execution time: 19834.605 ms
(8 rows)

So it doesn't even appear to be a problem with the operator (->> vs @>) but
rather that the planner is using the index rather than just a scan. Now as
I'm completely new to jsonb and GIN there are probably lots of better ways
to do this and configure the server (a VM on my laptop...) but it seems
surprising that adding an index should ever drastically reduce select
performance! I also tried an index on the event object, with and without
jsonb_path_ops

mgevents=# create index idx_myevents_event_type_path on myevents using gin
((event -> 'event') jsonb_path_ops);

But it just ends up doing a scan anyway, so I'm probably doing something
wrong.

Any pointers? Thanks!
Anton

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andreas Ulbrich 2014-12-26 13:18:25 Re: Check constraint on foreign table using SQL function
Previous Message Sameer Kumar 2014-12-26 11:55:08 Re: help troubleshooting invalid page header error