Unexpected expensive index scan

From: Jake Nielsen <jake(dot)k(dot)nielsen(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Unexpected expensive index scan
Date: 2016-09-28 00:02:43
Message-ID: CAP3LSG7iwYS6V555qqjNBvuoi3gsMhTU-nN7PcfxJhHhJVbXRw@mail.gmail.com
Lists: pgsql-performance

I've got a query that takes a surprisingly long time to run, and I'm having
a really rough time figuring out why.

Before I get started, here are the specifics of the situation:

Here is the table that I'm working with (apologies for the spammy indices;
I've been throwing shit at the wall):

Table "public.syncerevent"

Column | Type | Modifiers

--------------+---------+----------------------------------------------------------

id | bigint | not null default
nextval('syncerevent_id_seq'::regclass)

userid | text |

event | text |

eventid | text |

originatorid | text |

propogatorid | text |

kwargs | text |

conflicted | integer |

Indexes:

"syncerevent_pkey" PRIMARY KEY, btree (id)

"syncereventidindex" UNIQUE, btree (eventid)

"anothersyncereventidindex" btree (userid)

"anothersyncereventidindexwithascending" btree (userid, id)

"asdfasdgasdf" btree (userid, id DESC)

"syncereventuseridhashindex" hash (userid)

To provide some context, as per the wiki: there are 3,290,600 rows in this
table. It gets added to frequently, but never deleted from. The "kwargs"
column often contains mid-size JSON strings (roughly 30K characters on
average). As of right now, the table has events from 53 distinct users.
About 20% of those have a negligible number of events, but the rest have a
fairly even smattering.
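For completeness, the query being EXPLAINed looks roughly like this
(reconstructed from the Index Cond, Filter, and Limit nodes in the plan
below; the SELECT list is a guess):

SELECT *
  FROM syncerevent
 WHERE id > 12468
   AND userid = '57dc984f1c87461c0967e228'
   AND propogatorid <> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'
   AND conflicted <> 1
 ORDER BY id
 LIMIT 4000;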

EXPLAIN (ANALYZE, BUFFERS) says:

                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual time=3352.390..3403.572 rows=4000 loops=1)
   Buffers: shared hit=120244 read=160198
   ->  Index Scan using syncerevent_pkey on syncerevent  (cost=0.43..388147.29 rows=1274560 width=615) (actual time=3352.386..3383.100 rows=4000 loops=1)
         Index Cond: (id > 12468)
         Filter: ((propogatorid <> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text) AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
         Rows Removed by Filter: 1685801
         Buffers: shared hit=120244 read=160198
 Planning time: 0.833 ms
 Execution time: 3407.633 ms
(9 rows)
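To put those buffer counts in perspective: (120244 hits + 160198 reads)
* 8 kB per buffer comes to roughly 2.1 GB of pages touched, and 1,685,801
rows were fetched and then thrown away by the filter, all to return 4,000
rows.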

The Postgres version is: PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled
by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit

This query has gotten slower over time.

The Postgres server is running on an Amazon RDS db.m3.medium instance
(3.75 GB of RAM, a single ~3 GHz core).

I ran VACUUM and ANALYZEd this table just prior to running the EXPLAIN
command.
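Concretely, that was something along the lines of (plain runs, no extra
options):

VACUUM syncerevent;
ANALYZE syncerevent;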

Here are the server settings:

              name               |               current_setting                |       source
---------------------------------+----------------------------------------------+---------------------
 application_name                | psql                                         | client
 archive_command                 | /etc/rds/dbbin/pgscripts/rds_wal_archive %p  | configuration file
 archive_mode                    | on                                           | configuration file
 archive_timeout                 | 5min                                         | configuration file
 autovacuum_analyze_scale_factor | 0.05                                         | configuration file
 autovacuum_naptime              | 30s                                          | configuration file
 autovacuum_vacuum_scale_factor  | 0.1                                          | configuration file
 checkpoint_completion_target    | 0.9                                          | configuration file
 client_encoding                 | UTF8                                         | client
 effective_cache_size            | 1818912kB                                    | configuration file
 fsync                           | on                                           | configuration file
 full_page_writes                | on                                           | configuration file
 hot_standby                     | off                                          | configuration file
 listen_addresses                | *                                            | command line
 lo_compat_privileges            | off                                          | configuration file
 log_checkpoints                 | on                                           | configuration file
 log_directory                   | /rdsdbdata/log/error                         |

Sorry for the formatting; I'm not sure of the best way to present this data
on a mailing list.

If it matters/interests you, here is my underlying confusion:

From some internet sleuthing, I've decided that having a table per user
(which would totally make this problem a non-issue) isn't a great idea:
since there is a file per table, a table per user wouldn't scale. My next
thought was partial indexes, one per user (which would also totally help),
but since there is also a file per index, that doesn't really side-step the
problem. My rough mental model says: if there exists a way that a
table-per-user scheme would make this more efficient, then there should
also exist an index that achieves the same effect (or close enough not to
matter). I would think that "userid = '57dc984f1c87461c0967e228'" could
utilize at least one of the indexes on the userid column, but clearly I'm
not understanding something.
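For concreteness, the kind of single (not per-user) partial index I mean
would be something like this sketch (the name is made up):

CREATE INDEX syncerevent_userid_id_unconflicted
    ON syncerevent (userid, id)
    WHERE conflicted <> 1;

The propogatorid condition can't go into the predicate, since its value
changes per query, but as far as I understand, a "userid = ? AND id > ?
ORDER BY id LIMIT n" query should be able to walk an index like that
directly.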

Any help in making this query more efficient would be greatly appreciated,
and any conceptual insights would be extra awesome.

Thanks for reading.

-Jake
