Random slow queries

From: devel(dot)brain99(at)xoxy(dot)net
To: pgsql-performance(at)postgresql(dot)org
Subject: Random slow queries
Date: 2016-06-29 01:24:14
Message-ID: CACjsgDm+n0s=5cMwGF7nRMXS8fX6CF9t1HsOr_M2iejpxKgS1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hi,

I have a weird slow query issue I can't seem to find the cause of, so I'm
hoping someone here can shed some light on this.

Context:
I have an application which (among other things) implements a sort of job
queue using postgres for persistence. While I know a RDBMS is not necessarily
the most ideal tool for this, there are some constraints and requirements
which make including a messaging/queueing middleware less than trivial.
To reduce locking issues we make use of advisory locks and CTE queries to
select jobs from the queue (inspired by
https://github.com/chanks/que/blob/master/lib/que/sql.rb) as well as some
explicit or optimistic locking during job processing. As the volume of jobs
to process isn't extremely high, performance seems very good - except for the
issue I'm describing below.

The problem:
Most queries execute fast, however sometimes queries on the job table (which
contains the job queue) take exactly 122 seconds (+ 0-50ms) to execute for no
clear reason. I have set log_min_duration_statement to 200ms, and the only
queries that are logged are the 2-minute queries. I have also enabled
log_lock_waits, but no lock waits are logged. Whereas extremely high
performance is not to be expected, a query taking two minutes is way too
much - especially seeing as other concurrent queries seem to be completing
perfectly on time.
As a reference, in a time span of less than 4 minutes the application
processes a test load of 2000+ jobs going through 3 stages in the job
processing 'pipeline' - each stage using the CTE query to pick up a job. So
that's 6000+ executions of the CTE + several 1000 executions of other queries,
of which only a few (usually 1-5 per test run) inexplicably take two minutes
to complete. The offending queries seem to be quite random: SELECT ... FOR
UPDATE queries, CTE queries, and sometimes even regular SELECTs querying our
job table take inexplicably long.

PostgreSQL version:
PostgreSQL 9.3.4, compiled by Visual C++ build 1600, 64-bit

How PostgreSQL was installed: through GUI installer from EnterpriseDB

Changes made to the settings in the postgresql.conf file:
name | current_setting | source
-----------------------------+----------------------+----------------------
application_name | psql | client
client_encoding | WIN1252 | client
DateStyle | ISO, DMY | configuration file
default_text_search_config | pg_catalog.english | configuration file
lc_messages | English_Ireland.1252 | configuration file
lc_monetary | English_Ireland.1252 | configuration file
lc_numeric | English_Ireland.1252 | configuration file
lc_time | English_Ireland.1252 | configuration file
listen_addresses | * | configuration file
log_autovacuum_min_duration | 100ms | configuration file
log_destination | stderr | configuration file
log_line_prefix | %m | configuration file
log_lock_waits | on | configuration file
log_min_duration_statement | 200ms | configuration file
log_timezone | Europe/Brussels | configuration file
logging_collector | on | configuration file
max_connections | 100 | configuration file
max_stack_depth | 2MB | environment variable
port | 5432 | configuration file
shared_buffers | 128MB | configuration file
tcp_keepalives_count | 0 | configuration file
tcp_keepalives_idle | 30 | configuration file
tcp_keepalives_interval | 30 | configuration file
TimeZone | Europe/Brussels | configuration file

Operating system and version: Windows 8.1 64bit (version 6.3 build 9600)

Program used to connect to PostgreSQL: custom Java application using
Hibernate, HikariCP and the pgjdbc-ng 0.6 driver. The application runs on the
same machine as the PostgreSQL database (at least for now, during development)

Relevant or unusual information in the PostgreSQL server logs?: aside from the
slow query log, there is nothing special in the logs I think.
For example, this is the entirety of what was logged during one test run - I
just shortened the select query a bit (replacing a list of fields with *) for
readability as hibernate doesn't produce the most clean queries:
2016-06-27 22:53:06.764 CEST LOG: database system was shut down at
2016-06-27 22:53:04 CEST
2016-06-27 22:53:06.892 CEST LOG: database system is ready to accept
connections
2016-06-27 22:53:07.039 CEST LOG: autovacuum launcher started
2016-06-27 22:55:07.764 CEST LOG: automatic analyze of table
"app.public.job" system usage: CPU 0.00s/0.06u sec elapsed 0.20 sec
2016-06-27 22:57:09.570 CEST LOG: duration: 122006.000 ms bind
cached-1453392550: select job.* from job inner join project on
job.project_id = project.id inner join category on project.category_id
= category.id inner join type on job.type_id = type.id where job.id=$1
2016-06-27 22:57:09.570 CEST DETAIL: parameters: $1 = '34309'
2016-06-27 22:57:09.572 CEST LOG: duration: 122011.000 ms bind
cached--195714239: WITH RECURSIVE jobs AS (SELECT (i).*,
pg_try_advisory_lock((i).id) AS locked FROM ( SELECT i FROM
job_priority AS i WHERE status = $1 ORDER BY priority DESC, received
ASC LIMIT 1) AS i1 UNION ALL (SELECT (i).*,
pg_try_advisory_lock((i).id) AS locked FROM ( SELECT ( SELECT i FROM
job_priority AS i WHERE status = $2 AND (-priority, received) >
(-jobs.priority, jobs.received) ORDER BY priority DESC, received ASC
LIMIT 1) AS i FROM jobs WHERE jobs.id IS NOT NULL LIMIT 1) AS i1 ) )
SELECT id, path, status, received, attempts, last_modified, version
FROM jobs WHERE locked LIMIT 1
2016-06-27 22:57:09.572 CEST DETAIL: parameters: $1 = 'RECEIVED', $2
= 'RECEIVED'
2016-06-27 22:58:07.332 CEST LOG: automatic analyze of table
"app.public.job " system usage: CPU 0.00s/0.06u sec elapsed 0.11 sec
2016-06-27 22:59:07.693 CEST LOG: automatic vacuum of table
"app.public.job ": index scans: 1
pages: 0 removed, 140 remain
tuples: 136 removed, 2391 remain
buffer usage: 546 hits, 0 misses, 209 dirtied
avg read rate: 0.000 MB/s, avg write rate: 3.388 MB/s
system usage: CPU 0.00s/0.00u sec elapsed 0.48 sec

Full table and index schema:
CREATE TABLE job
(
id INTEGER PRIMARY KEY,
a VARCHAR(64) NOT NULL,
path VARCHAR(512) NOT NULL UNIQUE,
status VARCHAR(20) NOT NULL,
project_id INTEGER NOT NULL REFERENCES project(id),
type_id INTEGER NOT NULL REFERENCES type(id),
b INTEGER NOT NULL,
c INTEGER NOT NULL,
d VARCHAR(32) NOT NULL,
received TIMESTAMP NOT NULL,
done TIMESTAMP,
e VARCHAR(512),
error VARCHAR(512),
error_time TIMESTAMP,
attempts INTEGER NOT NULL,
last_modified TIMESTAMP,
version INTEGER
);
\d+ output:
Column | Type | Modifiers | Storage | Stats
------------------+-----------------------------+-----------+----------+------
target | Description
--------+-------------
id | integer | not null | plain |
a | character varying(64) | not null | extended |
path | character varying(512) | not null | extended |
status | character varying(20) | not null | extended |
project_id | integer | not null | plain |
type_id | integer | not null | plain |
b | integer | not null | plain |
c | integer | not null | plain |
d | character varying(32) | not null | extended |
received | timestamp without time zone | not null | plain |
done | timestamp without time zone | | plain |
e | character varying(512) | | extended |
error | character varying(512) | | extended |
error_time | timestamp without time zone | | plain |
attempts | integer | not null | plain |
last_modified | timestamp without time zone | | plain |
version | integer | | plain |
Indexes:
"job_pkey" PRIMARY KEY, btree (id)
"job_path_key" UNIQUE CONSTRAINT, btree (path)
Foreign-key constraints:
"job_project_id_fkey" FOREIGN KEY (project _id) REFERENCES project (id)
"job_type_id_fkey" FOREIGN KEY (type_id) REFERENCES type(id)
Has OIDs: no

Additional information on the table:
- a few columns contain mostly nulls (i.e. a column indicating any error
messages which may occur during processing of the job)
- Due to the nature of the application the table does get a high number of
inserts/updates in a short time. As mentioned above, in the test in which I
observe this issue the job table grows from 0 to 2000-2500 rows very quickly,
with each row receiving on average 4 more updates.
- Note: the CTE selects from a job_priority view which is a JOIN of 3 tables:
job - project - category, with categories having a priority which impacts job
processing order. The test case I am currently using comprises only 5 projects
and a single category.

Explain: unfortunately it's not always the same query that causes issues. If
you want I can still run the most common queries through EXPLAIN ANALYZE but
haven't included this output here as this email is already long enough...

History: this is during development of an application, so there is no
production history

Hardware: i7 CPU, 15 GB of RAM, SSD - this is a development machine, so
there's lots of other stuff running too (but no other databases in active use)

Things I tried:
- Upgrading to PostgreSQL 9.5.3, compiled by Visual C++ build 1800, 64-bit
This did not solve the problem, queries still take 122 seconds from time to
time
- Enable auto_explain to obtain more information. Unfortunately, for the
2-minute queries no plan is logged. If I manually force a query to take a long
time (eg SELECT pg_sleep(5)) or if I set auto_explain.log_min_duration low
enough plans are logged for slow queries, _except_ for these 2-minute
mysteries. This makes it very hard to see exactly why the query is taking so
long. Is this a bug in auto_explain? Are cached plans never logged (this is
not indicated in the documentation)? Some other reason why no plan would be
logged? Note that this happens both in PostgreSQL 9.3.4 and 9.5.3
- Running some queries through EXPLAIN ANALYZE; however, these execute quickly
(a few ms) and don't seem to indicate a clear problem to me

When I run my test there is very high I/O on the system for a few minutes.
However, I am reluctant to point to that as the sole cause for the following
reasons:
- other concurrent queries execute fast (it's a multi threaded application and
other threads continue normally)
- if I look at the resource monitor, some queries remain hung up even after
disk I/O has quieted down
- if I/O performance were the cause I would expect to see a lot more variance,
now the queries always take exactly 122s (with only about 50ms variance)

Given the reasonably small dataset (a pg_dump of the full database containing
about 2500 jobs is less than 1MB) I would think that the whole database fits
in memory anyway, making this issue all the more puzzling. Have I missed
something obvious?

Best regards,
Roel

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Niels Kristian Schjødt 2016-06-29 10:00:00 pg_xlog dir not getting swept
Previous Message George Neuner 2016-06-28 14:10:07 testing - ignore