both force order of evaluation and hit index

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: both force order of evaluation and hit index
Date: 2019-04-10 19:03:26
Message-ID: 20190410190326.GA31889@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I wrote a script to import CSV query logs as described here:
https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-CSVLOG

I also created indices for searching:
|CREATE INDEX ON postgres_log_2019_04_09_2300 ((regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float)) WHERE message~'^duration:'

I'm having an issue that queries on long duration 1) need to specify "WHERE
message~'^duration'" to hit the index; 2) need to *evaluate* that condition
before the next condition:
|regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float))>9999

I've tried using CASE and OFFSET 0 and WITH, it seems like any effort I make to
force evaluation of "duration>9999" to happen before "message~'^duration'" also
causes a seq scan rather than index scan on query duration.

I'm hesitating to use ORDER BY just to encourage index scan, since it doesn't
even guarantee that it's going to avoid the error.

This is doing index scan on other/"wrong" index:
|postgres=# explain SELECT date_trunc('minute',log_time)log_time, error_severity, session_id, user_name, database, left(message,99), left(query,99)
|FROM postgres_log_2019_04_10_1000 WHERE log_time>'2019-04-10 10:51' AND
|CASE WHEN message~'^duration:' THEN regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float ELSE NULL END > 9999 AND
|(application_name!='psql' OR user_name='postgres') ORDER BY 1 DESC;

| Sort (cost=74908.17..74955.87 rows=19080 width=101)
| Sort Key: (date_trunc('minute'::text, log_time)) DESC
| -> Bitmap Heap Scan on postgres_log_2019_04_10_1000 (cost=22.24..73551.61 rows=19080 width=101)
| Recheck Cond: (log_time > '2019-04-10 10:51:00-04'::timestamp with time zone)
| Filter: (((application_name <> 'psql'::text) OR (user_name = 'postgres'::text)) AND (CASE WHEN (message ~ '^duration:'::text) THEN (regexp_replace(message, '^duration: ([.0-9]+) ms.*'::text, '\1'::text))::double precision ELSE NULL::double precision END > '9999'::double precision))
| -> Bitmap Index Scan on postgres_log_2019_04_10_1000_log_time_idx (cost=0.00..17.47 rows=58197 width=0)
| Index Cond: (log_time > '2019-04-10 10:51:00-04'::timestamp with time zone)

This gives a nice plan, but doesn't avoid errors due to casting to float lines
which don't match the duration regex.

postgres=# explain SELECT date_trunc('minute',log_time)log_time, dur, error_severity, session_id, user_name, database, left(message,99), left(query,99)
FROM (SELECT *,regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float AS dur FROM postgres_log_2019_04_10_1000 WHERE log_time>'2019-04-10 10:50' AND message~'^duration:')x WHERE
dur>9999 AND (application_name!='psql' OR user_name='postgres') ;

| Bitmap Heap Scan on postgres_log_2019_04_10_1000 (cost=5214.86..50141.68 rows=19742 width=109)
| Recheck Cond: ((log_time > '2019-04-10 10:50:00-04'::timestamp with time zone) AND ((regexp_replace(message, '^duration: ([.0-9]+) ms.*'::text, '\1'::text))::double precision > '9999'::double precision) AND (message ~ '^duration:'::text))
| Filter: ((application_name <> 'psql'::text) OR (user_name = 'postgres'::text))
| -> BitmapAnd (cost=5214.86..5214.86 rows=19819 width=0)
| -> Bitmap Index Scan on postgres_log_2019_04_10_1000_log_time_idx (cost=0.00..17.50 rows=59462 width=0)
| Index Cond: (log_time > '2019-04-10 10:50:00-04'::timestamp with time zone)
| -> Bitmap Index Scan on postgres_log_2019_04_10_1000_duration_idx (cost=0.00..5187.23 rows=238241 width=0)
| Index Cond: ((regexp_replace(message, '^duration: ([.0-9]+) ms.*'::text, '\1'::text))::double precision > '9999'::double precision)
|(8 rows)

|ERROR: invalid input syntax for type double precision: "temporary file: path "pg_tblspc/3292386340/PG_11_201809051/pgsql_tmp/pgsql_tmp16493.0", size 603619328"

If I write as CTE or OFFSET 0 subq, that doesn't allow scanning on "duration"
index, and scans on time index instead:

|postgres=# explain analyze WITH x AS (SELECT *,regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float AS dur FROM postgres_log_2019_04_10_1000 WHERE log_time>'2019-04-10 10:50' AND message~'^duration:') SELECT date_trunc('minute',log_time)log_time, dur, error_severity, session_id, user_name, database, left(message,99), left(query,99)
FROM x WHERE dur>9999 AND (application_name!='psql' OR user_name='postgres') ;

|postgres=# explain analyze SELECT date_trunc('minute',log_time)log_time, dur, error_severity, session_id, user_name, database, left(message,99), left(query,99)
FROM (SELECT *,regexp_replace(message, '^duration: ([.0-9]+) ms.*', '\1')::float AS dur FROM postgres_log_2019_04_10_1000 WHERE log_time>'2019-04-10 10:50' AND message~'^duration:' OFFSET 0)x WHERE
dur>9999 AND (application_name!='psql' OR user_name='postgres') ;

Thanks in advance for any advice.

Justin

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2019-04-10 20:04:51 Re: pg_indexes doesn't show indexes for partitioned tables - bug or intended?
Previous Message Michel Pelletier 2019-04-10 18:55:50 Re: How to use full-text search URL parser to filter query results by domain name?