Re: ETL - sql orchestrator is stuck when there is not sleep() between queries

From: Allan Barrielle <allan(dot)barrielle(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org, Pierrick Martin <pierrick(at)mart1(dot)dev>
Subject: Re: ETL - sql orchestrator is stuck when there is not sleep() between queries
Date: 2021-07-08 13:51:48
Message-ID: CANrBv8pJ-cAJ1DHXL2Bj7akM-_jqEMDgoaz2dHppjkx_GSxFxg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On a different machine, we use 12.7. Still same issue

On Thu, Jul 8, 2021 at 3:49 PM Allan Barrielle <allan(dot)barrielle(at)gmail(dot)com>
wrote:

> Hello,
>
> > Is it true that the SELECTs have no joins in them ?
>
> Yes there is a lot of LEFT JOIN.
>
> > When it doesn't work, you could check SELECT * FROM pg_stat_activity,
> and
> >SELECT pg_blocking_pids(pid), * FROM pg_locks, to see what's going on.
>
> I can't see any blocking queries blocking pg_locks, pg_blocking_pids.
>
> > It'd be very useful to get "explain analyze" for a working query and for
> a
> > stuck query. It sound like the stuck query never finishes, so maybe the
> second
> > part is impossible (?)
>
> We run an explain analysis and we see some very interesting stuff going on.
> It seems without explicitly adding a `ANALYZE`, the query has a cost of
> over billions, so the query is not stuck but took forever.
> When I run the same scripts with an ANALYZE right before running the
> query, the query is exec is 50secondes and the cost is normal
>
> Explain analyze WITHOUT ANALYZE https://explain.depesz.com/s/RaSr
> Explain analyze same query WITH ANALYZE BEFORE
> https://explain.depesz.com/s/tYVl
>
> The configuration is tuned by aws aurora, but this issue happens also with
> a default config.
>
> allow_system_table_mods,off
> application_name,DataGrip 2021.1.3
> archive_command,(disabled)
> archive_mode,off
> archive_timeout,5min
> array_nulls,on
> authentication_timeout,1min
> autovacuum,on
> autovacuum_analyze_scale_factor,0.05
> autovacuum_analyze_threshold,50
> autovacuum_freeze_max_age,200000000
> autovacuum_max_workers,12
> autovacuum_multixact_freeze_max_age,400000000
> autovacuum_naptime,5s
> autovacuum_vacuum_cost_delay,1ms
> autovacuum_vacuum_cost_limit,1200
> autovacuum_vacuum_scale_factor,0.1
> autovacuum_vacuum_threshold,50
> autovacuum_work_mem,-1
> backend_flush_after,0
> backslash_quote,safe_encoding
> bgwriter_delay,200ms
> bgwriter_flush_after,0
> bgwriter_lru_maxpages,100
> bgwriter_lru_multiplier,2
> bonjour,off
> bytea_output,hex
> check_function_bodies,on
> checkpoint_completion_target,0.9
> checkpoint_flush_after,0
> checkpoint_timeout,15min
> checkpoint_warning,30s
> client_encoding,UTF8
> client_min_messages,notice
> commit_delay,0
> commit_siblings,5
> constraint_exclusion,partition
> cpu_index_tuple_cost,0.005
> cpu_operator_cost,0.0025
> cpu_tuple_cost,0.01
> cursor_tuple_fraction,0.1
> DateStyle,"ISO, MDY"
> db_user_namespace,off
> deadlock_timeout,1s
> debug_pretty_print,on
> debug_print_parse,off
> debug_print_plan,off
> debug_print_rewritten,off
> default_statistics_target,500
> default_text_search_config,pg_catalog.simple
> default_transaction_deferrable,off
> default_transaction_isolation,read committed
> default_transaction_read_only,off
> dynamic_library_path,$libdir
> effective_cache_size,4GB
> effective_io_concurrency,600
> enable_bitmapscan,on
> enable_gathermerge,on
> enable_hashagg,on
> enable_hashjoin,on
> enable_indexonlyscan,on
> enable_indexscan,on
> enable_material,on
> enable_mergejoin,on
> enable_nestloop,on
> enable_parallel_append,on
> enable_parallel_hash,on
> enable_partition_pruning,on
> enable_partitionwise_aggregate,off
> enable_partitionwise_join,off
> enable_seqscan,on
> enable_sort,on
> enable_tidscan,on
> escape_string_warning,on
> event_source,PostgreSQL
> exit_on_error,off
> extra_float_digits,3
> force_parallel_mode,off
> from_collapse_limit,8
> fsync,off
> full_page_writes,off
> geqo,on
> geqo_effort,5
> geqo_generations,0
> geqo_pool_size,0
> geqo_seed,0
> geqo_selection_bias,2
> geqo_threshold,12
> gin_fuzzy_search_limit,0
> gin_pending_list_limit,4MB
> hot_standby,off
> hot_standby_feedback,on
> huge_pages,try
> idle_in_transaction_session_timeout,25min
> ignore_checksum_failure,off
> ignore_system_indexes,off
> IntervalStyle,postgres
> jit,off
> jit_above_cost,100000
> jit_debugging_support,off
> jit_dump_bitcode,off
> jit_expressions,on
> jit_inline_above_cost,500000
> jit_optimize_above_cost,500000
> jit_profiling_support,off
> jit_provider,llvmjit
> jit_tuple_deforming,on
> join_collapse_limit,8
> lc_monetary,C
> lc_numeric,C
> lc_time,C
> listen_addresses,*
> lock_timeout,0
> lo_compat_privileges,off
> maintenance_work_mem,2GB
> max_connections,100
> max_files_per_process,1000
> max_locks_per_transaction,256
> max_logical_replication_workers,4
> max_parallel_maintenance_workers,12
> max_parallel_workers,12
> max_parallel_workers_per_gather,6
> max_pred_locks_per_page,2
> max_pred_locks_per_relation,-2
> max_pred_locks_per_transaction,64
> max_prepared_transactions,0
> max_replication_slots,10
> max_stack_depth,6MB
> max_standby_archive_delay,30s
> max_standby_streaming_delay,14s
> max_sync_workers_per_subscription,2
> max_wal_senders,0
> max_wal_size,8GB
> max_worker_processes,12
> min_parallel_index_scan_size,512kB
> min_parallel_table_scan_size,8MB
> min_wal_size,2GB
> old_snapshot_threshold,-1
> operator_precedence_warning,off
> parallel_leader_participation,off
> parallel_setup_cost,1000
> parallel_tuple_cost,0.1
> password_encryption,md5
> port,5432
> post_auth_delay,0
> pre_auth_delay,0
> quote_all_identifiers,off
> random_page_cost,1.1
> restart_after_crash,on
> row_security,on
> search_path,public
> seq_page_cost,1
> session_replication_role,origin
> shared_buffers,1GB
> standard_conforming_strings,on
> statement_timeout,0
> superuser_reserved_connections,3
> synchronize_seqscans,on
> synchronous_commit,on
> syslog_facility,local0
> syslog_ident,postgres
> syslog_sequence_numbers,on
> syslog_split_messages,on
> tcp_keepalives_count,9
> tcp_keepalives_idle,7200
> tcp_keepalives_interval,75
> temp_buffers,8MB
> temp_file_limit,-1
> TimeZone,UTC
> trace_notify,off
> trace_recovery_messages,log
> trace_sort,off
> track_activities,on
> track_activity_query_size,4kB
> track_commit_timestamp,off
> track_counts,on
> track_functions,none
> track_io_timing,off
> transform_null_equals,off
> update_process_title,on
> vacuum_cleanup_index_scale_factor,0.1
> vacuum_cost_delay,0
> vacuum_cost_limit,200
> vacuum_cost_page_dirty,20
> vacuum_cost_page_hit,1
> vacuum_cost_page_miss,0
> vacuum_defer_cleanup_age,0
> vacuum_freeze_min_age,50000000
> vacuum_freeze_table_age,150000000
> vacuum_multixact_freeze_min_age,5000000
> vacuum_multixact_freeze_table_age,150000000
> wal_buffers,16MB
> wal_compression,off
> wal_level,minimal
> wal_log_hints,off
> wal_receiver_status_interval,10s
> wal_receiver_timeout,30s
> wal_retrieve_retry_interval,5s
> wal_sender_timeout,1min
> wal_sync_method,fdatasync
> wal_writer_delay,200ms
> wal_writer_flush_after,1MB
> work_mem,2GB
> xmlbinary,base64
> xmloption,content
> zero_damaged_pages,off
>
> On Thu, Jul 8, 2021 at 2:33 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
>> On Thu, Jul 08, 2021 at 01:00:28AM +0200, Allan Barrielle wrote:
>> > All is fine, and can work great.
>> > But sometimes, some queries that used to take about 20 secs to complete
>> can
>> > suddenly end in 5mins.
>> > Important all queries have the same shape -> CREATE TABLE SELECT AS *(a
>> bit
>> > of transform) FROM TABLE). No update, nothing, it’s dead simple.
>> >
>> > Nothing works, it’s very random, some query won’t simply work ( even
>> after
>> > hours ).
>>
>> When it doesn't work, you could check SELECT * FROM pg_stat_activity, and
>> SELECT pg_blocking_pids(pid), * FROM pg_locks, to see what's going on.
>>
>> > Important all queries have the same shape -> CREATE TABLE SELECT AS *(a
>> bit
>> > of transform) FROM TABLE). No update, nothing, it’s dead simple.
>> > We are just trying to copy a table from schema1, to schema2, to schema3
>> and
>> > finally schema3. That’s it.
>>
>> Is it true that the SELECTs have no joins in them ?
>>
>> Did this ever work better or differently under different versions of
>> postgres ?
>>
>> > Why does the query never end even after hours ? Why there is no log
>> about
>> > where the query is stuck.
>>
>> Please send your nondefault config.
>> https://wiki.postgresql.org/wiki/Server_Configuration
>>
>> Also enable logging (I just added this to the wiki).
>> https://wiki.postgresql.org/wiki/Slow_Query_Questions#Enable_Logging
>>
>> It'd be very useful to get "explain analyze" for a working query and for a
>> stuck query. It sound like the stuck query never finishes, so maybe the
>> second
>> part is impossible (?)
>>
>> But it'd be good to get at least "explain" output. You'd have to edit
>> your sql
>> script to run an "explain" before each query, and run it, logging the
>> ouput,
>> until you capture the plan for a stuck query. Save the output and send
>> here,
>> along with the query plan for a working query.
>>
>> --
>> Justin
>>
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Justin Pryzby 2021-07-08 14:06:03 Re: ETL - sql orchestrator is stuck when there is not sleep() between queries
Previous Message Allan Barrielle 2021-07-08 13:49:12 Re: ETL - sql orchestrator is stuck when there is not sleep() between queries