Quick Links

Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4

From:	Achilleas Mantzios <a(dot)mantzios(at)cloud(dot)gatewaynet(dot)com>
To:	Tomas Vondra <tomas(at)vondra(dot)me>, pgsql-performance(at)lists(dot)postgresql(dot)org, "P(dot)Groidis" <itdev(at)gatewaynet(dot)com>
Subject:	Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4
Date:	2024-11-12 18:24:09
Message-ID:	50da382b-1d47-4332-9279-8af0996f8088@cloud.gatewaynet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Στις 12/11/24 15:17, ο/η Tomas Vondra έγραψε:
> On 11/12/24 11:37, Achilleas Mantzios - cloud wrote:
>> ...
>>
>> We run perf on both systems for 90 seconds to make sure that it included
>> at least two runs of the slow system, and we attach both outputs. We run
>> perf as :
>>
>> sudo perf record -g -p <PID> -- sleep 90
>>
>> it strikes me that we dont find similarities between the two, but we are
>> new to perf.
>>
> Not sure, but it seems the slow profile has a lot of writes. Not sure
> why. Do both instances have the same work_mem value / available memory,
> storatge?
In fact the plain vanilla installation, with all the defaults, runs this
fast, please read further.
>
> Maybe EXPLAIN VERBOSE would show if the remote query is the same ...
>
> Maybe try setting log_temp_files=0 before running the query. Writing the
> foreign scan result set into a temp file could be happening on one of
> the machines only.

No temp files, the remote query shown by EXPLAIN (... VERBOSE) and ms
sql it self is :

Remote query: SELECT NULL FROM [db_ro_non_delosnav].[ACDOC]

Aggregate (cost=168359742.33..168359742.34 rows=1 width=8) (actual
time=15100.248..15100.250 rows=1 loops=1)
  Output: count(*)
  -> Foreign Scan on mssql_bdynacom."ACDOC"
(cost=200.00..168355533.86 rows=1683386 width=100) (actual
time=3.783..14967.029 rows=1683386 loops=1)
        Output: "ID", "DOC_TYPE", "PARAST", "TRNS_DATE", "VIA_MNG",
"VIA_MNG_ID", "FL_UPD", "NOTES", "REM_DATE", "REM_CLS", "OPN_CLS",
"ACSITE_ID", "USERS_ID", "LAST_UPD_DA
TE", "LAST_UPD_USERS_ID", "OTH_TYPE", "OTH_TYPE_DESC", "OTH_APPL",
"FL_PRN", "CREATE_DATE", "LOG_NOTE", "PER_ID", "ACDEPT_ID", "FL_DLT",
"CMP_ID", "INTPER_ID", "FL_INT_TRNCL
S", "FL_ACTV", "OLD_ACDEPT_ID", "OLD_USERS_ID", "OLD_TRNS_DATE",
"OLD_FL_UPD_DATE", "FL_FNLZ", "ERROR_NOTES", "FL_ATTCH", "OPDOC_ID",
"FL_ONHOLD", "ONHOLD_NOTES", "FL_RVS",
"RVS_DOC_ID", "FL_REV", "REV_NO", "REV_DOC_ID", "REV_NOTES", "REV_DATE",
"REV_USER_ID", "FL_PROT", "INTGR_DOC_ID", "INTGR_DOC_TP", "P_DOC_ID",
"FL_OPN", "FL_MDL", "USERUPDAC
C_ID", "DATE_UPDACC", "USERIN_ID", "USERUPD_ID", "DATE_IN", "DATE_UPD",
"DOC_SUBTYPE", "DOC_REF", "FRTRN_TYPE", "FL_ACC", "ATCH_DIR",
"ATCH_REM", "KEYID", "OTH_APPL1", "RGST
R_ID", "ACDOC_LIST_NO", "EXEC_DATE", "EXEC_USER_ID", "EXEC_NOTES",
"CNL_EXEC_DATE", "CNL_EXEC_USER_ID", "FL_EXEC", "CNL_EXEC_NOTES",
"LOG_NOTE1", "USERREUPDACC_ID", "DATE_RE
UPDACC", "P_ACDOC_ID", "FL_AUTHORITY_APRV", "FL_APPROVE_TYPE",
"FL_NOTINCLUDE_INAUTH", "FL_NO_OTHERCURRENCY", "FL_WFSTATUS", "FL_APRV",
"USRACTV_LOG_ID", "USERIN_DNAME", "US
ERUPD_DNAME", "UROLEIN_ID", "UROLEIN_NAME", "UDEPTIN_ID",
"UDEPTIN_NAME", "FL_LOAD_5805", "UROLEUPD_ID", "UROLEUPD_NAME",
"CNTRY_ID", fl_togetback, "FULLPATH", "FL_UPD_OLD",
"FL_REOINTG", "FL_OINTGR_RESEND"
        Remote query: SELECT NULL FROM [db_ro_non_delosnav].[ACDOC]
Query Identifier: 6812542821581303630
Planning Time: 14029.724 ms
JIT:
  Functions: 2
  Options: Inlining true, Optimization true, Expressions true,
Deforming true
  Timing: Generation 0.168 ms, Inlining 2.087 ms, Optimization 3.301
ms, Emission 3.564 ms, Total 9.121 ms
Execution Time: 15102.803 ms
(12 rows)

I gather this is how it computes the plan.

I think we have narrowed down the problem, and this is extremely strange :

It is not a matter of pgsql version 10 or 16, it is a matter of the
postgres user, the user that owns the data dir(s) and the user of the
postgres process. We reproduced both the problem and the solution with
all combinations of versions.

To sum it up :

Our sysadm created the system debian Debian GNU/Linux 12 (bookworm)with
postgres as user 1000. Now at some point we realized that whenever we
run a pgsql cluster with another user (I found that after spending two
good days testing), the above query runs in about 1 second. With user
postgres 1000 in 30 seconds. As you saw the perf output are completely
different.

On the test VM, we removed the postgres user, recreated with uid=1003,
chown -R all the mount points + table spaces, started postgres with the
new postgres user , and bingo . The chown -R on the 5TB is instant...
just saying ...

We are puzzled what can be causing this. Tomorrow we dig into GDB ,
*trace and the like.

If this rings any bells we would be more than grateful to know.

Also, the worse, is the suspicion that maybe our whole infra performance
is affected. We hope it is only free-tds and tds_fdw .

>
>
> regards
>

In response to

Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4 at 2024-11-12 13:17:53 from Tomas Vondra

Responses

Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4 at 2024-11-12 19:37:36 from Tom Lane

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Tom Lane	2024-11-12 19:37:36	Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4
Previous Message	Tomas Vondra	2024-11-12 13:17:53	Re: tds_fdw : Severe performance degradation from postgresql 10.23 to 16.4