RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4 (workarounds)

From: "ldh(at)laurent-hasson(dot)com" <ldh(at)laurent-hasson(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4 (workarounds)
Date: 2021-08-30 04:20:38
Message-ID: MN2PR15MB25601B28AF5A3096A646E1DF85CB9@MN2PR15MB2560.namprd15.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

> -----Original Message-----
> From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
> Sent: Sunday, August 29, 2021 23:17
> To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
> Cc: ldh(at)laurent-hasson(dot)com; Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>; Ranier
> Vilela <ranier(dot)vf(at)gmail(dot)com>; Andrew Dunstan
> <andrew(at)dunslane(dot)net>; pgsql-performance(at)postgresql(dot)org
> Subject: Re: Big Performance drop of Exceptions in UDFs between V11.2
> and 13.4 (workarounds)
>
> On Mon, Aug 30, 2021 at 04:43:23AM +0200, Pavel Stehule wrote:
> > po 30. 8. 2021 v 2:44 odesílatel ldh(at)laurent-hasson(dot)com napsal:
> > > At this point, I am not sure how to proceed except to rethink that
> > > toFloat() function and many other places where we use exceptions.
> We
> > > get such dirty data that I need a "safe" way to convert a string to
> > > float without throwing an exception. BTW, I tried other
> combinations
> > > in case there may have been some weird interactions with the ::REAL
> > > conversion operator, but nothing made any change. Could you
> > > recommend another approach off the top of your head? I could use
> > > regexes for testing etc... Or maybe there is another option like a
> > > no-throw conversion that's built in or in some extension that you
> may know of? Like the "SAFE." Prefix in BigQuery.
> >
> > CREATE OR REPLACE FUNCTION safe_to_double_precision(t text)
> RETURNS
> > double precision AS $$ BEGIN
> > IF $1 SIMILAR TO '[+-]?([0-9]*[.])?[0-9]+' THEN
> > RETURN $1::double precision;
> > ELSE
> > RETURN NULL;
> > END IF;
> > END;
> > $$ LANGUAGE plpgsql IMMUTABLE STRICT;
>
> This tries to use a regex to determine if something is a "Number" or not.
> Which has all the issues enumerated in painful detail by long answers on
> stack overflow, and other wiki/blog/forums.
>
> Rather than trying to define Numbers using regex, I'd try to avoid only
> the most frequent exceptions and get 90% of the performance back. I
> don't know what your data looks like, but you might try things like this:
>
> IF $1 IS NULL THEN RETURN $2
> ELSE IF $1 ~ '^$' THEN RETURN $2
> ELSE IF $1 ~ '[[:alpha:]]{2}' THEN RETURN $2 ELSE IF $1 !~ '[[:digit:]]' THEN
> RETURN $2
> BEGIN
> RETURN $1::float;
> EXCEPTION WHEN OTHERS THEN
> RETURN $2;
> END;
>
> You can check the stackoverflow page for ideas as to what kind of thing
> to reject, but it may depend mostly on your data (what is the most
> common string?
> The most common exceptional string?).
>
> I think it's possible that could even be *faster* than the original, since it
> avoids the exception block for values which are for sure going to cause
> an exception anyway. It might be that using alternation (|) is faster (if
> less
> readable) than using a handful of IF branches.
>
> --
> Justin

That's exactly where my head was at. I have looked different way to test for a floating point number and recognize the challenge 😊

The data is very messy with people entering data by hand. We have seen alpha and punctuation, people copy/pasting from excel so large numbers get the "e" notation. It's a total mess. The application that authors that data is a piece of crap and we have no chance to change it unfortunately. Short of rolling out an ETL process, which is painful for the way our data comes in, I need an in-db solution.

Thank you!
Laurent.

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message ldh@laurent-hasson.com 2021-08-30 16:04:04 RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4
Previous Message Justin Pryzby 2021-08-30 03:16:48 Re: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4 (workarounds)