From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Peter Geoghegan <pg(at)heroku(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Ian Barwick <ian(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Jim Nasby <jim(at)nasby(dot)net>, Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at> |
Subject: | Re: Doing better at HINTing an appropriate column within errorMissingColumn() |
Date: | 2014-11-19 17:52:24 |
Message-ID: | CA+Tgmob0dHVxkSJjiTWCB5m7nTnk8oSbp7R+=LmwD+eMsTcKfg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Nov 19, 2014 at 12:33 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Wed, Nov 19, 2014 at 5:43 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I think we would be well-advised not to start inventing our own
>> approximate matching algorithm. Peter's suggestion boils down to a
>> guess that the default cost parameters for Levenshtein suck, and your
>> suggestion boils down to a guess that we can fix the problems with
>> Peter's suggestion by bolting another heuristic on top of it - and
>> possibly running Levenshtein twice with different sets of cost
>> parameters. Ugh.
>
> I agree.
>
> While I am perfectly comfortable with the fact that we are guessing
> here, my guesses are based on what I observed to work well with real
> schemas, and simulated errors that I thought were representative of
> human error. Obviously it's possible that another scheme will do
> better sometimes, including for example a scheme that picks a match
> entirely at random. But on average, I think that what I have here will
> do better than anything else proposed so far.
If you agree, then I'm not being clear enough. I don't think think
that tinkering with the Levenshtein cost factors is a good idea, and I
think it's unhelpful to suggest something when the suggestion and the
original word differ by more than 50% of the number characters in the
shorter word. Suggesting "col" for "oid" or "x" for "xmax", as crops
up in the regression tests with this patch applied, shows the folly of
this: the user didn't mean the other named column; rather, the user
was confused about whether a particular system column existed for that
table.
If we had a large database of examples showing what the user typed and
what they intended, we could try different algorithms against it and
see which one performs best with fewest false positives. But if we
don't have that, we should do things that are like the things that
other people have done before.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Dimitri Fontaine | 2014-11-19 17:57:21 | Re: New Event Trigger: table_rewrite |
Previous Message | Robert Haas | 2014-11-19 17:34:38 | Re: proposal: plpgsql - Assert statement |