Re: [Qgis-user] OCR

From: Mike Ellsworth <younicycle(at)gmail(dot)com>
To: Ramon Andinach <custard(at)westnet(dot)com(dot)au>
Cc: ALT SHN <i(dot)geografica(at)alt-shn(dot)org>, grass-user(at)lists(dot)osgeo(dot)org, qgis-user(at)lists(dot)osgeo(dot)org, pgsql-novice(at)postgresql(dot)org
Subject: Re: [Qgis-user] OCR
Date: 2011-05-04 11:24:13
Message-ID: BANLkTinrieWobY61d1nSjsZfUXJROye_Fg@mail.gmail.com
Lists: pgsql-novice

On Wed, May 4, 2011 at 6:04 AM, Ramon Andinach <custard(at)westnet(dot)com(dot)au> wrote:
>
> On 04/05/2011, at 17:18 , ALT SHN wrote:
>
>> Hello,
>>
>> This might seem a little off topic, but maybe someone here can help me.
>>
>> I need to extract toponymic data from old digitized paper maps. I wish to explore optical character recognition (OCR).
>>
>> Does anyone have a suggestion or experience with this kind of challenge?
>>
>> Thank you,
>>
>> André Mano
>
> I'd argue about the "little". I've spent a lot of the last few weeks arguing with OCR software about tables of sample data from old reports, that become point data to plot. Perfectly relevant :)
>
> I've never tried getting data from digitized maps, but I'll offer the following generalisations in case it helps. Generally, I have OCR programmes pass me the results as plain text. I lose the formatting, but I don't have to fix stupid guesses about the formatting.
>
> This is from my experience, so you may find different.
>
> 1. OCR loves paragraphs.
> 2. Different OCR programmes handle column text differently. Some understand columns, some just assume L->R straight across both columns.
> 3. OCR does not get along with handwritten anything. (Unless the person was extra-extra neat and consistent in their writing, and even then it's a maybe.)
> 4. OCR on tabular data works best if the data is lined up in columns, and doesn't have random big gaps.
> 5. OCR will almost certainly be confused if there is a line on your map running through or near a word.
>  5a. Actually, lines could confuse it quite a bit - I remember one that tried to recreate an in-line sketch map out of ASCII characters. Quite amusing.
> 6. You *will* need to check the results.
>
> I'd love to hear what OCR makes of maps. Very curious.
>
> -ramon.
> --

Going along with the poster above, I think you may have problems if
any of your text is skewed or 'wraps' in any way, so that some
characters are not horizontal in the scan. Recognizing 7 out of 10,
then cleaning up, may be more time-consuming than manual data entry.

You may be able to 1) scan the map, 2) drag a 'box' over each of the
data points and assign an x,y to the box, 3) assign each box an
identifier, and 4) enter the data by hand for each box. At least
you'll have a loose representation of the data.
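
A minimal Python sketch of that box-and-identifier workflow, in case it
helps. The names (LabelBox, boxes_to_csv), the CSV layout and the sample
place name are my own assumptions for illustration, not taken from any
particular tool. The x,y values are pixel coordinates on the scan, with
the box centre used as the single x,y for each box.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class LabelBox:
    """One box dragged over a place name on the scan (hypothetical structure)."""
    box_id: int      # identifier assigned to the box (step 3)
    x_min: float     # box corners in scan pixel coordinates (step 2)
    y_min: float
    x_max: float
    y_max: float
    text: str = ""   # typed in by hand later (step 4)

    def centre(self):
        """Return the single x,y for the box: its centre point."""
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def boxes_to_csv(boxes):
    """Return CSV text with one row per box: id, x, y, text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "x", "y", "text"])
    for box in boxes:
        x, y = box.centre()
        writer.writerow([box.box_id, x, y, box.text])
    return out.getvalue()


# Two boxes drawn on the scan; only the first has been keyed in so far.
boxes = [LabelBox(1, 100, 200, 180, 220), LabelBox(2, 400, 50, 520, 80)]
boxes[0].text = "Lisboa"  # made-up example value
print(boxes_to_csv(boxes))
```

The pixel coordinates would still need converting to map coordinates
once the scan is georeferenced; this only keeps the text tied to a
position.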

One thing I'd google for is 'OCR skewed text', to see if there is an
app that can handle that potential problem before going much further.
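
As a rough way to judge how much of the text is skewed, here is a small
Python sketch that measures the angle of a label's baseline from two
points picked at the start and end of the word. The function names and
the 5 degree tolerance are assumptions of mine, not a known limit of
any OCR program; you would need to find the real threshold by testing.

```python
import math


def baseline_angle_deg(x1, y1, x2, y2):
    """Angle of the text baseline from horizontal, in degrees.

    (x1, y1) is the start of the word and (x2, y2) its end, in reading
    order, both in scan pixel coordinates.
    """
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def is_roughly_horizontal(x1, y1, x2, y2, tolerance_deg=5.0):
    """True if the baseline is within tolerance_deg of horizontal.

    tolerance_deg is an assumed value for illustration only.
    """
    return abs(baseline_angle_deg(x1, y1, x2, y2)) <= tolerance_deg


# A nearly level label and a diagonal one.
print(is_roughly_horizontal(0, 0, 100, 2))   # small tilt
print(is_roughly_horizontal(0, 0, 10, 10))   # 45 degree tilt
```

Labels that fail the check could go straight to manual entry, and only
the roughly horizontal ones would be worth passing to OCR.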

--
Mike Ellsworth
