Re: Indexing MS/Open Office and PDF documents

From:	Samba <saasira(at)gmail(dot)com>
To:	dennis jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Indexing MS/Open Office and PDF documents
Date:	2012-03-16 00:45:29
Message-ID:	CAKgWO9JGu004KMJ1RD6HtSWX_tQXAZm2wNVV4fKBtNUx0Ko+3A@mail.gmail.com
Lists:	pgsql-general

Word documents can be processed by AbiWord, which converts any MS Word
document into HTML, LaTeX, PostScript, or plain-text formats with very simple
commands; I guess it also exposes some API which can be integrated into
document parsers/indexers.
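
As a rough illustration of the "very simple commands" mentioned above, here is a minimal Python sketch that shells out to AbiWord's command-line converter. It assumes the `abiword` binary is on PATH and accepts the `--to` and `--to-name` options; the helper names `abiword_command` and `extract_text` are made up for this example and are not part of AbiWord.

```python
import shutil
import subprocess
from pathlib import Path


def abiword_command(source, fmt="txt"):
    """Build the AbiWord command line that converts `source` to `fmt`.

    Returns the argument list and the path of the file AbiWord should write.
    """
    source = Path(source)
    target = source.with_suffix("." + fmt)
    # --to selects the output format, --to-name the output file (assumed options).
    command = ["abiword", "--to=" + fmt, "--to-name=" + str(target), str(source)]
    return command, target


def extract_text(source):
    """Convert a word-processor file to plain text and return its contents."""
    if shutil.which("abiword") is None:
        raise RuntimeError("abiword is not installed or not on PATH")
    command, target = abiword_command(source, "txt")
    subprocess.run(command, check=True)
    return target.read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("report.doc")` would write `report.txt` next to the input and return its contents, which could then be handed to the database for indexing.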

Spreadsheets can be processed with the *ExcelFormat* library:
http://www.codeproject.com/Articles/42504/ExcelFormat-Library

or the *BasicExcel* library:
http://www.codeproject.com/Articles/13852/BasicExcel-A-Class-to-Read-and-Write-to-Microsoft

The GNU Gnumeric project also has some API for processing spreadsheets, which
can be used to extract text for indexing.
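
One possible way to get spreadsheet text without writing against a library API is to export the workbook to CSV and flatten the cells. The sketch below assumes Gnumeric's `ssconvert` command-line tool is installed, which is a swapped-in alternative to the API mentioned above; the function names are hypothetical, and multi-sheet workbooks may need extra options that are not handled here.

```python
import csv
import io
import subprocess
from pathlib import Path


def ssconvert_command(source, target_csv):
    """Build the ssconvert call; the output format follows the target extension."""
    return ["ssconvert", str(source), str(target_csv)]


def cells_to_text(csv_text):
    """Join every non-empty cell of a CSV export into one searchable string."""
    reader = csv.reader(io.StringIO(csv_text))
    cells = [cell.strip() for row in reader for cell in row if cell.strip()]
    return " ".join(cells)


def spreadsheet_text(source):
    """Export a spreadsheet to CSV next to the input and return its cell text."""
    target = Path(source).with_suffix(".csv")
    subprocess.run(ssconvert_command(source, target), check=True)
    return cells_to_text(target.read_text(encoding="utf-8", errors="replace"))
```

Flattening to a single string loses the row and column layout, which is usually acceptable when the only goal is full text search.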

Code to extract text from PDF
http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file

Overall, I guess there are bits and pieces available on the internet, and
some dedicated effort is needed to assemble them and develop them into a
finished product, namely a document indexer.
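
To show how the extracted text might end up in PostgreSQL full text search, here is a small sketch of the database side of such an indexer. The table and column names are invented for illustration, the 'english' text search configuration is an assumption, and the `%s` placeholders assume a DB-API driver that uses that parameter style (psycopg2, for example).

```python
# Hypothetical schema: one row per document, with a precomputed tsvector.
CREATE_TABLE_SQL = """
CREATE TABLE documents (
    id       serial PRIMARY KEY,
    filename text NOT NULL,
    body     text NOT NULL,
    body_tsv tsvector
)
"""

# A GIN index keeps @@ searches fast as the table grows.
CREATE_INDEX_SQL = (
    "CREATE INDEX documents_body_tsv_idx ON documents USING gin (body_tsv)"
)

INSERT_SQL = (
    "INSERT INTO documents (filename, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)

SEARCH_SQL = (
    "SELECT filename FROM documents "
    "WHERE body_tsv @@ plainto_tsquery('english', %s)"
)


def index_document(cursor, filename, text):
    """Store the extracted text and its tsvector using an open DB-API cursor."""
    cursor.execute(INSERT_SQL, (filename, text, text))


def search(cursor, query):
    """Return the filenames of documents matching a plain-language query."""
    cursor.execute(SEARCH_SQL, (query,))
    return [row[0] for row in cursor.fetchall()]
```

The text extraction itself stays outside the database in this design; PostgreSQL only ever sees plain text, so any converter that produces text can feed it.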

Wish you success!

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
On Fri, Mar 16, 2012 at 2:51 AM, dennis jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com> wrote:

> On Thu, Mar 15, 2012 at 4:12 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > On Fri, 2012-03-16 at 01:57 +0530, Alexander(dot)Bagerman(at)cognizant(dot)com
> > wrote:
> >> Hi,
> >>
> >> We are looking to use Postgres 9 for document storage and would
> >> like to take advantage of the full text search capabilities. We are
> >> having a hard time identifying MS/Open Office and PDF parsers to index
> >> stored documents and make them available for text searching. Any advice
> >> would be appreciated.
> >
> > The first step is to find a library that can parse such documents, or
> > convert them to a format that can be parsed.
>
> I don't know about MS-Office document parsing, but the "PoDoFo" (pdf
> parsing library) can strip text from PDFs. Every now and then someone
> posts to the podofo mailing list with questions related to extracting
> text for the purposes of indexing it in FTS capable database. Podofo
> has excellent developer support. The maintainer is quick to accept
> patches, verify bugs, add features, etc... Disclaimer: I'm not a pdf
> nor podofo expert. I can't help you accomplish what you want.

In response to

Re: Indexing MS/Open Office and PDF documents at 2012-03-15 21:21:48 from dennis jenkins
