From: | Samba <saasira(at)gmail(dot)com> |
---|---|
To: | dennis jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Indexing MS/Open Office and PDF documents |
Date: | 2012-03-16 00:45:29 |
Message-ID: | CAKgWO9JGu004KMJ1RD6HtSWX_tQXAZm2wNVV4fKBtNUx0Ko+3A@mail.gmail.com |
Lists: | pgsql-general |
AbiWord can convert any MS Word document into HTML, LaTeX, PostScript, or
plain text with very simple commands; I guess it also exposes some API
that can be integrated into document parsers/indexers.
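
As a rough illustration of the "very simple commands" idea, here is a minimal Python sketch that shells out to AbiWord to turn a Word file into plain text. It assumes the `abiword` binary is on the PATH and that your build accepts the `--to` and `--to-name` options (check `abiword --help` to confirm); the helper names `output_path`, `abiword_command`, and `convert_to_text` are made up for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def output_path(source: str, fmt: str) -> str:
    """Return the source path with its extension swapped for the target format."""
    return str(Path(source).with_suffix("." + fmt))


def abiword_command(source: str, fmt: str = "txt") -> List[str]:
    """Build the argument list for a command-line AbiWord conversion.

    Assumption: --to picks the output format and --to-name the output file.
    """
    return ["abiword", "--to=" + fmt, "--to-name=" + output_path(source, fmt), source]


def convert_to_text(source: str) -> str:
    """Convert a document to plain text with AbiWord and return the text."""
    subprocess.run(abiword_command(source, "txt"), check=True)
    return Path(output_path(source, "txt")).read_text(encoding="utf-8", errors="replace")
```

Calling `convert_to_text("report.doc")` would write `report.txt` next to the input and return its contents, which could then be handed to the indexer.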
Spreadsheets can be processed with the ExcelFormat library
http://www.codeproject.com/Articles/42504/ExcelFormat-Library
or the BasicExcel library
http://www.codeproject.com/Articles/13852/BasicExcel-A-Class-to-Read-and-Write-to-Microsoft
The Gnumeric project also has some API for processing spreadsheets, which
can be used to extract text for indexing.
Code to extract text from PDF
http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file
Overall, I guess there are bits and pieces available on the internet, and
some dedicated effort is needed to assemble them and develop them into a
finished product, namely a document indexer.
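
To make the "assemble the pieces" idea a bit more concrete, here is a small Python sketch of how such an indexer might be wired together: a registry that maps file extensions to text extractors, plus the SQL that would feed the extracted text into PostgreSQL full text search. The table and column names (`documents`, `body_tsv`), the `'english'` configuration, and the `%s` placeholder style (as used by DB-API drivers such as psycopg2) are all assumptions for illustration; only a plain-text extractor is registered, and the Word, spreadsheet, and PDF extractors would be plugged in the same way.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

# Hypothetical schema: raw text plus a tsvector column with a GIN index.
SCHEMA_SQL = """
CREATE TABLE documents (
    id       serial PRIMARY KEY,
    filename text NOT NULL,
    body     text NOT NULL,
    body_tsv tsvector
);
CREATE INDEX documents_body_tsv_idx ON documents USING gin (body_tsv);
"""

# Parameterized statements; a driver would run cursor.execute(sql, params).
INSERT_SQL = (
    "INSERT INTO documents (filename, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)
SEARCH_SQL = (
    "SELECT filename FROM documents "
    "WHERE body_tsv @@ plainto_tsquery('english', %s)"
)


def read_plain_text(path: str) -> str:
    """Extractor for files that are already plain text."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


# Registry of extractors keyed by lower-case file extension.
EXTRACTORS: Dict[str, Callable[[str], str]] = {".txt": read_plain_text}


def extract_text(path: str) -> str:
    """Pick an extractor by file extension and return the document text."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError("no extractor registered for %r" % suffix)
    return EXTRACTORS[suffix](path)


def insert_params(path: str) -> Tuple[str, str, str]:
    """Build the parameter tuple that goes with INSERT_SQL."""
    body = extract_text(path)
    return (Path(path).name, body, body)
```

Keeping the extractors behind one registry means each format can be handled by whichever tool works best, while the database side stays the same.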
Wish you success!
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
On Fri, Mar 16, 2012 at 2:51 AM, dennis jenkins <dennis(dot)jenkins(dot)75(at)gmail(dot)com> wrote:
> On Thu, Mar 15, 2012 at 4:12 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > On Fri, 2012-03-16 at 01:57 +0530, Alexander(dot)Bagerman(at)cognizant(dot)com
> > wrote:
> >> Hi,
> >>
> >> We are looking to use Postgres 9 for the document storing and would
> >> like to take advantage of the full text search capabilities. We have
> >> hard time identifying MS/Open Office and PDF parsers to index stored
> >> documents and make them available for text searching. Any advice would
> >> be appreciated.
> >
> > The first step is to find a library that can parse such documents, or
> > convert them to a format that can be parsed.
>
> I don't know about MS-Office document parsing, but the "PoDoFo" (pdf
> parsing library) can strip text from PDFs. Every now and then someone
> posts to the podofo mailing list with questions related to extracting
> text for the purposes of indexing it in an FTS-capable database. Podofo
> has excellent developer support. The maintainer is quick to accept
> patches, verify bugs, add features, etc... Disclaimer: I'm not a pdf
> nor podofo expert. I can't help you accomplish what you want.