Re: Indexing MS/Open Office and PDF documents

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Alexander(dot)Bagerman(at)cognizant(dot)com, pgsql-general(at)postgresql(dot)org
Subject:	Re: Indexing MS/Open Office and PDF documents
Date:	2012-03-15 21:17:47
Message-ID:	4F625C7B.7090302@archonet.com
Lists:	pgsql-general

On 15/03/12 21:12, Jeff Davis wrote:
> On Fri, 2012-03-16 at 01:57 +0530, Alexander(dot)Bagerman(at)cognizant(dot)com wrote:

>> We have
>> hard time identifying MS/Open Office and PDF parsers to index stored
>> documents and make them available for text searching.

> The first step is to find a library that can parse such documents, or
> convert them to a format that can be parsed.

I've used docx2txt, pdf2txt and friends to produce text files that I
then index during the import process. An external script runs the whole
process. All I cared about was extracting raw text, though; this does
nothing to identify headings etc.
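
Roughly, the external script could look like the sketch below. This is
only an illustration, not the script I actually ran: the converter
command lines in CONVERTERS (including the assumption that each tool
writes plain text to stdout), the "documents" table and its columns, and
the helper names are all made up for the example, so adjust them to the
tools and schema you really have. The only PostgreSQL feature it relies
on is to_tsvector().

```python
import subprocess
from pathlib import Path

# ASSUMPTION: command lines for the converters; "{src}" is replaced by the
# input file, and each tool is assumed to write plain text to stdout.
CONVERTERS = {
    ".docx": ["docx2txt", "{src}", "-"],
    ".pdf": ["pdf2txt", "{src}"],
}

# ASSUMPTION: hypothetical table
#   documents(filename text, body text, body_tsv tsvector)
INSERT_SQL = (
    "INSERT INTO documents (filename, body, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)


def build_command(path):
    """Return the converter command line for a file, chosen by extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"no converter configured for {suffix!r}")
    return [arg.replace("{src}", str(path)) for arg in CONVERTERS[suffix]]


def extract_text(path):
    """Run the external converter and return the raw text it prints."""
    result = subprocess.run(
        build_command(path), capture_output=True, text=True, check=True
    )
    return result.stdout


def insert_params(path, text):
    """Parameters for INSERT_SQL: file name, raw text, text to be indexed."""
    return (Path(path).name, text, text)
```

With a DB-API driver of your choice, the import step would then be
something like cursor.execute(INSERT_SQL, insert_params(path,
extract_text(path))) for each file, plus a GIN index on body_tsv for the
searches.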

--
Richard Huxton
Archonet Ltd
