Re: Importing text file into a TEXT field

From: Bruno Lavoie <bruno(dot)lavoie(at)gmail(dot)com>
To: Sam Mason <sam(at)samason(dot)me(dot)uk>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Importing text file into a TEXT field
Date: 2008-11-10 14:37:16
Message-ID: 4918471C.1020703@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello Sam,

Thanks for your Python script! It's on my todo list to learn this
scripting language. I've done a mini script in Perl for my testing
purposes and then test full text search!

Here is my table definition:
test1=# \d pdfsys.document;
Table "pdfsys.document"
Column | Type | Modifiers
---------------+-----------------------------+-----------
document_id | integer | not null
original_file | character varying(4000) | not null
pdf_text | text | not null
pdf_ts | tsvector |
md5 | character varying(32) |
date_added | timestamp without time zone |
Indexes:
"documents_pkey" PRIMARY KEY, btree (document_id)

And my script, that get all .pdf documents from a directory and extract
text, store it and update tsvector:
#!/usr/bin/perl
use strict;
use warnings;

use DBI;

my $dbname = '***********';
my $user = '***********';
my $pass = '***********';

my $dbh = DBI->connect("dbi:Pg:dbname=$dbname", $user, $pass,
{AutoCommit => 0, RaiseError => 1, PrintError => 0});

opendir(DIR, "./files/");
my @files = grep(/\.pdf$/,readdir(DIR));
closedir(DIR);

undef $/;

my $sth = $dbh->prepare("INSERT INTO pdfsys.document(document_id,
original_file, pdf_text) VALUES (nextval('pdfsys.document_id'), ?, ?)");
foreach my $file (@files) {
print "*** $file ***\n";

open (PS, 'pdftotext -layout -nopgbrk -enc UTF-8 ./files/'. $file .'
- |');
my $content = <PS>;
close (PS);

$sth->execute($file, $content);
}

$dbh->do("update pdfsys.document set pdf_ts = to_tsvector('english',
pdf_text) where pdf_ts is null");

$dbh->commit();
$dbh->disconnect;

It works fine.... I need to develop this more sophisticated: different
language per files, file versionning, etc...

Bruno Lavoie

Sam Mason a écrit :
> On Fri, Nov 07, 2008 at 01:20:27PM -0500, Bruno Lavoie wrote:
>
>> The intent is to use pdftotext and store the resulting text in datbase
>> for full text search purposes... I'm trying to develop a mini content
>> server where I'll put pdf documents to make it searchable.
>>
>
> I've not tried to do this sort of thing before; but the FTS code (native
> in PG 8.3, contrib modules before this version) sounds like what you
> want to be using. As far as getting the data in, you're going to have
> to write a bit of code. A quick hack suggests that you can get things
> going in a small amount of Python code:
>
> import sys;
> import psycopg2;
> conn = psycopg2.connect("");
> cur = conn.cursor();
> cur.execute("INSERT INTO tbl (tsvec) SELECT to_tsvector(%s);",
> [sys.stdin.read()]);
> conn.commit();
>
> You can then do:
>
> pdftotext file.pdf - | python script.py
>
> One performance issue with psycopg2 is that it always expands the SQL;
> you may want to find something that uses PQexecParams() underneath so
> you spend less time escaping everything and then having PG undo that
> work.
>
>
> Sam
>
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Bruno Lavoie 2008-11-10 14:39:00 Re: Importing text file into a TEXT field
Previous Message Dmitry Teslenko 2008-11-10 14:22:29 LIKE, "=" and fixed-width character fields