From: | Bruno Lavoie <bruno(dot)lavoie(at)gmail(dot)com> |
---|---|
To: | Sam Mason <sam(at)samason(dot)me(dot)uk> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Importing text file into a TEXT field |
Date: | 2008-11-10 14:37:16 |
Message-ID: | 4918471C.1020703@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hello Sam,
Thanks for your Python script! It's on my todo list to learn this
scripting language. I've done a mini script in Perl for my testing
purposes and then test full text search!
Here is my table definition:
test1=# \d pdfsys.document;
Table "pdfsys.document"
Column | Type | Modifiers
---------------+-----------------------------+-----------
document_id | integer | not null
original_file | character varying(4000) | not null
pdf_text | text | not null
pdf_ts | tsvector |
md5 | character varying(32) |
date_added | timestamp without time zone |
Indexes:
"documents_pkey" PRIMARY KEY, btree (document_id)
And my script, that get all .pdf documents from a directory and extract
text, store it and update tsvector:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbname = '***********';
my $user = '***********';
my $pass = '***********';
my $dbh = DBI->connect("dbi:Pg:dbname=$dbname", $user, $pass,
{AutoCommit => 0, RaiseError => 1, PrintError => 0});
opendir(DIR, "./files/");
my @files = grep(/\.pdf$/,readdir(DIR));
closedir(DIR);
undef $/;
my $sth = $dbh->prepare("INSERT INTO pdfsys.document(document_id,
original_file, pdf_text) VALUES (nextval('pdfsys.document_id'), ?, ?)");
foreach my $file (@files) {
print "*** $file ***\n";
open (PS, 'pdftotext -layout -nopgbrk -enc UTF-8 ./files/'. $file .'
- |');
my $content = <PS>;
close (PS);
$sth->execute($file, $content);
}
$dbh->do("update pdfsys.document set pdf_ts = to_tsvector('english',
pdf_text) where pdf_ts is null");
$dbh->commit();
$dbh->disconnect;
It works fine.... I need to develop this more sophisticated: different
language per files, file versionning, etc...
Bruno Lavoie
Sam Mason a écrit :
> On Fri, Nov 07, 2008 at 01:20:27PM -0500, Bruno Lavoie wrote:
>
>> The intent is to use pdftotext and store the resulting text in datbase
>> for full text search purposes... I'm trying to develop a mini content
>> server where I'll put pdf documents to make it searchable.
>>
>
> I've not tried to do this sort of thing before; but the FTS code (native
> in PG 8.3, contrib modules before this version) sounds like what you
> want to be using. As far as getting the data in, you're going to have
> to write a bit of code. A quick hack suggests that you can get things
> going in a small amount of Python code:
>
> import sys;
> import psycopg2;
> conn = psycopg2.connect("");
> cur = conn.cursor();
> cur.execute("INSERT INTO tbl (tsvec) SELECT to_tsvector(%s);",
> [sys.stdin.read()]);
> conn.commit();
>
> You can then do:
>
> pdftotext file.pdf - | python script.py
>
> One performance issue with psycopg2 is that it always expands the SQL;
> you may want to find something that uses PQexecParams() underneath so
> you spend less time escaping everything and then having PG undo that
> work.
>
>
> Sam
>
>
From | Date | Subject | |
---|---|---|---|
Next Message | Bruno Lavoie | 2008-11-10 14:39:00 | Re: Importing text file into a TEXT field |
Previous Message | Dmitry Teslenko | 2008-11-10 14:22:29 | LIKE, "=" and fixed-width character fields |