Re: any solution for doing a data file import spawning it on multiple processes

From: Edson Richter <edsonrichter(at)hotmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: any solution for doing a data file import spawning it on multiple processes
Date: 2012-06-16 16:23:21
Message-ID: BLU0-SMTP47A450268B2B3573457921CFFA0@phx.gbl
Lists: pgsql-general

On 16/06/2012 12:59, hb(at)101-factory(dot)eu wrote:
> thanks, i thought about splitting the file, but that did not work out well.
>
> we receive 2 files every 30 seconds and need to import them as fast as possible.
>
> we do not run java currently, but maybe it's an option.
> are you willing to share your code?
>
> i was also thinking of using perl for it
>
>
> henk
>
> On 16 jun. 2012, at 17:37, Edson Richter <edsonrichter(at)hotmail(dot)com> wrote:
>
>> On 16/06/2012 12:04, hb(at)101-factory(dot)eu wrote:
>>> hi there,
>>>
>>> I am trying to import large data files into pg.
>>> for now i use the linux xargs command to spawn one process per line of the file, using as many connections as are available.
>>>
>>> we use pgpool as the connection pool to the database, and so try to maximize the concurrency of the file import.
>>>
>>> the problem is that it seems to work well, but we miss a line once in a while, and that is not acceptable. it also creates zombies ;(.
>>>
>>> does anybody have any other tricks that will do the job?
>>>
>>> thanks,
>>>
>>> Henk
>> I've used a custom Java application with connection pooling (limited to 1000 connections, meaning 1000 concurrent file imports).
>>
>> I'm able to import more than 64000 XML files (about 13 KB each) in 5 minutes, without memory leaks or zombies, and (of course) with no missing records.
>>
>> Besides having each thread import a separate file, I have another situation where separate threads import different lines of the same file. No problems at all. Do not forget to check your OS "open files" limits (that was a big issue for me in the past, due to Lucene indexes generated during import).
>>
>> Server: 8 core Xeon, 16 GB, SAS 15000 rpm disks, PostgreSQL 9.1.3, Linux CentOS 5, Sun Java 1.6.27.
>>
>> Regards,
>>
>> Edson Richter
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
I'm not allowed to publish my company's code, but the logic is very easy
to understand (you will have to "invent" your own solution; the code
below is bare-bones):

import java.io.File;
import java.io.FileFilter;

class MainThread implements Runnable {
    // volatile so that a stop request from another thread is seen by run()
    private volatile boolean keepRunning = true;

    public void run() {
        while (keepRunning) {
            try {
                executeFiles();
                Thread.sleep(30000); // sleep 30 seconds
            } catch (InterruptedException ex) {
                // keep the interrupt flag set and stop polling
                Thread.currentThread().interrupt();
                return;
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

    private void executeFiles() {
        File monitorDir = new File("/var/mydatafolder/");
        File processingDir = new File("/var/myprocessingfolder/");

        // I'll import only files with names like "data20120621.csv":
        FileFilter fileFilter = new FileFilter() {
            public boolean accept(File file) {
                if (!file.isFile() || file.isHidden()) return false;
                String fname = file.getName();
                return fname.startsWith("data") && fname.endsWith(".csv");
            }
        };

        // listFiles returns an array, or null if the directory cannot be read
        File[] forProcessing = monitorDir.listFiles(fileFilter);
        if (forProcessing == null) return;

        for (File fileFound : forProcessing) {
            // FileUtil is a utility class, you will have to create your own...
            // your move method will vary according to your operating system
            FileUtil.move(fileFound, processingDir);
            // ProcessFile is a class that implements Runnable; do your stuff there...
            Thread t = new Thread(new ProcessFile(processingDir, fileFound.getName()));
            t.start();
        }
    }

    /** Use this method to stop the thread from another place in your complex system! */
    public void stopWorker() {
        keepRunning = false;
    }

    public static void main(String[] args) {
        Thread t = new Thread(new MainThread());
        t.start();
    }
}
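
The post leaves FileUtil to the reader. As a rough illustration only (this is not the code referred to in the message), a move helper could be sketched as follows, assuming Java 8 or later (java.nio.file needs Java 7 and UncheckedIOException needs Java 8, both newer than the Java 1.6 mentioned in the thread). The class body and the temporary folder names in main are assumptions made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical stand-in for the FileUtil helper named in the post. */
final class FileUtil {
    private FileUtil() {}

    /** Moves source into targetDir, keeping its file name, and returns the new location. */
    static File move(File source, File targetDir) {
        Path target = targetDir.toPath().resolve(source.getName());
        try {
            // ATOMIC_MOVE fails with AtomicMoveNotSupportedException instead of
            // falling back to copy-and-delete, e.g. across different file systems.
            return Files.move(source.toPath(), target, StandardCopyOption.ATOMIC_MOVE).toFile();
        } catch (IOException ex) {
            // Rethrow unchecked so callers without a throws clause still see the failure.
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary folders stand in for the monitor and processing directories.
        File incoming = Files.createTempDirectory("incoming").toFile();
        File processing = Files.createTempDirectory("processing").toFile();
        File data = new File(incoming, "data20120621.csv");
        Files.write(data.toPath(), "1;a\n2;b\n".getBytes(StandardCharsets.UTF_8));

        File moved = move(data, processing);
        System.out.println("moved to " + moved + ", source still exists: " + data.exists());
    }
}
```

Requesting an atomic move means a worker never sees a half-moved file, at the price of an exception when the two folders are on different file systems.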

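Starting one new Thread per file, as in the bare-bones code, puts no upper bound on concurrency. One possible variation, offered as an assumption rather than as what the production system does, is a fixed-size pool from java.util.concurrent, sized at or below the number of database connections available. The class name BoundedImporter and its methods are hypothetical, and the tasks stand in for ProcessFile instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a fixed-size worker pool instead of one Thread per file. */
final class BoundedImporter {
    private final ExecutorService pool;

    BoundedImporter(int maxConcurrentImports) {
        // Keep this at or below the number of database connections you can use.
        this.pool = Executors.newFixedThreadPool(maxConcurrentImports);
    }

    /** Queues every task; extra tasks wait in the pool's queue until a worker is free. */
    List<Future<?>> submitAll(List<Runnable> tasks) {
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (Runnable task : tasks) {
            futures.add(pool.submit(task));
        }
        return futures;
    }

    /** Stops accepting new work and waits for the queued imports to finish. */
    boolean shutdownAndWait(long timeoutSeconds) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        final AtomicInteger imported = new AtomicInteger();
        List<Runnable> tasks = new ArrayList<Runnable>();
        for (int i = 0; i < 100; i++) {
            tasks.add(new Runnable() {
                public void run() {
                    imported.incrementAndGet(); // placeholder for the real import work
                }
            });
        }
        BoundedImporter importer = new BoundedImporter(8);
        for (Future<?> f : importer.submitAll(tasks)) {
            f.get(); // rethrows any failure from the task, so no import fails silently
        }
        importer.shutdownAndWait(30);
        System.out.println("imported " + imported.get() + " of " + tasks.size());
    }
}
```

Checking each Future is what guards against silently lost work: a task that throws surfaces as an ExecutionException from get() instead of disappearing with its thread.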