Re: any solution for doing a data file import spawning it on multiple processes

From: "hb(at)101-factory(dot)eu" <hb(at)101-factory(dot)eu>
To: Edson Richter <edsonrichter(at)hotmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: any solution for doing a data file import spawning it on multiple processes
Date: 2012-06-16 18:15:33
Message-ID: 9E7215C2-C9C6-4B5A-A4D9-825ED610FF84@101-factory.eu
Lists: pgsql-general

thanks all, i will be looking into it.

Kind regards,

Henk

On 16 Jun 2012, at 18:23, Edson Richter <edsonrichter(at)hotmail(dot)com> wrote:

> On 16/06/2012 12:59, hb(at)101-factory(dot)eu wrote:
>> thanks, i thought about splitting the file, but that did not work out well.
>>
>> so we receive 2 files every 30 seconds and need to import them as fast as possible.
>>
>> we do not run java currently but maybe it's an option.
>> are you willing to share your code?
>>
>> also i was thinking of using perl for it
>>
>>
>> henk
>>
>> On 16 Jun 2012, at 17:37, Edson Richter <edsonrichter(at)hotmail(dot)com> wrote:
>>
>>> On 16/06/2012 12:04, hb(at)101-factory(dot)eu wrote:
>>>> hi there,
>>>>
>>>> I am trying to import large data files into pg.
>>>> for now i used the linux xargs command to spawn an import process for the file line by line, set to use the maximum available connections.
>>>>
>>>> we use pgpool as the connection pool to the database, and so try to maximize the concurrency of the data import.
>>>>
>>>> the problem for now is that it seems to work well, but we miss a line once in a while, and that is not acceptable. it also creates zombie processes ;(.
>>>>
>>>> does anybody have any other tricks that will do the job?
>>>>
>>>> thanks,
>>>>
>>>> Henk
>>> I've used a custom Java application with connection pooling (limited to 1000 connections, which means 1000 concurrent file imports).
>>>
>>> I'm able to import more than 64000 XML files (about 13Kb each) in 5 minutes, without memory leaks or zombies, and (of course) with no missing records.
>>>
>>> Besides the case where each thread imports a separate file, I have another situation where separate threads import different lines of the same file. No problems at all. Do not forget to check your OS "open files" limit (it was a big issue for me in the past, due to Lucene indexes generated during import).
>>>
>>> Server: 8 core Xeon, 16 GB RAM, SAS 15000 rpm disks, PostgreSQL 9.1.3, Linux CentOS 5, Sun Java 1.6.27.
>>>
>>> Regards,
>>>
>>> Edson Richter
>>>
>>>
>>> --
>>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-general
> I'm not allowed to publish my company's code, but the logic is very easy to understand (you will have to "invent" your own solution; the code below is bare bones):
>
> import java.io.File;
> import java.io.FileFilter;
>
> class MainThread implements Runnable {
>     // volatile so that a stop request made from another thread is seen by run()
>     private volatile boolean keepRunning = true;
>
>     public void run() {
>         while (keepRunning) {
>             try {
>                 executeFiles();
>                 Thread.sleep(30000); // sleep 30 seconds
>             } catch (Exception ex) {
>                 ex.printStackTrace();
>             }
>         }
>     }
>
>     private void executeFiles() {
>         File monitorDir = new File("/var/mydatafolder/");
>         File processingDir = new File("/var/myprocessingfolder/");
>
>         // I'll import only files with names like "data20120621.csv":
>         FileFilter fileFilter = new FileFilter() {
>             public boolean accept(File file) {
>                 if (!file.isFile() || file.isHidden()) return false;
>                 String fname = file.getName();
>                 return fname.startsWith("data") && fname.endsWith("csv");
>             }
>         };
>
>         // listFiles returns an array, or null if the directory cannot be read
>         File[] forProcessing = monitorDir.listFiles(fileFilter);
>         if (forProcessing == null) return;
>
>         for (File fileFound : forProcessing) {
>             // FileUtil is a utility class, you will have to create your own;
>             // your move method will vary according to your operating system
>             FileUtil.move(fileFound, processingDir);
>             // ProcessFile is a class that implements Runnable; do your stuff there
>             Thread t = new Thread(new ProcessFile(processingDir, fileFound.getName()));
>             t.start();
>         }
>     }
>
>     /** Use this method to stop the thread from another place in your complex system! */
>     public synchronized void stopWorker() {
>         keepRunning = false;
>     }
>
>     public static void main(String[] args) {
>         Thread t = new Thread(new MainThread());
>         t.start();
>     }
> }
>
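
The loop in the quoted code starts one new thread per file found. As a hedged alternative, the sketch below caps concurrency with a fixed-size thread pool and keeps one Future per file, so a failed import is reported instead of being lost silently. The names BoundedImport, FileImporter and importAll are hypothetical and not part of the original code; FileImporter stands in for the ProcessFile class mentioned above, and the pool size is an assumption to be tuned against the connection pool limit.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical callback that imports one file; it stands in for ProcessFile. */
interface FileImporter {
    void importFile(File file) throws Exception;
}

class BoundedImport {

    /**
     * Runs the importer on every file using at most poolSize threads and
     * returns the files whose import failed (an empty list means all succeeded).
     */
    static List<File> importAll(List<File> files, int poolSize, final FileImporter importer) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<Void>> results = new ArrayList<Future<Void>>();
        List<File> failed = new ArrayList<File>();
        try {
            for (final File file : files) {
                results.add(pool.submit(new Callable<Void>() {
                    public Void call() throws Exception {
                        importer.importFile(file);
                        return null;
                    }
                }));
            }
            for (int i = 0; i < files.size(); i++) {
                try {
                    results.get(i).get(); // blocks until this file is done
                } catch (ExecutionException e) {
                    failed.add(files.get(i)); // the importer threw for this file
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failed.add(files.get(i)); // outcome unknown, treat as failed
                }
            }
        } finally {
            pool.shutdown();
        }
        return failed;
    }

    public static void main(String[] args) {
        List<File> files = Arrays.asList(new File("data1.csv"), new File("data2.csv"));
        List<File> failed = importAll(files, 4, new FileImporter() {
            public void importFile(File file) {
                System.out.println("importing " + file.getName());
            }
        });
        System.out.println("failed imports: " + failed.size());
    }
}
```

Files returned in the failed list can be moved back to the monitored folder and retried on the next 30-second pass.
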

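For the other case mentioned in the thread, where separate threads import different lines of the same file, a minimal sketch could look like the following. It splits the lines into contiguous chunks, runs one thread per chunk, and counts the lines that were accepted, so a lost line shows up as a mismatch against the input size. ChunkedLineImport, LineSink, split and importLines are hypothetical names; a real LineSink would run an INSERT over a pooled connection, which is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sink for one line; a real one would write the line to the database. */
interface LineSink {
    void accept(String line) throws Exception;
}

class ChunkedLineImport {

    /** Splits lines into at most `workers` contiguous chunks. */
    static List<List<String>> split(List<String> lines, int workers) {
        List<List<String>> chunks = new ArrayList<List<String>>();
        int chunkSize = (lines.size() + workers - 1) / workers; // ceiling division
        for (int start = 0; start < lines.size(); start += chunkSize) {
            chunks.add(lines.subList(start, Math.min(start + chunkSize, lines.size())));
        }
        return chunks;
    }

    /** Imports all lines with one thread per chunk; returns how many lines were accepted. */
    static int importLines(List<String> lines, int workers, final LineSink sink) {
        final AtomicInteger imported = new AtomicInteger();
        List<Thread> threads = new ArrayList<Thread>();
        for (final List<String> chunk : split(lines, workers)) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    for (String line : chunk) {
                        try {
                            sink.accept(line);
                            imported.incrementAndGet(); // counted only after success
                        } catch (Exception e) {
                            System.err.println("failed line: " + line);
                        }
                    }
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // wait for every worker before reporting the count
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return imported.get();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1;a", "2;b", "3;c", "4;d", "5;e");
        int imported = importLines(lines, 2, new LineSink() {
            public void accept(String line) {
                // the INSERT for this line would go here
            }
        });
        if (imported != lines.size()) {
            System.err.println("missing " + (lines.size() - imported) + " line(s)");
        }
    }
}
```

Comparing the returned count with the number of input lines gives a cheap check for the "we miss a line once in a while" problem before the file is marked as done.
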