From: vignesh C <vignesh21(at)gmail(dot)com>
To: Greg Nancarrow <gregn4422(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Parallel copy
Date: 2020-09-02 05:39:55
Message-ID: CALDaNm0Y_PZvB7jO_gFd2C1nmN0_kSFFb+gA+DqrekPidN-pnQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the
> > ddl of the table, so that it will help me check the scenario where
> > you faced the problem?
>
> Unfortunately I can't directly share it (it's considered company IP),
> though having said that, it's only doing something relatively simple
> and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of types
> like "character varying(255)", "numeric", "date" and "time without
> time zone". There are about 60 of the "character varying(255)" columns
> overall, with the other types interspersed.
>
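For concreteness, a 100-column table of roughly that shape could be generated along these lines (a sketch only: the table/column names and the exact type ordering are my assumptions, since the actual DDL isn't shared):

    # Generate a CREATE TABLE with c1 bigserial and 99 more columns,
    # roughly 60 of them varchar(255) interspersed with numeric/date/time.
    {
      printf 'CREATE TABLE copy_test (\n  c1 bigserial'
      for i in $(seq 2 100); do
        case $((i % 10)) in
          2|8) t="numeric" ;;
          4)   t="date" ;;
          6)   t="time without time zone" ;;
          *)   t="character varying(255)" ;;
        esac
        printf ',\n  c%d %s' "$i" "$t"
      done
      printf '\n);\n'
    } | psql -d postgres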
> When testing with indexes, four b-tree indexes were used, each
> including the first column plus nine other columns distinct to that
> index.
>
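The four b-tree indexes could then be approximated as follows (again a sketch with assumed column choices: each index gets c1 plus nine columns not used by any other index):

    # Create 4 b-tree indexes, each on c1 plus a disjoint set of 9 columns.
    for n in 0 1 2 3; do
      cols="c1"
      for j in $(seq 1 9); do
        cols="$cols, c$((1 + n * 9 + j))"
      done
      psql -d postgres -c "CREATE INDEX copy_test_idx$n ON copy_test ($cols)"
    done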
> A CSV record (row) template file was created with test data
> (corresponding to the table), and that was simply copied and appended
> over and over, with a record-number prefix, in order to create the
> test data file.
> The following shell script basically does it (but very slowly). I was
> using a small C program to do the same, a lot faster.
> In my case, N=2550000 produced about a 5GB CSV file.
>
> # append N copies of the sample record, each prefixed with a serial number
> file_out=data.csv; N=2550000
> for i in $(seq 1 "$N"); do echo -n "$i," >> "$file_out";
> cat sample_record.csv >> "$file_out"; done
>
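A much faster pure-shell alternative (a sketch, assuming sample_record.csv holds a single-line record) is to build the file in one pass with awk instead of reopening the output for every row:

    # Read the sample record once, then emit N prefixed copies in one stream.
    N=2550000
    awk -v n="$N" '
        NR == 1 { rec = $0 }                 # remember the sample record
        END { for (i = 1; i <= n; i++) print i "," rec }
    ' sample_record.csv > data.csv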
> One other thing I should mention is that between each test run, I
> cleared the OS page cache, as described here:
> https://linuxhint.com/clear_cache_linux/
> That way, each COPY FROM is not taking advantage of any OS-cached data
> from a previous COPY FROM.
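That page describes the standard drop_caches approach, i.e. something along these lines:

    # flush dirty pages, then drop the page cache (needs root)
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches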
I will try a similar test and check whether I can reproduce the issue.
> If your data is somehow significantly different and you want to (and
> can) share your script, then I can try it in my environment.
I have attached the scripts that I used for the test results mentioned
in my previous mail. create.sql has the table that I used, and
insert_data_gen.txt has the insert-data generation script. I varied the
count in insert_data_gen to generate csv files of 1GB, 2GB & 5GB, and
varied the data to generate 1, 10 & 100 characters per column for the
various tests. You can rename insert_data_gen.txt to insert_data_gen.sh
& generate the csv file, as shown below.
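Roughly (where the generated csv ends up depends on the script itself):

    mv insert_data_gen.txt insert_data_gen.sh
    chmod +x insert_data_gen.sh
    ./insert_data_gen.sh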
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachments:
    insert_data_gen.txt (text/plain, 9.9 KB)
    create.sql (application/sql, 1.3 KB)