Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres

From: Steve Midgley <science(at)misuse(dot)org>
To: Shaozhong SHI <shishaozhong(at)gmail(dot)com>
Cc: pgsql-sql <pgsql-sql(at)lists(dot)postgresql(dot)org>
Subject: Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres
Date: 2021-10-04 23:34:27
Message-ID: CAJexoSJ8nXoO9mxORayYrhiP+AOwz0Sifj8Y=x02cHLFCV9bVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general pgsql-sql

On Mon, Oct 4, 2021 at 1:19 PM Shaozhong SHI <shishaozhong(at)gmail(dot)com> wrote:

> Hello, Steve,
>
> That is interesting.
>
> Send me the link for Python doing "Copy---From....CSV".
>
> I would like to review and test.
>
> Regards,
>
> David
>
> On Mon, 4 Oct 2021 at 18:30, Steve Midgley <science(at)misuse(dot)org> wrote:
>
>>
>>
>> On Mon, Oct 4, 2021 at 8:55 AM Shaozhong SHI <shishaozhong(at)gmail(dot)com>
>> wrote:
>>
>>> Has anyone tested this one?
>>> A Fast Method to Bulk Insert a Pandas DataFrame into Postgres · Ellis
>>> Valentiner
>>> <https://ellisvalentiner.com/post/a-fast-method-to-insert-a-pandas-dataframe-into-postgres/>
>>>
>>> I tried psql_insert_copy method, but I got the following error message.
>>>
>>> to_sql() got an unexpected keyword argument 'method'
>>>
>>>
>>> Can anyone shed light on this?
>>>
>>>
>> Interesting - I wasn't aware there was a faster method than "COPY ...
>> FROM ... CSV"
>>
>> Maybe my best input is that I've always found COPY/FROM/CSV to be
>> insanely fast and really easy to write from any standard console or ORM, so
>> maybe the fractional performance increase from other methods isn't worth
>> it, given how performant, reliable and easy to use this approach is? Not
>> really an answer but food for thought..
>>
>>
>> The way I've implemented that (using Ruby, but very similar) is to
actually copy from STDIN rather than CSV. In my implementation I wasn't
running my import code on the Postgres server, so I couldn't get the CSV
file onto the local DB machine. So I had to feed the data over the wire via
STDIN. This is slower than if you can point Postgres to a CSV file that the
Pg server itself can access on the file system, but it's more flexible and
still remarkably fast. It can be even (a bit) faster if you don't feed
STDIN line by line, but in my experience it's a nightmare to debug if you
don't get a line number when an import file barfs.

Anyway, here's the Ruby code that shows the fast import concept:
https://gist.github.com/science/393907d4123c87ed767bc81e9dd5a7da

I wrote this a LONG time ago, but I think the concepts are still relevant.
IIRC, feeding Pg from STDIN was ~400x faster than using "insert into" or
other SQL type commands to get the data in.

I hope this is helpful - I would think this code is easily portable to
Python, as you just need to get a raw connection to Postgres from your ORM
adapter and then model the commands the same way in the code above. Good
luck and write with questions..

Steve

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Ryan Booz 2021-10-05 00:49:14 Re: Growth planning
Previous Message Dan Davis 2021-10-04 22:57:25 Re: How to build psycopg2 for Windows

Browse pgsql-sql by date

  From Date Subject
Next Message Shaozhong SHI 2021-10-05 13:05:56 Is it possible to combine the power of Postgres and Python in do statement?
Previous Message Shaozhong SHI 2021-10-04 20:13:25 Re: Testing of a fast method to bulk insert a Pandas DataFrame into Postgres