From: Lok P <loknath(dot)73(at)gmail(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: How batch processing works
Date: 2024-09-22 05:21:07
Message-ID: CAKna9VYu5O5ZsdT9iFJZ3FwYJVo0ocyczYbge1rwYDSiRZ8Rmg@mail.gmail.com
Lists: pgsql-general
On Sun, Sep 22, 2024 at 12:46 AM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
wrote:
> On 9/21/24 07:36, Peter J. Holzer wrote:
> > On 2024-09-21 16:44:08 +0530, Lok P wrote:
>
> >
> ---------------------------------------------------------------------------------------------------
> > #!/usr/bin/python3
> >
> > import time
> > import psycopg2
> >
> > num_inserts = 10_000
> > batch_size = 50
> >
> > db = psycopg2.connect()
> > csr = db.cursor()
> >
> > csr.execute("drop table if exists parent_table")
> > csr.execute("create table parent_table (id int primary key, t text)")
> > db.commit()
> >
> > start_time = time.monotonic()
> > for i in range(1, num_inserts+1):
> > csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
> > if i % batch_size == 0:
> > db.commit()
> > db.commit()
> > end_time = time.monotonic()
> > elapsed_time = end_time - start_time
> > print(f"Method 2: Individual Inserts with Commit after {batch_size}
> Rows: {elapsed_time:.3} seconds")
> >
> > # vim: tw=99
> >
> ---------------------------------------------------------------------------------------------------
>
> FYI, this is less of a problem with psycopg(3) and pipeline mode:
>
> import time
> import psycopg
>
> num_inserts = 10_000
> batch_size = 50
>
> db = psycopg.connect("dbname=test user=postgres host=104.237.158.68")
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
> db.commit()
>
> start_time = time.monotonic()
> with db.pipeline():
>     for i in range(1, num_inserts+1):
>         csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
>         if i % batch_size == 0:
>             db.commit()
> db.commit()
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 2: Individual Inserts(psycopg3 pipeline mode) with Commit
> after {batch_size} Rows: {elapsed_time:.3} seconds")
>
>
> For a remote connection to a database in another state, that took the time from:
>
> Method 2: Individual Inserts with Commit after 50 Rows: 2.42e+02 seconds
>
> to:
>
> Method 2: Individual Inserts(psycopg3 pipeline mode) with Commit after 50 Rows: 9.83 seconds
>
> > #!/usr/bin/python3
> >
> > import itertools
> > import time
> > import psycopg2
> >
> > num_inserts = 10_000
> > batch_size = 50
> >
> > db = psycopg2.connect()
> > csr = db.cursor()
> >
> > csr.execute("drop table if exists parent_table")
> > csr.execute("create table parent_table (id int primary key, t text)")
> > db.commit()
> >
> > start_time = time.monotonic()
> > batch = []
> > for i in range(1, num_inserts+1):
> >     batch.append((i, 'a'))
> >     if i % batch_size == 0:
> >         q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
> >         params = list(itertools.chain.from_iterable(batch))
> >         csr.execute(q, params)
> >         db.commit()
> >         batch = []
> > if batch:
> > q = "insert into parent_table values" + ",".join(["(%s, %s)"] *
> len(batch))
> > csr.execute(q, list(itertools.chain(batch)))
> > db.commit()
> > batch = []
> >
> > end_time = time.monotonic()
> > elapsed_time = end_time - start_time
> > print(f"Method 3: Batch Inserts ({batch_size}) with Commit after each
> batch: {elapsed_time:.3} seconds")
> >
> > # vim: tw=99
> >
> ---------------------------------------------------------------------------------------------------
>
> The above can also be handled with execute_batch() and execute_values()
> from:
>
> https://www.psycopg.org/docs/extras.html#fast-execution-helpers
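>
> For illustration, a minimal, untested sketch using execute_values()
> (assuming the same parent_table and default connection as the scripts
> above):
>
> import time
> import psycopg2
> from psycopg2.extras import execute_values
>
> num_inserts = 10_000
> batch_size = 50
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> rows = [(i, 'a') for i in range(1, num_inserts + 1)]
> start_time = time.monotonic()
> for start in range(0, num_inserts, batch_size):
>     # execute_values() folds each slice into one multi-row INSERT,
>     # so a single round trip carries batch_size rows.
>     execute_values(csr, "insert into parent_table values %s",
>                    rows[start:start + batch_size])
>     db.commit()
> elapsed_time = time.monotonic() - start_time
> print(f"execute_values ({batch_size}) with Commit after each batch: {elapsed_time:.3} seconds")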
>
> >
> > On my laptop, method2 is about twice as fast as method3. But if I
> > connect to a database on the other side of the city, method2 is now more
> > than 16 times slower than method3, simply because the delay in
> > communication is now large compared to the time it takes to insert those
> > rows.
> >
>
>
>
>
Thank you. So if I understand correctly: if the client app (from which the
data is getting streamed/inserted) is in the same data center/zone as the
database (which is the case most of the time), then batch insert does not
appear to be very beneficial.
Which also means that people here were afraid of having triggers on such a
high-DML table because a trigger would turn the "batch insert" into "row by
row" behind the scenes; but considering the above results, it looks fine to
go with a row-by-row approach (just with a batch commit in place of a
row-by-row commit), and not to worry about implementing the true batch
insert approach, as that is not making a big difference in data load
performance here.
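To make sure I have the shape right, below is a rough, untested sketch of
what I mean (it assumes the parent_table from the scripts above; the audit
trigger is hypothetical, just to model a trigger on a high-DML table):

import psycopg2

num_inserts = 10_000
batch_size = 50

db = psycopg2.connect()
csr = db.cursor()

# Hypothetical audit trigger: it fires once per inserted row no matter
# how the rows arrive, which is the concern about triggers raised above.
csr.execute("""
    create table if not exists parent_audit (id int, at timestamptz default now());
    create or replace function parent_audit_fn() returns trigger as $$
    begin
        insert into parent_audit(id) values (new.id);
        return new;
    end $$ language plpgsql;
    drop trigger if exists parent_audit_trg on parent_table;
    create trigger parent_audit_trg after insert on parent_table
        for each row execute function parent_audit_fn();
""")
db.commit()

# Row-by-row inserts, but committing once per batch_size rows.
for i in range(1, num_inserts + 1):
    csr.execute("insert into parent_table values (%s, %s)", (i, 'a'))
    if i % batch_size == 0:
        db.commit()
db.commit()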