Re: How batch processing works

From: "Peter J(dot) Holzer" <hjp-pgsql(at)hjp(dot)at>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: How batch processing works
Date: 2024-09-21 14:36:29
Message-ID: 20240921143629.t2x37xfczeeunpnf@hjp.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 2024-09-21 16:44:08 +0530, Lok P wrote:
> But wondering why we don't see any difference in performance between method-2
> and method-3 above.

The code runs completely inside the database. So there isn't much
difference between a single statement which inserts 50 rows and 50
statements which insert 1 row each. The work to be done is (almost) the
same.

This changes once you consider an application which runs outside of the
database (maybe even on a different host). Such an application has to
wait for the result of each statement before it can send the next one.
Now it makes a difference whether you are waiting 50 times for a
statement which does very little or just once for a statement which does
more work.

> So does it mean that,I am testing this in a wrong way or

That depends on what you want to test. If you are interested in the
behaviour of stored procedures, the test is correct. If you want to know
about the performance of a database client (whether its written in Java,
Python, Go or whatever), this is the wrong test. You have to write the
test in your target language and run it on the client system to get
realistic results (for example, the round-trip times will be a lot
shorter if the client and database are on the same computer than when
one is in Europe and the other in America).

For example, here are the three methods as Python scripts:

---------------------------------------------------------------------------------------------------
#!/usr/bin/python3

import time
import psycopg2

num_inserts = 10_000

db = psycopg2.connect()
csr = db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")

start_time = time.monotonic()
for i in range(1, num_inserts+1):
csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
db.commit()
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 1: Individual Inserts with Commit after every Row: {elapsed_time:.3} seconds")

# vim: tw=99
---------------------------------------------------------------------------------------------------
#!/usr/bin/python3

import time
import psycopg2

num_inserts = 10_000
batch_size = 50

db = psycopg2.connect()
csr = db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")
db.commit()

start_time = time.monotonic()
for i in range(1, num_inserts+1):
csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
if i % batch_size == 0:
db.commit()
db.commit()
end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 2: Individual Inserts with Commit after {batch_size} Rows: {elapsed_time:.3} seconds")

# vim: tw=99
---------------------------------------------------------------------------------------------------
#!/usr/bin/python3

import itertools
import time
import psycopg2

num_inserts = 10_000
batch_size = 50

db = psycopg2.connect()
csr = db.cursor()

csr.execute("drop table if exists parent_table")
csr.execute("create table parent_table (id int primary key, t text)")
db.commit()

start_time = time.monotonic()
batch = []
for i in range(1, num_inserts+1):
batch.append((i, 'a'))
if i % batch_size == 0:
q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
params = list(itertools.chain.from_iterable(batch))
csr.execute(q, params)
db.commit()
batch = []
if batch:
q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
csr.execute(q, list(itertools.chain(batch)))
db.commit()
batch = []

end_time = time.monotonic()
elapsed_time = end_time - start_time
print(f"Method 3: Batch Inserts ({batch_size}) with Commit after each batch: {elapsed_time:.3} seconds")

# vim: tw=99
---------------------------------------------------------------------------------------------------

On my laptop, method2 is about twice as fast as method3. But if I
connect to a database on the other side of the city, method2 is now more
than 16 times faster than method3 . Simply because the delay in
communication is now large compared to the time it takes to insert those
rows.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp(at)hjp(dot)at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Lok P 2024-09-21 15:25:13 Re: How batch processing works
Previous Message Peter J. Holzer 2024-09-21 13:19:31 Re: IO related waits