Re: How batch processing works

From: Lok P <loknath(dot)73(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: How batch processing works
Date: 2024-09-21 15:25:13
Message-ID: CAKna9VZA5x1g7-4j8cQrW1ByqX5jtPwY7sY2MHuRgrxrOu4LBg@mail.gmail.com
Lists: pgsql-general

On Sat, Sep 21, 2024 at 8:07 PM Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:

> On 2024-09-21 16:44:08 +0530, Lok P wrote:
> > But wondering why we don't see any difference in performance between
> > method-2 and method-3 above.
>
> The code runs completely inside the database. So there isn't much
> difference between a single statement which inserts 50 rows and 50
> statements which insert 1 row each. The work to be done is (almost) the
> same.
>
> This changes once you consider an application which runs outside of the
> database (maybe even on a different host). Such an application has to
> wait for the result of each statement before it can send the next one.
> Now it makes a difference whether you are waiting 50 times for a
> statement which does very little or just once for a statement which does
> more work.
>
> > So does it mean that I am testing this in a wrong way or
>
> That depends on what you want to test. If you are interested in the
> behaviour of stored procedures, the test is correct. If you want to know
> about the performance of a database client (whether its written in Java,
> Python, Go or whatever), this is the wrong test. You have to write the
> test in your target language and run it on the client system to get
> realistic results (for example, the round-trip times will be a lot
> shorter if the client and database are on the same computer than when
> one is in Europe and the other in America).
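[The waiting-time arithmetic described above can be sketched as a toy model; all the numbers below are illustrative assumptions, not measurements:]

```python
def total_time(n_rows, batch_size, rtt, per_row_work):
    # One client/server round trip per statement; each statement
    # carries batch_size rows of server-side work.
    n_statements = n_rows // batch_size
    return n_statements * rtt + n_rows * per_row_work

# 10,000 rows, a 1 ms round trip, 10 microseconds of work per row
one_by_one = total_time(10_000, 1, 0.001, 0.00001)   # 10,000 round trips
batched = total_time(10_000, 50, 0.001, 0.00001)     # 200 round trips
```

[With these made-up numbers the batched variant wins on round-trip count alone; as rtt approaches zero (client and server on the same machine), the two converge.]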
>
> For example, here are the three methods as Python scripts:
>
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import time
> import psycopg2
>
> num_inserts = 10_000
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
>
> start_time = time.monotonic()
> for i in range(1, num_inserts+1):
>     csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
>     db.commit()
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 1: Individual Inserts with Commit after every Row: {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import time
> import psycopg2
>
> num_inserts = 10_000
> batch_size = 50
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
> db.commit()
>
> start_time = time.monotonic()
> for i in range(1, num_inserts+1):
>     csr.execute("insert into parent_table values(%s, %s)", (i, 'a'))
>     if i % batch_size == 0:
>         db.commit()
> db.commit()
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 2: Individual Inserts with Commit after {batch_size} Rows: {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
> #!/usr/bin/python3
>
> import itertools
> import time
> import psycopg2
>
> num_inserts = 10_000
> batch_size = 50
>
> db = psycopg2.connect()
> csr = db.cursor()
>
> csr.execute("drop table if exists parent_table")
> csr.execute("create table parent_table (id int primary key, t text)")
> db.commit()
>
> start_time = time.monotonic()
> batch = []
> for i in range(1, num_inserts+1):
>     batch.append((i, 'a'))
>     if i % batch_size == 0:
>         q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
>         params = list(itertools.chain.from_iterable(batch))
>         csr.execute(q, params)
>         db.commit()
>         batch = []
> if batch:
>     q = "insert into parent_table values" + ",".join(["(%s, %s)"] * len(batch))
>     csr.execute(q, list(itertools.chain.from_iterable(batch)))
>     db.commit()
>     batch = []
>
> end_time = time.monotonic()
> elapsed_time = end_time - start_time
> print(f"Method 3: Batch Inserts ({batch_size}) with Commit after each batch: {elapsed_time:.3} seconds")
>
> # vim: tw=99
>
> ---------------------------------------------------------------------------------------------------
>
> On my laptop, method2 is about twice as fast as method3. But if I
> connect to a database on the other side of the city, method3 is now more
> than 16 times faster than method2, simply because the delay in
> communication is now large compared to the time it takes to insert those
> rows.
>
>
Thank you so much.
I was expecting method-3 (batch insert) to be the fastest, or at least, as you said, to perform at a speed similar to method-2 (row-by-row insert with batched commits) when everything runs inside the database in a procedure. Since method-3 prepares the whole insert and submits it to the database in a single call, the context switching should be minimal, so it ought to be a bit faster. But from your figures it appears to be the opposite: method-2 is faster than method-3, and I am not able to understand the reason. So in this case it appears we can follow method-2, as it is cheaper in terms of code change: we just shift the commit points, without any changes to do batch inserts.

Btw, do you have any thoughts on why method-2 is faster than method-3 in your test?
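[For reference, the query-building step used in the method-3 script above can be isolated into a small helper; this is a sketch only, and `build_multirow_insert` is my own name, not something from the scripts:]

```python
import itertools

def build_multirow_insert(batch):
    # One INSERT ... VALUES (%s, %s),(%s, %s),... covering the whole batch,
    # so the client pays a single round trip per batch.
    q = "insert into parent_table values " + ",".join(["(%s, %s)"] * len(batch))
    # Flatten [(1, 'a'), (2, 'a')] into [1, 'a', 2, 'a'] for the placeholders
    params = list(itertools.chain.from_iterable(batch))
    return q, params

q, params = build_multirow_insert([(1, 'a'), (2, 'a')])
# q -> "insert into parent_table values (%s, %s),(%s, %s)"
```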
