Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED

From: Jim Jarvie <jim(at)talentstack(dot)to>
To: Michael Lewis <mlewis(at)entrata(dot)com>
Cc: Pgsql Performance <pgsql-performance(at)lists(dot)postgresql(dot)org>
Subject: Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Date: 2020-08-20 13:35:38
Message-ID: ccaa5c05-196a-50c4-9d9d-705e3a6bccac@talentstack.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Updates added, mixture of good and bad news:

On 18-Aug.-2020 20:39, Michael Lewis wrote:
> How long does each process take in total? How strict does that FIFO really
> need to be when you are already doing SKIP LOCKED anyway?

The processes all bottleneck[ed] for several minutes, approximately
exponential to the number above the threshold where the problem
happened.  Up to around 60 concurrent worked with minimal delay but
beyond that a few more added about a minute, 70 about 5 minutes, 80
about 30 minutes and beyond that was hours (I left up to 12 hours one time).

However, I removed the order by clause which eliminated [only] the high
CPU.  The processes all stopped in the same pattern, just without the
high CPU use.  So the ordering was the factor in the CPU use, but was
not responsible for the - forgive the pun - lock up.

I then added a random few seconds of delay to each process before it
executes the select in order to prevent too many of them colliding on
simultaneous selects.  That was enough to make the lock-up disappear and
individual selects complete in a few ms, regardless of how many other
concurrent transactions are in progress (tested up to 192 concurrent). 
But still not quite out the woods - read below.

> Can you expound on the partitioning? Are all consumers of the queue always
> hitting one active partition and anytime a row is processed, it always
> moves to one of many? archived type partitions?

Partitions are list partitioned as 'incoming', 'processing', 'retry',
'ok', 'failed':

Incoming: This is itself hash partitioned (64 partitions) approx 10M
rows added/day so partitioned to allow incoming throughput; this works well.

Processing: Simple partition, data is moved into this in blocks as the
rows count drops below a threshold, another block is added, coming from
the incoming.

Retry: simple partition, non fatal errors go in here and go back into
the processing queue for retries later.

Failed: simple partition, fatal errors go here.  Thankfully very few.

OK: hash partition, as everything that was in incoming should eventually
end up here.  64 partitions currently.

There is one interesting thing about this.  When the SELECT FOR UPDATE
SKIP LOCKED is executed, reasonably frequently, the select aborts with
the error:

Tuple to be locked was already moved to another partition due to
concurrent update.

This error still persists even though the lock-up has been removed by
the time delay, so there is a regular stream of transactions aborting
due to this (I just re-run the transaction to recover).

Now, if locking worked as I understand it, if another process locked and
migrated, this should still have left the lock in place on the original
partition and created a new one on the newly inserted partition until a
commit was done.  The second process should not have visibility on the
newly inserted row and the skip locked should simply have skipped over
the locked but deleted row on the original partition.

What am I missing?  All of this feels like some locking/partitioning
issue but I'm unsure exactly what.

> Is that done via FDW or otherwise within the same database transaction? Are
> you connecting some queue consumer application code to Postgres, select for
> update, doing work on some remote system that is slow, and then coming back
> and committing the DB work?
Alas not FDW, an actual external system elsewhere in the world which
sends an ACK when it has processed the message.  I have no control or
influence on this.
> By the way, top-posting is discouraged here and partial quotes with
> interspersed comments are common practice.
Noted!

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Michael Lewis 2020-08-20 17:30:52 Re: CPU hogged by concurrent SELECT..FOR UPDATE SKIP LOCKED
Previous Message Kiran Singh 2020-08-19 13:15:44 Re: Replication lag due to lagging restart_lsn