Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2

From: "Yngve N(dot) Pettersen" <yngve(at)spec-work(dot)net>
To: pgsql-general(at)postgresql(dot)org, "Yngve N(dot) Pettersen" <yngve(at)spec-work(dot)net>
Subject: Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2
Date: 2014-03-15 13:53:19
Message-ID: op.xcrni5uu3dfyax@killashandra.invalid.invalid
Lists: pgsql-general


Hmmm, just tested with the extra conditional indexes in the production
system, and it still took 19 minutes for the first group of queries (23
million entries, 6000 record updates, 8 processes). Afterwards, there were
no such delays.
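
For reference, the extra conditional index referred to here is along the
lines of the one described further down in the thread; for this table it
would be something like the following (the index name is only an example):

    -- Partial index covering the UPDATE's conditions, so that id lookups
    -- restricted to idle rows do not have to go through the "run_idle" index:
    CREATE INDEX queue_queueitems_idle_id ON queue_queueitems (id)
        WHERE state::text = 'I'::text;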

Munin reports that during those 19 minutes there were, on average, about
1.4 million index-fetched tuples per second. Assuming the ANALYZE
operations are not included in those numbers, this seems to indicate that
each process accessed around 200 million tuples in order to update 6000
rows (roughly 36000 tuples per updated row; depending on what is counted,
I would have expected this to be in the range of 1 to 100 tuples per
updated row).
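
The 200 million figure is just a back-of-the-envelope estimate, assuming
the Munin rate is shared evenly by the 8 processes over the full 19
minutes:

    -- Rough sanity check of the numbers above (assumptions: 1.4M
    -- index-fetched tuples/s, 19 minutes, 8 processes):
    SELECT 1.4e6 * 19 * 60     AS total_index_tuples,    -- ~1.6 billion
           1.4e6 * 19 * 60 / 8 AS tuples_per_process;    -- ~200 million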

Now that the system has spun up to 200+ processes, looking up existing
data and adding new data, it is accessing about 1.6 million index-fetched
tuples per second. IOW 8 processes doing what should have been a simple
update almost matched the activity of 200+ processes (which are not yet
running at full power).

Any suggestions for where to look?

On Sat, 15 Mar 2014 11:49:36 +0100, Yngve N. Pettersen
<yngve(at)spec-work(dot)net> wrote:

> Hi again,
>
> A further update, and it looks like I have finally been able to "fix"
> the problem.
>
> I used gdb to discover where the process is hanging.
>
> As far as I can tell, the processes are looping inside
>
> ExecScan
> calling ExecQual
> calling ExecEvalScalarArrayOp
>
> ExecScan was apparently called from ExecIndexScan, via ExecProcNode,
> ExecModifyTable, ExecProcNode, and standard_ExecutorRun
>
> This operation seems to take 19 minutes for the current test
> configuration
> (23 million entries, updating 6000, 4-10 threads).
>
>
> The EXPLAIN for the
>
>    UPDATE queue_queueitems SET "state"=E'S'
>    WHERE "state" = E'I' AND id IN (...)
>    RETURNING id
>
> queries seems to be
>
>    Update on queue_queueitems  (cost= ... rows=1 width=22)
>      ->  Index Scan using queue_queueitems_run_idle on queue_queueitems  (cost= ... rows=1 width=22)
>            Filter: (((state)::text = 'I'::text) AND (id = ANY ('{ .... }'::bigint[])))
>
>
> Combined with the GDB info this led me to consider the possibility that
> the query planner selected the wrong index for use with the UPDATE
> query, so I tried adding an index on (id) WHERE "state" = E'I' , and
> this immediately reduced the running time to a couple of seconds; remove
> this new index, and it was back to 19 minutes run time.
>
> IMO the plan for the above query should have used the
> "queue_queueitems_pkey" index for the primary filtering, followed by a
> filter on the "state" column.
>
> It seems, although I have no idea if it is true, that the planner
> selects the "run_idle" index because of its condition, and ignores the
> fact that the indexed column is not related to the query and that the
> other condition in the query is on the primary key. The resulting
> operation appears to be a full sequential scan of the "run_idle"
> index's record list for each single item in the "id" condition; that
> is, it looks like it executes 6000 * 23 million operations, rather than
> 6000 record retrievals, or a single pass over the 23 million entries
> that filters out the 6000 ids.
>
> I am also starting to suspect that the reason the production system
> starts speeding up after the first group of queries is that the
> continuous ANALYZE operations on that table every few minutes
> eventually steer the planner away from the run_idle index for that
> kind of query, favoring the primary key index instead.
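>
> One way to check that hypothesis (just a sketch, using the standard
> statistics view) would be to look at when the table was last analyzed:
>
>    SELECT relname, last_analyze, last_autoanalyze
>    FROM pg_stat_user_tables
>    WHERE relname = 'queue_queueitems';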
>
>
>                        Table "public.queue_queueitems"
>   Column   |         Type         |                           Modifiers
> -----------+----------------------+---------------------------------------------------------------
>  id        | bigint               | not null default nextval('queue_queueitems_id_seq'::regclass)
>  group_id  | integer              | not null
>  target_id | integer              | not null
>  state     | character varying(1) |
> Indexes:
>     "queue_queueitems_pkey" PRIMARY KEY, btree (id)
>     "queue_queueitems_group_id" btree (group_id)
>     "queue_queueitems_run_idle" btree (group_id) WHERE state::text = 'I'::text
>     "queue_queueitems_target_id" btree (target_id)
> Foreign-key constraints:
>     "queue_queueitems_group_id_fkey" FOREIGN KEY (group_id) REFERENCES queue_queuerun(id) DEFERRABLE INITIALLY DEFERRED
>     "queue_queueitems_target_id_fkey" FOREIGN KEY (target_id) REFERENCES queue_queuetarget(id) DEFERRABLE INITIALLY DEFERRED
>
>
>
> On Mon, 10 Mar 2014 02:41:44 +0100, Yngve N. Pettersen
> <yngve(at)spec-work(dot)net> wrote:
>
>>
>> Hi again,
>>
>> I have now had time to do further research about this issue. I have been
>> able to produce a script (available on request) that reproduces the
>> problem, even in tables as small as 100 items and using a single thread,
>> and as a result I may have located the cause of the problem: a
>> conditional index.
>>
>> As mentioned in the attachment to my previous email, the table which is
>> having the problem looks like this:
>>
>>                     Table "public.probedata2_probequeue"
>>      Column     |         Type         |                      Modifiers
>> ----------------+----------------------+-------------------------------------
>>  id             | bigint               | not null default nextval('probedata2_probequeue_id_seq'::regclass)
>>  part_of_run_id | integer              | not null
>>  server_id      | integer              | not null
>>  state          | character varying(1) |
>> Indexes:
>>     "probedata2_probequeue_pkey" PRIMARY KEY, btree (id)
>>     "run_server" UNIQUE CONSTRAINT, btree (part_of_run_id, server_id)
>>     "probedata2_probequeue_finished" btree (id) WHERE state::text = 'F'::text
>>     "probedata2_probequeue_run_idle" btree (part_of_run_id) WHERE state::text = 'I'::text
>>     "probedata2_probequeue_started" btree (part_of_run_id) WHERE state::text = 'S'::text
>> Foreign-key constraints:
>>     "probedata2_probequeue_part_of_run_id_fkey" FOREIGN KEY (part_of_run_id) REFERENCES probedata2_proberun(id) DEFERRABLE INITIALLY DEFERRED
>>     "probedata2_probequeue_server_id_fkey" FOREIGN KEY (server_id) REFERENCES probedata2_server(id) DEFERRABLE INITIALLY DEFERRED
>>
>>
>> In my test database I have been running tests without the
>> "probedata2_probequeue_run_idle"-equivalent and the other conditional
>> indexes.
>>
>> Without the "run_idle"-index the queries (fetch idle candidates, mark as
>> started, fetch the records) complete in less than a second (<0.9
>> seconds),
>> as expected.
>>
>> *With* the index, the time to complete that operation increases by a
>> factor ranging from 10-30 times for small sequences and a single thread,
>> to 1000 times for large sequences (23 million, 4-10 threads), taking up
>> to 20
>> minutes to complete an update of 6000 rows for each thread, running the
>> process at 100% CPU the whole time.
>>
>> The purpose of the index is to provide quick access to the idle items
>> for a given job, both the records themselves and counts. Normally there
>> is currently just a single active job in the table, and at the time the
>> problem occurs all entries for that job will be in the index.
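>>
>> As an illustration, the kind of count query the index is meant to serve
>> is simply something like this (using the run id from the examples
>> further down):
>>
>>    SELECT count(*) FROM probedata2_probequeue
>>    WHERE state = E'I' AND part_of_run_id = 334;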
>>
>> As I mentioned last time, the problematic command is the UPDATE
>> command
>>
>> UPDATE probedata2_probequeue SET "state"=E'S'
>> WHERE "state" = E'I' AND id IN ( .....)
>> RETURNING id
>>
>> and I have confirmed that with the built-in Django debug query
>> information
>> in my test script.
>>
>>
>> On Sat, 08 Feb 2014 15:57:10 +0100, Yngve N. Pettersen
>> <yngve(at)spec-work(dot)net> wrote:
>>
>>> Hi again,
>>>
>>> Sorry about the delay, but an unexpected 3-week trip, combined with not
>>> being able to fix the system's router (whose configuration had become
>>> corrupted before I left), meant that I could not perform any testing
>>> until this week, after the router had been fixed and reinstalled.
>>>
>>> I just did a test using expanded logging of queries and duration, and
>>> actually saw this happen with just 4 (four) processes, not the 10 I
>>> have
>>> seen before.
>>>
>>> The relevant parts of the log, as well as an EXPLAIN and table info
>>> dump
>>> are attached in a zipfile; the large parts consisting of 6000
>>> to-be-updated
>>> IDs in the UPDATE commands have been removed for clarity (the sequences
>>> were not overlapping, with a numerical distance of at least 80000 from
>>> the other updates, and each sequence was within a range of 7500 IDs)
>>>
>>> Background: Immediately before this operation the queue had been set
>>> up with the
>>> command
>>>
>>>    INSERT INTO probedata2_probequeue (part_of_run_id, server_id, state)
>>>    SELECT '334' AS part_of_run_id, server_id, E'I' AS state
>>>    FROM probedata2_preparedqueueitem
>>>    WHERE part_of_queue_id = '2'
>>>
>>> followed by a COMMIT, a process taking ~15 minutes to copy a 23 million
>>> entry table into the queue. After this the total number of entries in
>>> the
>>> target table is 70 million, distributed across 3 runs.
>>>
>>> Some SELECT count and ANALYZE operations have probably also been
>>> performed after the initialization operation, before the processes
>>> were started.
>>>
>>> Shortly afterward (within a couple of minutes), four processes
>>> initiated the SELECT and UPDATE sequence I outlined earlier; each
>>> operation was started between 1 and 10 seconds after one of the
>>> others, with 20 seconds from the start of the first command to the
>>> start of the last.
>>>
>>> SELECT id FROM probedata2_probequeue
>>> WHERE state = E'I' AND part_of_run_id = 334
>>> LIMIT 6000 OFFSET 85103
>>>
>>> UPDATE probedata2_probequeue SET "state"=E'S'
>>> WHERE "state" = E'I' AND id IN ( .....)
>>> RETURNING id
>>>
>>> The SELECT operations took 21-122 ms to complete, while the UPDATEs
>>> took 1093000-1120000 ms (~18 minutes) to complete. During this time
>>> the processes were running at 100% CPU. With more processes involved
>>> earlier, I recall seeing even longer execution times for the UPDATEs
>>> before I killed the processes (this is the first time I have seen this
>>> kind of situation resolve without the processes being killed).
>>>
>>> For reference, once these 4 commands had completed (the system is set
>>> up
>>> to wait until it sees task completed messages from the first processes
>>> that were started, before starting new ones), the next sequence of
>>> these
>>> commands took 122ms and 107ms, respectively, and the second took 50ms
>>> and
>>> 108ms.
>>>
>>> Any suggestions for where to investigate further?
>>>
>>> I am considering setting up a small example to see if I can reproduce,
>>> but
>>> have not had time to do so yet.
>>>
>>> On Sat, 04 Jan 2014 20:06:19 +0100, Yngve N. Pettersen
>>> <yngve(at)spec-work(dot)net> wrote:
>>>
>>>> On Sat, 04 Jan 2014 19:40:31 +0100, Andrew Sullivan
>>>> <ajs(at)crankycanuck(dot)ca> wrote:
>>>>
>>>>> On Sat, Jan 04, 2014 at 07:07:08PM +0100, Yngve N. Pettersen wrote:
>>>>>> I tried that before, but ran into some issues, IIRC a similar
>>>>>> looping
>>>>>> problem as this where queries never ended. I split it up in an
>>>>>> attempt to
>>>>>> solve that problem.
>>>>>
>>>>> Pulling the data out into the application and sending it back in
>>>>> won't
>>>>> improve things. Exactly the same number of rows need to be visited,
>>>>> but the way you have it now you have to marshall all the data and
>>>>> ship
>>>>> it to the application too. So it's automatically slower. Indeed,
>>>>> making it slower might have masked your problem.
>>>>
>>>> Could be
>>>>
>>>>>
>>>>>> In the select/update case there is no sorting in the query; there
>>>>>> is an
>>>>>> offset/limit clause though, number of records retrieved are
>>>>>> currently
>>>>>> restricted to <10000 per query (out of 20 million in the active
>>>>>> subset).
>>>>>>
>>>>>> SELECT id from queue where state = E'I' and job_id = <integer>
>>>>>> offset
>>>>>> <random 200..150000> limit <1-6000>
>>>>>
>>>>> This could be part of problem. Are the different threads working on
>>>>> different job_ids, or is that the same job_id? If you don't SORT
>>>>> that
>>>>
>>>> Same job_id, at least in the current system.
>>>>
>>>>> query before the OFFSET, then the rows will come back in whatever
>>>>> order the system likes.
>>>>
>>>> I suspect that a sort operation on (currently) 20+ million rows for
>>>> every query that fetches just 6000 entries (the previous version used
>>>> 1500) would cause quite a bit more slowness than breaking the query
>>>> into two operations does, or than the risk of collisions would,
>>>> because each process would have to load all that information (even if
>>>> it is cached).
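>>>>
>>>> For reference, the sorted variant being discussed would be something
>>>> like this sketch, with the same placeholders as in the query above:
>>>>
>>>>    SELECT id FROM queue
>>>>    WHERE state = E'I' AND job_id = <integer>
>>>>    ORDER BY id
>>>>    OFFSET <random 200..150000> LIMIT <1-6000>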
>>>>
>>>>>> However, in the UPDATE case, the looping processes are all UPDATE
>>>>>> queries,
>>>>>> no SELECTs involved.
>>>>>
>>>>> But you said it's all in the same transaction scope. The lock is at
>>>>> the transaction scope.
>>>>
>>>> But the statements it locks/loops on are only UPDATE statements, also
>>>> in the processes that are waiting.
>>>>
>>>>> Anyway, what I'd do is try to cause the condition and post the
>>>>> pg_locks information. When I've done this in the past, usually the
>>>>> best thing to do is also to have query logs on for everything (don't
>>>>> forget to log the pid!) so you can see what the other transaction
>>>>> you're waiting on touched already. You can usually find the
>>>>> inversion
>>>>> that way. Once you see it, it's always obvious what you've done, in
>>>>> my experience (and completely mystifying before that, unfortunately).
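>>>>>
>>>>> A minimal example of that kind of pg_locks query (just a sketch;
>>>>> column names as in the 9.2 pg_stat_activity view) would be:
>>>>>
>>>>>    SELECT l.pid, l.locktype, l.mode, l.granted, a.query
>>>>>    FROM pg_locks l
>>>>>    JOIN pg_stat_activity a ON a.pid = l.pid
>>>>>    WHERE NOT l.granted;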
>>>>
>>>> Will take a look in a few days, probably midweek.
>>>>
>>>>
>>>
>>>
>>
>>
>
>

--
Sincerely,
Yngve N. Pettersen

Using Opera's mail client: http://www.opera.com/mail/
