How Process Priority Inversion Can Burn CPU via “Waiting” Processes
For the past few weeks, we have been wrestling with an interesting bug in Oracle 11g at Netflix.
We are seeing high CPU usage accompanied by a high number of wait events for the following:
- cursor: mutex S
- latch: shared pool
I was perplexed as to why waits would result in high CPU, so I hopped onto Google. I didn’t find an answer for 11g, but I did find a similar problem reported in 10.2 and reportedly fixed in 11g.
In Oracle 10.2, on operating systems (e.g. AIX, HP-UX) that support process priority decay (e.g. fair round robin, the default on AIX 5), priority inversion can cause “blocked threads” to burn CPU. Ignore the fact that the article below mentions “cursor: pin S” waits instead of “cursor: mutex S” waits; the pattern is the same.
http://blog.tanelpoder.com/2010/04/21/cursor-pin-s-waits-sporadic-cpu-spikes-and-systematic-troubleshooting/
In Oracle 10.2, latches cause waits, while mutexes cause CPU spins.
Essentially, one thread holds the cursor mutex exclusively while N threads request shared access to it. All are on the run queue. Each of the N requester threads gets onto the CPU, finds that the mutex is not yet free, and, instead of going to sleep (i.e. joining the wait queue), goes back to the end of the run queue. This burns CPU. For comparison, if a latch were used, the latch requesters would join the wait queue and sleep, and hence not burn CPU.
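To make the difference concrete, here is a toy round-robin scheduler in Python. It is only a sketch of the pattern described above, not how Oracle or any real OS scheduler is implemented; the function name `simulate` and its parameters are made up for illustration, and one loop iteration stands in for one time slice.

```python
from collections import deque

def simulate(n_waiters, holder_work, spin):
    """Toy model: one holder needs `holder_work` time slices to release
    the mutex; `n_waiters` requesters want it. Returns (total, wasted)
    time slices used before the mutex is released."""
    runqueue = deque(["holder"] + [f"waiter{i}" for i in range(n_waiters)])
    remaining = holder_work
    total = 0
    wasted = 0
    while remaining > 0:
        proc = runqueue.popleft()
        total += 1
        if proc == "holder":
            remaining -= 1
            runqueue.append(proc)
        else:
            # The requester got on the CPU only to find the mutex still held.
            wasted += 1
            if spin:
                # Mutex-style: go straight back to the end of the run queue.
                runqueue.append(proc)
            # Latch-style (spin=False): sleep on the wait queue, i.e. do not
            # come back onto the run queue until the holder is done.
    return total, wasted

print(simulate(4, 3, spin=True))   # (11, 8)
print(simulate(4, 3, spin=False))  # (7, 4)
```

With four requesters and a holder that needs three slices, the spinning variant wastes eight slices while the sleeping variant wastes four (one failed check per requester), and the gap grows with every extra slice the holder needs.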
The mutex holder can then suffer Unix priority decay, or what is known as priority inversion, depending on the OS scheduling policy in use. This gives the mutex holder less time on the CPU. As a result, it cannot release the mutex in a timely manner, and the mutex requesters just keep burning CPU.
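The effect of a decayed priority can be sketched the same way. The model below is a deliberate simplification and an assumption on my part, not the actual AIX or HP-UX policy: the scheduler always runs the process with the least recent CPU usage, and the holder starts with a hypothetical `holder_penalty` of already-accumulated usage. The name `spin_cost_with_decay` is also made up for illustration.

```python
def spin_cost_with_decay(n_waiters, holder_work, holder_penalty):
    """Toy usage-based scheduler: more recent CPU usage means lower
    priority. Returns the number of time slices the spinning requesters
    burn before the holder finishes `holder_work` slices."""
    # usage[0] belongs to the mutex holder, the rest to the requesters.
    usage = [holder_penalty] + [0] * n_waiters
    remaining = holder_work
    wasted = 0
    while remaining > 0:
        # Run whoever has used the least CPU; ties go to the lowest index.
        proc = min(range(len(usage)), key=lambda i: usage[i])
        usage[proc] += 1
        if proc == 0:
            remaining -= 1   # holder makes progress toward releasing
        else:
            wasted += 1      # requester spins: mutex is still held
    return wasted

print(spin_cost_with_decay(4, 3, holder_penalty=0))   # 8
print(spin_cost_with_decay(4, 3, holder_penalty=10))  # 48
```

In this model the requesters burn 8 slices when the holder has no penalty and 48 when it starts 10 slices "in debt", because the holder is not scheduled at all until every spinner has caught up with its usage.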
This pattern will cause high CPU and “blocked” sessions. According to the article above, this is not an issue on 11g.
-s
rooksfury posted this