When adding virtual CPU cores is the wrong thing to do
I was recently involved in a troubleshooting exercise where an application was giving visitors terrible response times and frequent connection timeouts. This particular application has peak traffic demands during the business day and very few users overnight. From the checks I made using top and process status reports, it was easy to see that this virtual Linux machine was saturated: multiple virtual CPUs running at 99% user+system and a run queue length over 30.
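As a rough illustration of that saturation check, here is a minimal Python sketch that reads the runnable-task count out of the text of /proc/loadavg and divides it by the number of CPUs. The function name, the sample line, and the CPU count of 4 are all made up for illustration; they are not figures from the system described here.

```python
def runnable_per_cpu(loadavg_text, cpu_count):
    """Parse the contents of /proc/loadavg.

    Returns (one_minute_load, runnable_tasks, runnable_tasks_per_cpu).
    The fourth field has the form "runnable/total" scheduling entities.
    """
    fields = loadavg_text.split()
    load1 = float(fields[0])
    runnable = int(fields[3].split("/")[0])
    return load1, runnable, runnable / cpu_count


# Hypothetical sample line and CPU count, for illustration only.
sample = "31.42 30.87 30.15 33/512 20417"
load1, runnable, per_cpu = runnable_per_cpu(sample, cpu_count=4)
print(f"1-min load {load1}, {runnable} runnable tasks, {per_cpu:.2f} per CPU")
```

On a live Linux host, the text could come from open("/proc/loadavg").read() and the CPU count from os.cpu_count(). A value that stays well above 1 per CPU suggests tasks are queuing for processor time, but it says nothing about whether the work they are doing is useful.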
While we were waiting for change control approvals for a middle-of-the-night maintenance window, I started studying the system activity reports (sar) for the past 10 days and keeping an eye on top. What I saw greatly concerned me. This application was running flat out 24 hours a day, 7 days a week, with no variation in load. As the night wore on into the early hours, the load stayed flat when logic told me it should have dropped.
For a transactional system with business inputs, there should be an observable variation: load builds as users come in early in the day and then tapers off in the late evening.
I began to suspect that this application was not making effective use of the system resources it was assigned.
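The flat-load check I did by eye can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the 08:00 to 18:00 business-hours window, the 1.2 threshold, and the sample numbers are all made up, and the input is assumed to be (hour, busy percent) pairs that you have already extracted from the sar CPU reports (for example, 100 minus %idle).

```python
from statistics import mean


def diurnal_ratio(samples, day_start=8, day_end=18):
    """Return mean daytime CPU busy % divided by mean overnight busy %.

    samples: iterable of (hour_of_day, busy_percent) pairs.
    day_start/day_end: assumed business hours, not taken from any real site.
    """
    day = [busy for hour, busy in samples if day_start <= hour < day_end]
    night = [busy for hour, busy in samples if not (day_start <= hour < day_end)]
    if not day or not night:
        raise ValueError("need both daytime and overnight samples")
    night_mean = mean(night)
    if night_mean == 0:
        return float("inf")  # completely idle overnight: clearly not flat
    return mean(day) / night_mean


def looks_flat(samples, threshold=1.2):
    """True when daytime load is barely higher than overnight load."""
    return diurnal_ratio(samples) < threshold


# Made-up data: one host pinned at 99% all day, one with a business-day peak.
flat = [(h, 99.0) for h in range(24)]
healthy = [(h, 70.0 if 8 <= h < 18 else 10.0) for h in range(24)]

print(looks_flat(flat))     # True
print(looks_flat(healthy))  # False
```

A ratio close to 1 for an application whose users mostly work business hours is a hint that the CPU time is going somewhere other than user transactions.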
The maintenance window came due, and the team doubled the number of virtual CPUs assigned to this application. Immediately I observed that the run queue and CPU load returned to the same levels as before. There were now twice as many resources, and yet this application completely consumed them. In fact, the response times and connection timeouts were actually slightly worse than before the change.
Eventually, it was determined that a combination of a loop and an incompletely written exception handler was causing the high CPU load. Adding more resources made the problem worse because the system was creating even more noise, drowning out the actual business transactions. The application team corrected the code in question, and the load dropped to a fraction of what it was before.
And in the sar reports, you could now see CPU utilization increasing starting early in the morning and then tapering off as the late evening came around.
Lesson learned: Always check the fundamentals and convince yourself that there is a correlation between problem, symptom, and proposed remedy.