Solaris Troubleshooting: How the System Generates Utilization Statistics (usr, sys, idle) in Solaris 10 and Prior
In Solaris 10, the method used to calculate usr, sys and idle time has changed. In addition, the waitio time has been hard-coded to 0 in all instances (because it was never a very meaningful statistic).
Because of the changes, running the same workload on S9 and S10 may produce different statistics even though the workload is really running similarly on both systems. This can lead to confusion, especially with respect to capacity planning. Here we attempt to address the sources of differences.
In addition, there is another effect in S10 because interrupt time is not necessarily charged as sys time (as described below); this was addressed in later releases.
How was utilization calculated prior to S10?
Prior to S10, utilization was calculated using a sampling technique as described in Leffler et al., The Design and Implementation of the 4.3BSD UNIX Operating System, pp. 51-52. A simple summary of the technique is:
On every clock tick (in level 10 interrupt context) the CPU that handles the clock interrupt will:
1) Inspect each CPU to see what task is running (usr/sys/idle) and chalk up 1 more tick to the appropriate category.
2) Invoke callout_schedule() to process any timeouts that are due for this clock tick.
This means utilization is a statistical sampling of the CPU activity on the clock tick (100Hz by default, but changeable) and it’s reasonably accurate unless there is significant callout scheduling.
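The sampling step above can be sketched as a toy model in C (illustrative only; the structures and names here are invented for clarity, not actual Solaris source):

```c
/* Sketch of classic tick-based sampling: on every clock tick, inspect
 * each CPU's current activity and charge one tick to that category. */
#include <assert.h>

enum cpu_state { CS_USR, CS_SYS, CS_IDLE, CS_NSTATES };

struct cpu {
    enum cpu_state state;       /* what the CPU is doing right now */
    long ticks[CS_NSTATES];     /* per-category tick counters */
};

/* Run once per clock tick: each CPU gets one tick charged to whatever
 * category it happens to be in at the sampling instant. */
static void clock_sample(struct cpu *cpus, int ncpus)
{
    for (int i = 0; i < ncpus; i++)
        cpus[i].ticks[cpus[i].state]++;
}
```

Utilization percentages then fall out as each category's tick count divided by total ticks, which is why accuracy depends entirely on the sampling instants being representative.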
Why was the calculation changed in S10?
Statistical sampling works well when the samples are taken randomly but that’s not the case with this classic implementation. Far from being random, the sampling is actually synchronized with some of the activity it’s meant to measure.
In the case where there is significant callout scheduling, lots of threads may be made runnable by the clock tick, do their work, and block waiting for another timeout just prior to the next clock tick. If the system is less than 100% utilized, then that next clock tick may occur with most CPUs in the idle loop since the callout work finished before the tick. This may tend to misrepresent the usage by over-estimating idle time and under-estimating usr and/or sys time, implying the system has more capacity headroom than is really the case. In addition, if the system is close to 100% utilized, then the sampling error may cause time to be charged against the wrong thread.
Applications that use poll() timeouts intensively are typically the ones the classic technique misreports most. It is also open to subversion by clever programmers who can figure out how to get off CPU just prior to the clock interrupt and so avoid having the time charged to their process.
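The synchronization problem can be made concrete with a toy simulation (all names and numbers are illustrative, not from Solaris). A thread is woken by callout processing just after each tick, works for half the tick interval, and sleeps until the next tick; since the sampler inspects the CPU on the tick itself, before callouts run, it never catches the thread busy even though true utilization is 50% in this model:

```c
/* Toy model: sampling that is phase-locked with the workload. */
#include <assert.h>

#define TICK_US 10000L   /* 100Hz => 10000 microseconds per tick */

/* Return 1 if the CPU is busy at absolute time t (microseconds).
 * Work occupies (tick, tick + TICK_US/2]: it starts just after each
 * tick, because the sampler runs before callout processing. */
static int busy_at(long t)
{
    long phase = t % TICK_US;
    return phase > 0 && phase <= TICK_US / 2;
}

/* Classic sampler: inspect the CPU exactly on each tick boundary. */
static long sampled_busy_ticks(int nticks)
{
    long busy = 0;
    for (int i = 0; i < nticks; i++)
        if (busy_at((long)i * TICK_US))
            busy++;
    return busy;
}
```

In this model the sampler reports 0% utilization over any number of ticks, while a randomly placed sample would find the CPU busy about half the time.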
Benchmark systems can be subject to misreporting as well because the benchmark driver system generating the workload may be pacing the workload using its own 100Hz clock. CPU utilization observed on the system under test (SUT) can vary markedly depending on the phase difference of the 100Hz clocks on the driver and SUT systems.
For the vast majority of systems, the work is asynchronous to the system clock and/or generated by external events (e.g. disk and network interrupts) and for these, clock-based sampling provides very reasonable accuracy with very low overhead.
How has utilization calculation changed in S10?
With S10, the decision was made to get rid of statistical sampling and move to microstate accounting whereby time accounting calculations are made each time processes change state. This was intended to be more accurate since each LWP and CPU has the microstate accounting data that’s updated directly as processes move from state to state. One result is that users may see a decreased (but more correct) reporting of idle time versus Solaris 9 for the same workload, and mistakenly believe that Solaris 10 is less efficient and offers less headroom. However, the vanished headroom never really existed, and users now have a more accurate measure of headroom to use for capacity planning.
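The microstate idea can be sketched in a simplified model (these structures and names are illustrative, not the actual kernel ones): record a timestamp at every state transition and charge the elapsed interval to the state being left, so no sampling is involved at all.

```c
/* Sketch of microstate-style accounting: exact time is charged on
 * every state transition instead of being sampled on the clock tick. */
#include <assert.h>

enum ms_state { MS_USR, MS_SYS, MS_IDLE, MS_NSTATES };

struct ms_cpu {
    enum ms_state state;           /* current microstate */
    long long state_start;         /* when the current state began */
    long long acct[MS_NSTATES];    /* accumulated time per state */
};

/* Transition to 'next' at time 'now': charge the interval since
 * state_start to the outgoing state, then start timing the new one. */
static void ms_switch(struct ms_cpu *c, enum ms_state next, long long now)
{
    c->acct[c->state] += now - c->state_start;
    c->state = next;
    c->state_start = now;
}
```

Because every interval is measured rather than sampled, a thread that blocks just before the tick is still charged for exactly the time it ran.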
What new problem was introduced in S10?
With the spotlight now on microstate accounting (which presumably nobody had looked at closely before), an important shortcoming became apparent: the microstates are not corrected for time spent pinned by interrupts (this includes the idle process). In the presence of a high interrupt load, the usr/sys/idle breakdown can therefore be very much off, since interrupt time is counted against whatever process was pinned by the interrupt. Thus, sys time is under-estimated, and either usr or idle or both may be over-estimated.
For example a CPU can run 50% of the time under pci_intr_wrapper (as per dtrace profile data) but still be reported as 100% idle under mpstat if it is only handling interrupts, since the time will be charged to idle.
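The pci_intr_wrapper example can be reduced to a toy model (illustrative only; not kernel source): the idle thread is pinned by interrupts for half of an accounting interval, but because no microstate transition is recorded while pinned, all of the interrupt time is absorbed by the idle microstate.

```c
/* Sketch of the uncorrected early-S10 behavior: interrupt time is
 * invisible to microstate accounting and lands in whatever state
 * was pinned -- here, idle. */
#include <assert.h>

struct raw_acct {
    long long idle;   /* time charged to the idle microstate */
    long long intr;   /* time actually spent in interrupt handlers */
};

/* One accounting interval of 'len' units, of which 'intr_len' was spent
 * handling interrupts that pinned the idle thread. With no correction,
 * the idle microstate absorbs the interrupt time too. */
static void account_interval(struct raw_acct *a, long long len,
                             long long intr_len)
{
    a->idle += len;       /* the idle thread never changed state */
    a->intr += intr_len;  /* visible only to e.g. a dtrace profile */
}
```

An mpstat-style view of this model reports the CPU 100% idle even though half the interval was interrupt work.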
There is also a small performance hit because there is extra code in the path to do the accounting on state changes, but ignoring that and the fact that it fails to report interrupt time correctly, it has another flaw: its probe effect makes it inaccurate, tending especially to under-report sys time. Take the typical case of making syscalls. We only start recording that we are in the kernel once we are 26 instructions into the trap handler and have already executed 6 loads, all of which could miss in the E$ (different cache lines), plus a store involving a 7th cache line.
Then we call syscall_mstate(); 28 instructions, 2 or 3 cachelines of text so far. Then we may take a register window spill on the first instruction and have accessed one or two cache lines of text before calling gethrtime_unscaled(), yet another cache line of text, where we read the tick register. At least the mstate update on syscall return is closer to the end. However, there is quite a bit of data and text footprint in this call to syscall_mstate() _after_ reading the tick. Again the net result is that sys time is under-estimated, and usr time is over-estimated.
An extreme case of this was demonstrated with a test program doing a lot of semop() calls where mpstat showed 60/40 usr/sys but the true accounting should have been more like 25/75. Most cases will not see that large an effect. In addition, this source of inaccuracy does not affect idle time and so capacity planning should still be accurate.
What was fixed in later versions?
With the appropriate patch installed, the interrupt time will be apportioned to system time in the same manner as was done in previous versions of Solaris. This should mean that there will be fewer differences between S9 and S10 statistics and the statistics will be more accurate in systems with significant interrupt time. To do this we keep a second set of microstate stats in the CPU structure which counts the ticks spent in interrupt mode according to whether it was usr, sys or idle that got interrupted. Then whenever the CPU microstate is fetched, the interrupt time for which usr and idle were pinned is subtracted from those CPU microstate categories and added to the sys CPU microstate.
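The correction described above can be sketched as follows (a simplified model; the structure and function names are illustrative, not actual kernel identifiers): alongside the raw microstates, keep a second set of counters recording interrupt time indexed by the microstate that was pinned, and fold that time into sys whenever the stats are fetched.

```c
/* Sketch of the S10U1-style fix: at fetch time, interrupt time that
 * pinned usr or idle is moved into sys, matching pre-S10 reporting. */
#include <assert.h>

enum st { ST_USR, ST_SYS, ST_IDLE, ST_N };

struct cpu_stats {
    long long mstate[ST_N];    /* raw microstate time */
    long long intracct[ST_N];  /* interrupt time, by pinned state */
};

/* Compute adjusted stats without touching the raw counters: subtract
 * interrupt time from the pinned usr/idle categories and add it to sys.
 * Interrupt time that pinned sys is already in the right bucket. */
static void fetch_adjusted(const struct cpu_stats *c, long long out[ST_N])
{
    out[ST_USR]  = c->mstate[ST_USR]  - c->intracct[ST_USR];
    out[ST_IDLE] = c->mstate[ST_IDLE] - c->intracct[ST_IDLE];
    out[ST_SYS]  = c->mstate[ST_SYS]
                 + c->intracct[ST_USR] + c->intracct[ST_IDLE];
}
```

Note that the adjustment only moves time between categories, so total accounted time is preserved.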
Here is a useful table that summarizes the effects of each bias in each release:
release   usr     sys     idle
S8/S9     under   under   over
S10       over    under   over
S10U1     over    under   accurate
Note that the amount over or under in all cases (especially S10) is usually very small and with the addition of the fixes in S10U1 to handle interrupts the statistics should be more accurate than ever.