Performance is critical in most real-time systems. As any
programmer knows, finding the causes of poor performance can be
a difficult and time-consuming task. Often, the problem is easy
to fix, if and when the root cause is discovered. Statistical
profiling is an important tool for improving the performance of
any program. It is important in general programming as it
measures the execution pattern of your application on a
procedure-by-procedure basis by taking samples of the execution
state. These samples are used to calculate where the CPU is
spending its time. This information is exactly what you need to
tune your time-critical system for maximum performance.
In many real-time operating systems, such as VxWorks, each
task has an associated stack. The stack is used as a temporary
storage area for both local program variables and program
execution state. Each time a new function is entered, the
processor allocates a new section of the stack to hold the
information that function needs. This section is called the
function's stack frame. It contains the function's temporary
variables and several key pieces of information, including
saved processor registers and the return address of the parent
routine. A register in the processor, called the frame pointer
(FP), always points to the currently executing function's stack
frame. When a subroutine is called, the old frame pointer is
saved on the stack, a new stack frame is created, and the frame
pointer is updated. Thus, at any instant in time, the entire
function call history is present on the stack and can be
determined by "walking" down the chain of frame pointers stored
on the stack.
Statistical profiling works by sampling this execution
state. By examining the program counter, the profiler can
determine which function in your program was executing at the
time of the sample. By examining the chain of stack frames, the
function sequence that resulted in the current execution state
can also be determined.
The sampling clock must be relatively independent of the
executing program so that it can generate "random" samples that
can be compiled into statistics to show the program's execution
patterns. This compilation is performed by a low-priority task
on the target. The steady-state CPU usage for a "typical"
application profile is only about 3-5%, which is by default
excluded from the profile display. On-target analysis has the
huge benefit of avoiding all network traffic by the tool
itself. Thus, networking applications can be accurately
profiled.
As an example, consider this portion of a network server's
execution tree as analyzed by Real-Time Innovations (RTI)
product, ProfileScope:
->ProfileShowTree "_TcpWrite",0,0.5
Profiling task 0x3b20e0: net4
Current
sum % direct %
10.43 .12 _TcpWrite
10.31 0.01 . _write
10.27 .18 . . _iosWrite
10.06 .16 . . . _send
9.35 .64 . . . . _sosend
7.92 .27 . . . . . _tcp_usrreq
7.25 .88 . . . . . . _tcp_output
4.97 .49 . . . . . . . _ip_output
3.87 0.06 . . . . . . . . _eiOutput
3.80 .51 . . . . . . . . . _ether_output
2.32 .31 . . . . . . . . . . _eiTxStartup
1.12 1.12 . . . . . . . . . . . _bcopy
.69 .69 . . . . . . . _cksum
.63 .63 . . . . . _bcopy
The example shows the server's TCP write activity. The
profile clearly shows the overhead incurred at each layer: the
I/O system, the socket layer, TCP, IP, and the Ethernet driver.
Both cumulative (including all subroutines) and direct CPU
usage of each routine are captured. You can easily see how much
time was required to copy data (bcopy) and calculate checksums
(cksum). Note that operating-system code is profiled along with
user code, a critical requirement for system performance
tuning.
The sum column indicates how many times that routine was
active, either on the stack or directly. The direct column
indicates how many times the program counter was directly
executing that routine (the direct values don't always add up
to match the sum, because any call chain using less than .5% of
the CPU is not normally displayed). From this output, many
important facts can be learned. For instance, you can see the
overhead incurred in each layer of the network protocol, the
total percentage of time spent writing to the socket, and the
number of times the data was copied.
A profiler may keep both short and long-term statistics. The
long-term statistics simply reflect the activity of the CPU for
the entire time profiling has been active. The short-term
statistics offer a dynamic record of the recent CPU activity.
This dynamic record reflects the activity of the CPU over the
last few seconds.
Dynamic records are used to analyze load changes as a
program executes, with the output plotted in real-time. Many
profilers use a bar chart format. Figure 1
contains an example of a dynamic record, showing a plot with
RTI's data monitoring tool, StethoScope, with vertical bars and
a time history.
There are several tools that allow real-time programmers to
collect performance measurements. In addition to function-level
statistical profiling you can use operating system monitors and
coverage analysis. All these tools are useful for tuning your
application code for maximum performance. While they are very
different, their output is quite complementary.
Figure 1: Dynamic Execution
Profiling
Since the profiler keeps dynamic CPU usage statistics, it can
be used to analyze changing programs. Here, a network server
handles a varying connection load. Note that each function
making up the load can be independently measured.
Other Types of Performance Tools
There are many tools that allow real-time programmers to
collect different performance measurements such as operating
system monitors. These types of tools give information quite
different from statistical profiling. For example, if your
application is thrashing (spending most of its time context
switching), a profiler will simply report that most of the CPU
is being spent in the kernel. An OS monitor will show you which
tasks are switching, and display the events (e.g. interrupts)
that are causing the switches. The monitor is clearly more
useful in this situation.
On the other hand, if you have one task hogging the CPU, an
OS monitor will only tell you that one task is running most of
the time. Statistical profiling, however, will provide a
detailed function-by-function analysis, breaking down the
individual routines within the task that are burning the CPU.
It will give you a direct map of what that task is doing, which
routines are being called, what routines they call, and point
out exactly where the inefficiencies are.
Statistical profiling won't reliably determine if every line
of code has been executed at least once; that requires a
coverage analysis tool like Tornado's CodeTest/Coverage. A
profiler measures how much CPU each function in the program is
using over a reasonable time period. This is the information
needed to speed up the overall execution of the system.
Although it cannot evaluate absolute coverage, a profiler
can analyze the general execution tree of the system to give
you an idea how the various routines are being executed. In
some cases, very complete maps can be generated. Further, some
profilers, like ProfileScope, have the advantage that they
require no code modification and can thus analyze already
running code as well as operating-system routines.
When To Use a Statistical Profiler
The most important applications for a profiler in a real-time
system are:
- Identifying excessive usage of the CPU, so you can
optimize the code, or re-code to not utilize CPU-intensive
functions.
- Timing the execution of a set of routines, e.g. to
determine what routines contribute to the time of response to
some event.
- Analyzing execution trees.
Example Usage of a Statistical Profiler
Examples of each of the following applications, (see
Analysis) show the speed with which a statistical
profiler can acquire the data needed to analyze the
software.
In Application 1, the profiler must examine
a large number of routines (often the entire system). The
routines in question are high-probability events; it's not much
use optimizing routines whose total usage of the CPU (including
all subroutine calls) is less than about 1%. For this
application, accuracy is not that important. The analysis
assumes you need a two-digit accurate measurement of a routine
that uses 1% of the CPU. The case of a routine that uses only
0.1% of the CPU is also examined for completeness. Asking for
two-digit accuracy for an event with a probability of 0.001
(1/1000 of the CPU) requires a tool capable of measuring 1 part
in 100,000. Building an execution tree this detailed is roughly
equivalent to building a complete map of every mile a car has
driven over its entire 100,000 mile lifespan.
Application 2 addresses the same problem as
Application 1, but handles situations where it is important to
know the exact number of times each procedure being timed has
been executed. This is not usually a problem as much code is
periodic. If the code is not periodic, a simple counter can be
used. Given the execution frequency, the CPU percentage can be
directly translated into elapsed time.
Application 3 requires mostly that the
profiler catch most of the "leaves" of the execution tree. The
leaves may be extremely rare events, but high accuracy in
percentage usage by those leaves is not as important. The
analysis assumes you want to catch every event that used at
least one ten-thousandth of the CPU to one significant
figure.
Profiling Boosts RTOS Performance
Performance is critical in many real-time systems. Getting the
most out of your CPU cycles is often the key to top
performance. Profiling and tuning code is one of the most
important ways to increase performance.
Statistical profiling is simple to use, minimally intrusive,
and flexible. As we have seen here, it is also capable of
accurate measurements. In just seconds, the profiler collects
enough samples for accuracy of less than 1%. Collecting for a
minute or two yields much higher accuracy.
Statistical profilers, such as Tornado's ProfileScope,
provide a detailed function-by-function analysis, breaking down
the individual routines within the task that are burning the
CPU. It creates a direct map of what the CPU is doing, what
routines are being called, what routines they call, and points
out exactly where the inefficiencies are. This information is
exactly what's needed to tune time-critical systems for maximum
performance.
The accuracy of a statistical profiler's measurement is
determined by the number of samples the profiler collects. We
seek to understand how long we must let the profiler collect
data to insure accuracy that will accomplish our purpose. For a
given collection rate, this means we need to calculate the
number of required samples.
The detailed statistics are beyond the scope of this paper.
We use a well-known result: for a confidence coefficient
g of a confidence interval e, you need at least
N samples of a binary process, where N is
conservatively estimated by:
where p is the probability of the binary event, q
= 1-p, and Z is the inverse of the Normal
distribution integral function. The binary event is the event
that a sample contains the routine in question. What this means
is that to be g sure that you are within an error of
e, you need to collect N samples.
Application 1: Optimizing
Code
Optimizing code so your program will run faster is by far the
most common application of profiling. To accomplish this, you
must identify routines that are using large percentages of the
CPU so you know where to concentrate your efforts to speed the
code.
Most profiling applications are well served with 95%
confidence to the nearest half a percent or so. What this
really means is that you are 95% sure that the numbers you are
looking at are within plus or minus .25% (a range of .5%) of
the true values. That level of accuracy is available almost
immediately. In particular, for a routine using 1% of the
CPU:
p = .01, q = .99, g = .95, e = .005, Z[...] = 1.96
This works out to only 1521 samples, or about 14 seconds of
data collection at ProfileScope's default 107Hz.
A common, very harsh statistical test is to be 95% sure of
two-digit accuracy. For this level of confidence in a
measurement of a routine using 1% of the processor, you would
have:
p = .01, q = .99, g = .95, e = .001, Z[...] = 1.96
Substituting, we find that we need 38032 samples, or about
six minutes worth at Profile's default 107Hz. If we increase
the sample rate to 500Hz (analyzing 500 samples a second is
about as fast as is reasonable for most applications), this
takes only about one minute. Thus, even this level of accuracy
is completely reasonable for a typical statistical profile.
To achieve this level of accuracy for a more rare event, say
one that uses only 0.1% of the processor, more samples are
required. Using p = .001 and e = .0001, this
accuracy requires 384160 samples, or about an hour at 100Hz, 12
minutes at 500Hz. This would require some patience, but as
noted above, this is an exceedingly fine-grained
dirt-on-the-car measurement.
Application 2: Timing
Routines
Suppose you have a device driver, and you want to measure the
time of execution of every routine in it, accurate to 1us.
We'll assume you put the driver into a tight loop, and in
that loop, it executes every 0.001 sec. (1000 times a second).
Suppose a routine within the driver takes 10us. In one second,
the CPU will spend 10us x 1000 = 10ms in this routine, so
p = .01. To get 1us accuracy, you will need to know the CPU
usage to 1ms (.001 sec) accuracy for the 1000 loops. As above,
this means p = .01, e = .001. It will take six
minutes to collect at the default 107Hz, one minute at
500Hz.
If the driver doesn't execute as often, you can use
selective sampling. ProfileScope supports selective sampling
through a simple on/off API. Restricting the sampling to only
your desired code sections allows a higher effective sampling
rate. This can dramatically shorten the time required to get
accurate results. Of course, this intense sampling may slow
down the system somewhat, but it may provide the data
needed.
Application 3: Analyzing Execution
Trees
Finally, building an execution tree that catches every routine
using one two-thousandths of the CPU with 95% probability would
only take one minute at 107Hz. Building an extremely accurate
tree by catching every event that used at least one
ten-thousandths of the CPU to one significant figure, (p
= .0001, e = .00005) requires 153664 samples, or about
24 minutes at 107Hz, five minutes at 500Hz.
About the Author

After serving as manager of the Stanford
University Aerospace Robotics Laboratory for three years,
Stan Schneider founded Real-Time Innovations in 1992 to
develop and market productivity tools for the real-time
marketplace. He is a recognized expert in real-time software
systems and architectures, with experience in data
acquisition systems for automotive impact testing; software
lead for PC and communications products; and management of,
and extensive consulting experience in, real-time digital
signal processing and computer systems.
He authored many innovative tools for the VxWorks environment
from Wind River Systems, including the RTILib package, as
well as the StethoScope and the original ControlShell system
from RTI. He also chairs the Wind River User's group.
Mr. Schneider completed his Ph.D. in Electrical Engineering
and Computer Science at Stanford University. He holds a B.S.
in Applied Mathematics (Summa Cum Laude) and an M.S. in
Computer Engineering from the University of Michigan.
References
Applied Statistics, by Neter, Wasserman, &
Whitmore; Allyn & Bacon, 1978. (See section 11.2)
Probability, Random Variables, and Stochastic
Processes; by Papoulis; McGraw-Hill, 1984. (See section
5-3)