Optimizing Program Performance
Capabilities and Limitations of Optimizing Compilers
Compilers must be careful to apply only safe optimizations to a program. The resuling program must have the exact same behavior as the unoptimized version.
Memory aliasing: Two pointers may designate the same memory location.
Side effect: The function that modifies some part of the global program state.
Expressing Program Performance
The CPE, or cycles per element, is a metric to express program performance that could be used to improve the code. The measurement is expressed in clock cycles.
Program Example
The definition of the vector
data structure:
Eliminating Loop Inefficiencies
The test codition is evaluated on every iteration of the loop. Therefore, the function vec_length(v)
is called on every iteration.
Code motion optimization involves identifying a computation that is performed multiple times. In this case, we move the computation of vec_length(v)
from within the loop to before the loop.
The compiler won't attempt to perform code motion, since it can't determine whether the function vec_length(v)
has side effects. If so, the behavior of combine2
will be different from combine1
.
Reducing Procedure Calls
Procedure calls can incur overhead and block most forms of program optimization.
The get_vec_start
returns the starting address of the data array. Rather than making a function call to retrieve each vector element, it accesses the array directly. However, it violates the principle of abstract data type, since it shouldn't know the detailed implementation.
Eliminating Unneeded Memory References
The accumulated value is read from and written to memory on each iteration. These operations could be eliminated by adding a temporary variable acc
to accumulate the computed value. The compiler won't attempt to elimnimate memory references due to memory aliasing.
Loop Unrolling
Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration.
Reduce the number of opreations that don't contribute directly to the result of the program
Expose ways that could transform the code to reduce the number of operations in the critical path
'k x 1 loop unrolling' refers to the transformation that unroll by a factor of k
but accumulate values in a single variable.
The number of overhead operations is reduced, and then the latency bound of 1.00 becomes the performance limiting factor.
Enhancing Parallelism
Modern CPU Design
Definition: Most modern CPU are superscalar. A superscalar processor can issue and execut multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are scheduled dynamically.
Benifit: Without programming effort, superscalar processor can take advantage of the instruction level parallelism.
Pipelined functional units divides computation into stages and passes partial computation from stage to stage. Stage i
can start on new computation once values passed to i + 1
.
Multiple Accumulators
'k x l' loop unrolling refers to the transformation that unroll by a factor of k
but accumulate values in l
variables. (k
must be multiple of l
)
With multiple accumulators, acc0
and acc1
could be computed in parallel, thus reduce the critical path of the program by a factor of 2.
Reassociation Transformation
Reassociation transformation shifts the order in which the vector elements are combined with the accumulated value acc
, yielding a form of '2 x 1a' loop unrolling.
By a very small change in the code of combine5
, the way of combining could be fundamentally changed and thus increase the performance.
With reassociation transformation, the (data[i] OP data[i + 1])
and acc OP (data[i] OP data[i + 1])
could be computed in parallel, thus reduce the critical path of the program by a factor of 2.
Last updated