Optimizing Program Performance

Capabilities and Limitations of Optimizing Compilers

Compilers must be careful to apply only safe optimizations to a program. The resuling program must have the exact same behavior as the unoptimized version.

Memory aliasing: Two pointers may designate the same memory location.
Side effect: The function that modifies some part of the global program state.

Expressing Program Performance

The CPE, or cycles per element, is a metric to express program performance that could be used to improve the code. The measurement is expressed in clock cycles.

Program Example

The definition of the vector data structure:

typedef struct {
  long len;
  data_t *data; // type of the underlying element
} vec_rec, *vec_ptr;

void combine1(vec_ptr v, data_t *dest) {
  long i;
  *dest = IDENT;
  for (i = 0; i < vec_length(v); i++) {
    data_t val;
    get_vec_element(v, i, &val);
    *dest = *dest OP val;
  }
}

Eliminating Loop Inefficiencies

The test codition is evaluated on every iteration of the loop. Therefore, the function vec_length(v) is called on every iteration.

Code motion optimization involves identifying a computation that is performed multiple times. In this case, we move the computation of vec_length(v) from within the loop to before the loop.

void combine2(vec_ptr v, data_t *dest) {
  long i;
  long length = vec_length(v);

  *dest = IDENT;
  for (i = 0; i < length; i++) {
    data_t val;
    get_vec_element(v, i, &val);
    *dest = *dest OP val;
  }
}

The compiler won't attempt to perform code motion, since it can't determine whether the function vec_length(v) has side effects. If so, the behavior of combine2 will be different from combine1.

Reducing Procedure Calls

Procedure calls can incur overhead and block most forms of program optimization.

The get_vec_start returns the starting address of the data array. Rather than making a function call to retrieve each vector element, it accesses the array directly. However, it violates the principle of abstract data type, since it shouldn't know the detailed implementation.

data_t *get_vec_start(vec_ptr v) {
  return v->data;
}

void combine3(vec_ptr v, data_t *dest) {
  long i;
  long length = vec_length(v);
  data_t *data = get_vec_start(v);

  *dest = IDENT;
  for (i = 0; i < length; i++) {
    *dest = *dest OP data[i];
  }
}

Eliminating Unneeded Memory References

The accumulated value is read from and written to memory on each iteration. These operations could be eliminated by adding a temporary variable acc to accumulate the computed value. The compiler won't attempt to elimnimate memory references due to memory aliasing.

void combine4(vec_ptr v, data_t *dest) {
  long i;
  long length = vec_length(v);
  data_t *data = get_vec_start(v);
  data_t acc = IDENT;

  for (i = 0; i < length; i++) {
    acc = acc OP data[i];
  }

  *dest = acc;
}

Loop Unrolling

Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration.

Reduce the number of opreations that don't contribute directly to the result of the program
Expose ways that could transform the code to reduce the number of operations in the critical path

'k x 1 loop unrolling' refers to the transformation that unroll by a factor of k but accumulate values in a single variable.

void combine5(vec_ptr v, data_t *dest) {
  long i;
  long length = vec_length(v);
  long limit = length - 1;
  data_t *data = get_vec_start(v);
  data_t acc = IDENT;

  // Combine 2 elements at a time
  for (i = 0; i < limit; i+=2) {
    acc = (acc OP data[i]) OP data[i + 1];
  }

  // Unroll remaining elements
  for (; i < length; i++) {
    acc = acc OP data[i];
  }

  *dest = acc;
}

The number of overhead operations is reduced, and then the latency bound of 1.00 becomes the performance limiting factor.

Enhancing Parallelism

Modern CPU Design

Definition: Most modern CPU are superscalar. A superscalar processor can issue and execut multiple instructions in one cycle. The instructions are retrieved from a sequential instruction stream and are scheduled dynamically.

Benifit: Without programming effort, superscalar processor can take advantage of the instruction level parallelism.

Pipelined functional units divides computation into stages and passes partial computation from stage to stage. Stage i can start on new computation once values passed to i + 1.

Multiple Accumulators

'k x l' loop unrolling refers to the transformation that unroll by a factor of k but accumulate values in l variables. (k must be multiple of l)

void combine6(vec_ptr v, data_t *dest) {
  long i;
  long length = vec_length(v);
  long limit = length - 1;
  data_t *data = get_vec_start(v);
  data_t acc0 = IDENT;
  data_t acc1 = IDENT;

  // Combine 2 elements at a time
  for (i = 0; i < limit; i+=2) {
    acc0 = acc0 OP data[i];
    acc1 = acc1 OP data[i + 1];
  }

  // Unroll remaining elements
  for (; i < length; i++) {
    acc = acc OP data[i];
  }

  *dest = acc0 OP acc1;
}

With multiple accumulators, acc0 and acc1 could be computed in parallel, thus reduce the critical path of the program by a factor of 2.

Reassociation Transformation

Reassociation transformation shifts the order in which the vector elements are combined with the accumulated value acc, yielding a form of '2 x 1a' loop unrolling.

By a very small change in the code of combine5, the way of combining could be fundamentally changed and thus increase the performance.

// combine5
acc = (acc OP data[i]) OP data[i + 1];

//combine7
acc = acc OP (data[i] OP data[i + 1]);

With reassociation transformation, the (data[i] OP data[i + 1]) and acc OP (data[i] OP data[i + 1]) could be computed in parallel, thus reduce the critical path of the program by a factor of 2.

PreviousMachine Level Programming NextThe Memory Hierarchy

Last updated 3 years ago