Loop unrolling expands the content of a control loop, and in some cases eliminates the need of the control loop entirely. The goal of this technique is to minimize the execution time of a coded algorithm. Unrolling loops helps to minimize the number instructions because the number of end of loops check is reduced, and it helps the compiler generate assembly that uses parallel instructions. To learn how this technique are helpful, let’s look at the example code in the first image below.
Figure 1. Convolution Routines – unoptimized
This is performing a convolution on the y input filter coefficients, h. An optimizing compiler will not be able to unroll and optimize this function on its own. This is due to the fact that the bounds on the trip count of the inner loop are unknown and changes per iteration of the outer loop. However, if the size of filter is known to be a constant, then the number of iterations can at least be predictive. For example, if we know the length of the convolution filter to be 5, then the trip count will be 5 for the 1st iteration of the outer loop, 4 for the 2nd iteration, and so on so forth. If we perform a loop unrolling at the C level, the compiler will have a much easier time parallelizing the instructions. Figure 2, shows the convolve() function with a filter length of 5 completely unrolled.
Figure 2. Convolution Routine – unrolled
To help with the visualization of the unrolling, the acc variable was turned into an array. Each index represents the accumulation result of each iteration of the outer loop. Now, with the expanded for-loops the pointer arithmetic and end of loop checks have been eliminated. In addition, on platforms that support parallelization, the compiler will produce better optimized code now the exact set of instructions that need to performed are presented.