Compiler extensions alter AltiVec

By Kalpesh Gala, Product Manager, Motorola Inc., Austin, Texas, Mike Haden, Compiler Development Manager, Green Hills Software Inc., Santa Barbara, Calif., EE Times
November 14, 2000

The addition of Motorola's AltiVec technology to the PowerPC microprocessor architecture has created a general-purpose processor capable of handling many high-performance applications that traditionally required two kinds of hardware: a single microprocessor to perform the system control function, and off-chip devices based on one or more other architectures. Those off-chip devices are typically a DSP farm or custom ASICs acting as coprocessors that perform specialized computations in applications such as networking, data communications and networked multimedia.

To take full advantage of this new capability, programmers need advanced compiler technology to handle vector extensions to high-level languages (HLLs). Such compilers exist, but further enhancements are possible. The next generation of compilers should be able to autovectorize existing code into highly optimized vectorized code.

Creating software that handles complex mathematical operations is never an easy task, but it becomes nearly impossible without the aid of high-level programming languages like C. Unfortunately, using HLLs can produce increased code size or reduced execution speed relative to hand-optimized software. The key to using a high-level language without paying a significant performance penalty is a good compiler.

While many good compilers exist for standard C programs, the addition of AltiVec technology to PowerPC poses a new challenge and calls for advances in compilers. AltiVec technology processes multiple data items in parallel using a single-instruction-multiple-data (SIMD) paradigm, treating each 128-bit vector as four 32-bit, eight 16-bit or sixteen 8-bit elements.
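As a rough illustration of the three ways a 128-bit vector can be divided, the sketch below uses a plain C union as a stand-in for a vector register. The names vec128 and lanes_for are hypothetical and are not part of any AltiVec compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable stand-in for one 128-bit vector register.
   Every member views the same 16 bytes with a different element width. */
typedef union {
    float    f32[4];   /* four 32-bit floating-point elements */
    uint32_t u32[4];   /* four 32-bit integer elements        */
    uint16_t u16[8];   /* eight 16-bit elements               */
    uint8_t  u8[16];   /* sixteen 8-bit elements              */
} vec128;

/* Number of elements of the given byte width that fit in one vector. */
static size_t lanes_for(size_t element_bytes)
{
    assert(element_bytes == 1 || element_bytes == 2 || element_bytes == 4);
    return sizeof(vec128) / element_bytes;
}
```

The narrower the element, the more elements a single instruction can process at once.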

Compilers must provide access to these new AltiVec instructions. Ideally they would also automatically find opportunities for parallelization in the code. Such parallelization can greatly accelerate many repetitive numerical tasks, such as those found in signal processing, statistical analysis, image processing and financial calculations. Consider, for example, a simple addition of two arrays, each containing 1,024 floating-point numbers. A conventional C routine would require a "for" loop with 1,024 iterations. By grouping the floating-point numbers four to a vector and performing a vector addition, the loop is reduced to 256 iterations, yielding up to a fourfold decrease in execution time using the AltiVec technology.
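The iteration counts in the array-addition example can be checked with a small sketch in portable C. It does not use AltiVec instructions: the inner four-element loop stands in for what would be a single vector add on the hardware, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4 };  /* four 32-bit floats per 128-bit vector */

/* Scalar version: one addition per loop iteration.
   Returns the number of iterations executed. */
static size_t add_scalar(const float *a, const float *b, float *out, size_t n)
{
    size_t iterations = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
        ++iterations;
    }
    return iterations;
}

/* Vector-style version: each outer iteration handles one group of four
   floats. On AltiVec hardware the inner loop would be one vector add. */
static size_t add_by_fours(const float *a, const float *b, float *out, size_t n)
{
    assert(n % LANES == 0);  /* this sketch assumes whole groups only */
    size_t iterations = 0;
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t lane = 0; lane < LANES; ++lane) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
        ++iterations;
    }
    return iterations;
}
```

For 1,024-element arrays the first function reports 1,024 iterations and the second 256. The fourfold figure is a ceiling, since memory bandwidth and loop overhead also affect the measured time.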

The addition of the vector-processing hardware to the MPC7400 presents the compiler with its first challenge, which is to provide software developers with access to the new AltiVec instruction set. While it is possible to provide access to the instruction set by allowing the inclusion of assembly-language code segments into an HLL program, such an approach offers none of the benefits that the HLL normally supplies. A better approach is to extend the HLL to include new data types and functions that have a strong, if not one-to-one, correlation with the new assembly instructions. Motorola understood the value of such extensions and took the opportunity to develop an AltiVec programming model. This model allows programmers to invoke high-level intrinsics in their application code that directly map to native assembly instructions. However, coding in this style is only valuable if the compiler implements the Motorola AltiVec programming model.

Some compilers do implement the model to simplify the use of the new AltiVec instructions and data types. These compilers group related assembly commands under a single, highly descriptive function name. The compiler selects the instruction that corresponds to the function for the data type that the programmer is using.

Several possible assembly instructions can be generated for the high-level "vec_add" operation, for example, when it is processed by a compiler implementing the AltiVec programming model. It is the compiler's responsibility to choose the right assembly instruction based on the data types of the operands. This translation from a function-call-like intrinsic to low-level assembly carries no penalty in program performance.

High-level extensions can also help the developer avoid errors caused by mixing data types. The assembly instructions differ for signed vs. unsigned integers or 8-bit vs. 16-bit values, yet it is easy to lose track of which data type a variable uses when programming in assembly. However, having the instructions available through HLL extensions allows the error-checking algorithms of the compiler to detect mismatched data types before generating object code.
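The idea of one descriptive name that resolves to a different instruction for each data type can be imitated in standard C11 with a _Generic selection. The sketch below is only an analogy: the types vu8, vu16 and vf32, the helper functions and the macro my_vec_add are all hypothetical stand-ins, and the instruction names in the comments indicate which AltiVec instruction a real vec_add would be expected to select for the corresponding vector type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for three of the AltiVec vector types. */
typedef struct { uint8_t  e[16]; } vu8;
typedef struct { uint16_t e[8];  } vu16;
typedef struct { float    e[4];  } vf32;

static_assert(sizeof(vu8) == 16 && sizeof(vu16) == 16 && sizeof(vf32) == 16,
              "each stand-in type is 128 bits wide");

/* Modulo byte add (corresponds to vaddubm). */
static vu8 add_vu8(vu8 a, vu8 b)
{
    vu8 r;
    for (int i = 0; i < 16; ++i) r.e[i] = (uint8_t)(a.e[i] + b.e[i]);
    return r;
}

/* Modulo halfword add (corresponds to vadduhm). */
static vu16 add_vu16(vu16 a, vu16 b)
{
    vu16 r;
    for (int i = 0; i < 8; ++i) r.e[i] = (uint16_t)(a.e[i] + b.e[i]);
    return r;
}

/* Floating-point add (corresponds to vaddfp). */
static vf32 add_vf32(vf32 a, vf32 b)
{
    vf32 r;
    for (int i = 0; i < 4; ++i) r.e[i] = a.e[i] + b.e[i];
    return r;
}

/* One name for the programmer; the operand type picks the implementation
   at compile time, so no run-time dispatch cost is incurred. */
#define my_vec_add(a, b) \
    _Generic((a), vu8: add_vu8, vu16: add_vu16, vf32: add_vf32)((a), (b))
```

Because the selection has no default case, calling my_vec_add with an unsupported type fails to compile, and passing a second operand of a different struct type than the first is also rejected. This mirrors the type checking that the high-level extensions bring to vector code.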

Leading-edge compilers do more than simply make the vector-processing instruction set more convenient. They also leverage the power of the new instructions to allow simpler system design. The AltiVec architecture, for instance, includes an extremely powerful Vector Permute Unit that allows a single instruction to reorder the bytes of a vector into any pattern desired. By recognizing load and store operations to misaligned memory, such as I/O devices, the compiler can automatically insert permute operations to handle the alignment mismatch. This automatic compiler function simplifies the programmer's tasks.
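How a permute can repair a misaligned access is easier to see in code. The sketch below emulates the technique in portable C under stated assumptions: vec16, permute and load_unaligned are hypothetical names, and the permute here models a byte-select over the 32-byte concatenation of two source vectors, which is how the AltiVec permute instruction is generally described. It is not the code any particular compiler emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec16;  /* stand-in for one 128-bit vector */

/* Emulated byte permute: the low five bits of each control byte select
   one byte from the 32-byte concatenation of x (0..15) and y (16..31). */
static vec16 permute(vec16 x, vec16 y, vec16 control)
{
    vec16 r;
    for (int i = 0; i < 16; ++i) {
        unsigned sel = control.b[i] & 0x1Fu;
        r.b[i] = (sel < 16) ? x.b[sel] : y.b[sel - 16];
    }
    return r;
}

/* Fetches 16 bytes starting at an arbitrary offset using only two
   16-byte-aligned block reads and one permute. */
static vec16 load_unaligned(const uint8_t *aligned_base, size_t length,
                            size_t offset)
{
    size_t block = offset / 16;   /* aligned block holding the first byte */
    size_t shift = offset % 16;   /* misalignment within that block       */
    assert(16 * (block + 2) <= length);  /* both blocks must be readable  */

    vec16 lo, hi, control;
    memcpy(lo.b, aligned_base + 16 * block, 16);
    memcpy(hi.b, aligned_base + 16 * (block + 1), 16);
    for (int i = 0; i < 16; ++i) {
        control.b[i] = (uint8_t)(shift + (size_t)i);  /* shift .. shift+15 */
    }
    return permute(lo, hi, control);
}
```

Note that this simple form always reads the following block, even when the offset happens to be aligned, which is why the sketch checks that both blocks lie inside the buffer.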

These advanced compilers have the further requirement that they produce code that takes full advantage of the target architecture. In the case of the MPC7400, the architecture provides many opportunities for parallel code execution that the Green Hills compiler exploits.

For example, the processor has two integer execution units and a floating-point unit in addition to the AltiVec vector-processing engine. The vector-processing engine is further broken down into two execution units: the Vector Arithmetic Logic Unit and the Vector Permute Unit. These execution units, along with the floating-point and integer units, are connected in parallel, allowing the processor to handle multiple pipelined operations simultaneously.

In addition, the MPC7400 instruction fetch unit can dispatch two instructions in a single clock cycle and allows for out-of-order instruction execution. A good compiler orders instruction execution to take advantage of the MPC7400's architecture. Thus the compiler reduces the execution time of the program and provides the best performance for the specific architecture.

The PowerPC architecture with AltiVec technology extensions provides several opportunities for parallelism. Not only can the vector arithmetic unit perform up to 16 operations in parallel; it also operates in parallel with the integer or floating-point units with no penalty to the programmer. Compilers can make the most of the opportunity by restructuring code to interleave instructions where possible.

By analyzing data dependencies and program flow, the Green Hills compiler reorders the final object code so that the dual-dispatch capability is fully utilized. Instead of having two vector arithmetic operations followed by two floating-point operations, the compiler reorders the code to interleave the vector and floating-point operations so that they execute simultaneously rather than sequentially.

Advanced compilers also restructure code to increase system performance. One such restructuring is to change loop structures to reduce the effects of loop overhead using a technique known as loop "unrolling." Loop unrolling amortizes the loop overhead over a longer stretch of computation by performing several instances of the loop operations in each iteration instead of just one.

For example, consider the case of unrolling a vector addition loop by four. In this case, the cycles needed to keep track of the loop counter and to branch back to the top occur once every four additions instead of once per addition. This results in an improvement in overall loop execution. The Green Hills C Compiler can unroll loops in this manner, or to even greater depths, subject to explicit direction from the programmer.
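Unrolling a vector addition loop by four might look like the following when written by hand in portable C. The type vfloat4 and the function names are hypothetical; vadd stands in for a single vector add, and each function returns the number of trips through the loop so the reduction in loop overhead can be observed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float e[4]; } vfloat4;  /* stand-in for a vector of four floats */

/* Stand-in for one vector add. */
static vfloat4 vadd(vfloat4 a, vfloat4 b)
{
    vfloat4 r;
    for (int i = 0; i < 4; ++i) r.e[i] = a.e[i] + b.e[i];
    return r;
}

/* Rolled loop: counter update and branch once per vector add. */
static size_t add_rolled(const vfloat4 *a, const vfloat4 *b, vfloat4 *out,
                         size_t nvec)
{
    size_t trips = 0;
    for (size_t i = 0; i < nvec; ++i) {
        out[i] = vadd(a[i], b[i]);
        ++trips;
    }
    return trips;
}

/* Unrolled by four: counter update and branch once per four vector adds. */
static size_t add_unrolled4(const vfloat4 *a, const vfloat4 *b, vfloat4 *out,
                            size_t nvec)
{
    assert(nvec % 4 == 0);  /* this sketch omits the leftover iterations */
    size_t trips = 0;
    for (size_t i = 0; i < nvec; i += 4) {
        out[i]     = vadd(a[i],     b[i]);
        out[i + 1] = vadd(a[i + 1], b[i + 1]);
        out[i + 2] = vadd(a[i + 2], b[i + 2]);
        out[i + 3] = vadd(a[i + 3], b[i + 3]);
        ++trips;
    }
    return trips;
}
```

With 256 vectors the rolled loop makes 256 trips and the unrolled loop 64, while both produce the same results. The four independent adds in the unrolled body also give the instruction scheduler more freedom to overlap work.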

Saving execution cycles is one of the goals of optimization, and using the right data structure can advance that goal. In the AltiVec vector processor, for example, the handling of complex numbers runs faster if the real and imaginary portions are in separate vectors rather than interleaved in a single vector. Although the developer codes a single complex vector, a compiler can separate it into two vectors, one of real parts and one of imaginary parts, allowing for faster handling.

Many compilers also save cycles by rearranging object code to hide data load latency. For example, by interleaving load and add instructions, the processor can perform one operation while data is loading for another. Having the compiler generate such code patterns automatically saves programmers the time of making such optimizations by hand.
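The split layout for complex numbers can be sketched in portable C as follows. The names complex_pair, split_complex and multiply_split are hypothetical, and the code shows the transformation written out by hand rather than what a compiler would generate. Once the real and imaginary parts sit in separate arrays, every line of the multiply loop is a uniform element-by-element operation, which is the shape that maps directly onto vector instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout, as the developer would naturally write it. */
typedef struct { float re, im; } complex_pair;

/* Separates an interleaved complex array into real and imaginary arrays. */
static void split_complex(const complex_pair *in, float *re, float *im,
                          size_t n)
{
    assert(n == 0 || (in && re && im));
    for (size_t i = 0; i < n; ++i) {
        re[i] = in[i].re;
        im[i] = in[i].im;
    }
}

/* Element-wise complex multiply on the split layout:
   (a + bi)(c + di) = (ac - bd) + (ad + bc)i.
   Each statement applies the same operation to every element, with no
   shuffling between real and imaginary positions. */
static void multiply_split(const float *are, const float *aim,
                           const float *bre, const float *bim,
                           float *out_re, float *out_im, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out_re[i] = are[i] * bre[i] - aim[i] * bim[i];
        out_im[i] = are[i] * bim[i] + aim[i] * bre[i];
    }
}
```

With the interleaved layout, the same multiply would need extra reordering steps to bring matching real and imaginary parts into position before each arithmetic operation.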

Another optimization technique is to arrange data in main memory so that related vector sets stay together. AltiVec technology provides instructions that let developers identify vector chains, groups of vectors that are used together in the same process. The instructions allow the developer to force the processor to keep related data in the cache by overriding block replacement that might remove a section of the chain.

Cache overwrites for chains become an all-or-none operation. The compiler can ensure that such chains are placed in main memory at page boundaries so that loading them is quicker. Further, it can identify groups of data as chains automatically and insert the appropriate commands.

Most of the optimizations discussed have already been introduced into compilers such as Green Hills Software's Optimizing C Compiler for Motorola's MPC7400 microprocessor with AltiVec technology. The next step for Green Hills, and other developers, is to make vector processing automatic when data structures and repetitive operations allow it. For example, turning a simple addition loop into a vector loop currently requires the software developer to structure the data and code appropriately in the source file. A compiler could recognize such loops and vectorize them automatically, eliminating the need for the developer to identify such opportunities.

Providing automatic vectorization is not easy. It requires compiler technology that recognizes vector patterns and handles boundary conditions without the benefit of programmer insight. The benefits, however, would be tremendous. Automatic vectorization would take any general-purpose existing source code and make immediate use of the power that vector processing brings to the microprocessor.
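One of the boundary conditions an automatic vectorizer has to handle is an array length that is not a multiple of the vector width. A minimal sketch of the usual remedy, written by hand in portable C with the hypothetical name add_with_tail, runs a vector-style main loop over the whole groups of four and then a scalar loop over whatever is left.

```c
#include <assert.h>
#include <stddef.h>

/* Adds two float arrays of any length n.
   Returns the number of four-element (vector-style) iterations used. */
static size_t add_with_tail(const float *a, const float *b, float *out,
                            size_t n)
{
    assert(n == 0 || (a && b && out));

    size_t vector_end = n - (n % 4);  /* last index covered by whole groups */

    /* Main body: each iteration stands in for one vector add. */
    for (size_t i = 0; i < vector_end; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Tail: zero to three leftover elements handled one at a time. */
    for (size_t i = vector_end; i < n; ++i) {
        out[i] = a[i] + b[i];
    }

    return vector_end / 4;
}
```

For ten elements the function performs two vector-style iterations and two scalar additions. A real vectorizer would also have to establish facts that the programmer simply knows, such as whether the arrays are suitably aligned and whether the output overlaps either input.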

It would allow developers to gain the advantage of parallel operation without hand coding or even understanding the rationale behind the vector approach.


Copyright© 2000 by CMP Media LLC, 600 Community Drive, Manhasset, NY 11030.
Reprinted from Electronic Engineering TIMES with permission. 5199