Accelerating Image Processing Pipelines in a Hardware/Software Environment
Heather Quinn* and Miriam Leeser
Northeastern University
Laurie Smith King
College of the Holy Cross
Abstract
Processing satellite images requires a sequence of image processing steps. Such an image processing pipeline can often benefit from acceleration with reconfigurable hardware. We have been developing a codesign environment that eases the process of creating reconfigurable applications for image processing using a library of image processing algorithms [1]. We have found that the performance of such an application is highly dependent on image size. While an application implemented on an FPGA can be one to two orders of magnitude faster than the application implemented in software, processing in hardware incurs additional costs that are not required for software. Some of these costs are hardware initialization costs, extra processing steps for easy processing of the border cases, and communication of the image to and from the reconfigurable device. The runtime of image processing applications varies with image size, so processing small images on an FPGA might not be efficient due to the additional overhead. Detailed profiling of hardware (with the extra costs) and software runtimes for a range of image sizes exposes a crossover point where all smaller images should be processed in software and all larger images should be processed in hardware. The goal of this project is to determine at runtime how to decide which implementation to execute for the optimal runtime for a given image.
Often a series of image processing transformations will be applied to an image. The series of transformations is called a pipeline and the individual transformations are called components. We assume that, for each component, there are hardware and software implementations and that the profiles for both implementations are known. We would like to profile a pipeline so that the an assignment of each component to an implementation yields the fastest pipeline for a given image size when all additional costs are taken into account. Ideally, we will be able to predict the behavior of a pipeline from the hardware/software profiles of the components and profiling of the extra costs. A pipeline profile is not easily characterized, though. A pipeline with N components will have 2N implementations. Each implementation of a pipeline will have a different profile, because the extra costs are different for each combination of implementations. At any given image size, a different implementation might give the fastest runtime. Pipelines also have an extra associated cost, as the FPGA needs to be reprogrammed to satisfy a given implementation. Finally, a pipeline’s profile is uniquely defined. Adding or removing even one component from the pipeline creates a completely different set of pipeline implementations and profiles. This problem of determining how to implement a pipeline is very close to the partitioning problem in hardware/software codesign.
Here are our assumptions about components and how components’ implementations combine:
|
Case |
Do |
|
Hardware to Hardware |
fix the image padding |
|
Hardware to Software |
remove image padding |
|
Software to Hardware |
pad the image |
|
Software to Software |
no changes |
An example of how these costs come into play in a pipeline are shown in Figure 1 for the pipeline Median Filterà Edge Detection. The software and hardware boundaries are denoted as light gray boxes. Every time a hardware or software boundary is crossed a communication cost is incurred. The other costs, such as padding the image and reprogramming the hardware, are shown as dark gray boxes. From this figure, we can see that assigning a component implementation to its hardware or software implementation is not a simple "yes/no" assignment, since all of the extra costs need to be weighed as well as hardware/software algorithm speedup. Figure 2 shows a table of software and hardware runtimes based on image sizes and the crossover point for the individual Median Filter and Edge Detection components. We can see from these equations that Edge Detection has a better hardware speedup than Median Filter, which gives Edge Detection a smaller crossover point than Median Filter. The profiles of the pipeline implementations show that an all hardware implementation, shown in Figure 1d, has the fastest runtime for images greater than 1680 pixels. There is a distinct crossover point at 1680 pixels and all images smaller run fastest if processed in an all software implementation, as shown in Figure 1a. The runtimes of the pipeline implementations can be seen in Figure 3.
We have started preliminary work on predicting an image processing pipeline’s behavior using the component’s profiles. We have made some assumptions, such as the profile for using two components in series is additive. We have also profiled the costs, so that we can predict that behavior as well. With our example from above we were able to predict the behavior with up to 14% accuracy.
Choosing the proper implementation of a pipeline can have a direct effect of the performance of the pipeline. Overhead costs cannot be ignored. With known component profiles, the implementation of a pipeline can be predicted so that profiling the complete pipeline will not be necessary. We have shown a short example the shows how the Median Filter and Edge Detection components work together in an image processing pipeline and the extra costs incurred on the hardware/software boundaries.

Figure 1: Four HW/SW Assignments of Median Filter à Edge Detection
|
Algorithm |
Software Runtime (ms) |
Hardware Runtime (ms) |
Crossover Point |
|
Median Filter |
0.083x + 169.72 |
0.0025x + 532.97 |
3600 pixels |
|
Edge Detection |
0.1502x + 65.061 |
0.0032x + 357.93 |
1680 pixels |
Figure 2: Software/Hardware Runtimes as a Function of the Number of Pixels (x) and the Crossover Point for Median Filter and Edge Detection
|
Permutation |
Runtime (ms) |
|
Softwareà Software |
0.2356x + 134.87 |
|
Softwareà Hardware |
0.0906x + 358.73 |
|
Hardwareà Software |
0.1519x + 253.69 |
|
Hardwareà Hardware |
0.0036x + 459.22 |
Figure 3: Runtime as a Function of the Number of Pixels (x) for Each Implementation of the Median Filterà Edge Detection Pipeline
REFERENCES
[1] L.A. Smith King, Heather Quinn, Miriam Leeser, Demetris Galatopoullos, and Elias Manolakos, Runtime Execution of Reconfigurable Hardware in a Java Environment, (IEEE) International Conference on Computer Design (ICCD 2001), September 23-26, 2001, Austin, TX.