Accelerating Image Processing Pipelines in a Hardware/Software Environment

Heather Quinn* and Miriam Leeser
Northeastern University

Laurie Smith King
College of the Holy Cross

Abstract

Processing satellite images requires a sequence of image processing steps. Such an image processing pipeline can often benefit from acceleration with reconfigurable hardware. We have been developing a codesign environment that eases the process of creating reconfigurable applications for image processing using a library of image processing algorithms [1]. We have found that the performance of such an application is highly dependent on image size. While an application implemented on an FPGA can be one to two orders of magnitude faster than the application implemented in software, processing in hardware incurs additional costs that are not required for software. Some of these costs are hardware initialization costs, extra processing steps for easy processing of the border cases, and communication of the image to and from the reconfigurable device. The runtime of image processing applications varies with image size, so processing small images on an FPGA might not be efficient due to the additional overhead. Detailed profiling of hardware (with the extra costs) and software runtimes for a range of image sizes exposes a crossover point where all smaller images should be processed in software and all larger images should be processed in hardware. The goal of this project is to determine at runtime how to decide which implementation to execute for the optimal runtime for a given image.

Often a series of image processing transformations will be applied to an image. The series of transformations is called a pipeline and the individual transformations are called components. We assume that, for each component, there are hardware and software implementations and that the profiles for both implementations are known. We would like to profile a pipeline so that the an assignment of each component to an implementation yields the fastest pipeline for a given image size when all additional costs are taken into account. Ideally, we will be able to predict the behavior of a pipeline from the hardware/software profiles of the components and profiling of the extra costs. A pipeline profile is not easily characterized, though. A pipeline with N components will have 2N implementations. Each implementation of a pipeline will have a different profile, because the extra costs are different for each combination of implementations. At any given image size, a different implementation might give the fastest runtime. Pipelines also have an extra associated cost, as the FPGA needs to be reprogrammed to satisfy a given implementation. Finally, a pipeline’s profile is uniquely defined. Adding or removing even one component from the pipeline creates a completely different set of pipeline implementations and profiles. This problem of determining how to implement a pipeline is very close to the partitioning problem in hardware/software codesign.

Here are our assumptions about components and how components’ implementations combine:

    1. All components are image in/image out.
    2. Only one component is mapped to hardware at a time. The FPGA device needs to be reprogrammed for each hardware component being used.
    3. Communication costs are incurred at the beginning and end of each hardware sub-sequence.
    4. Some of the algorithms require that the edge be duplicated for processing the pixels on the boundary of the image. Software algorithms can duplicate the boundary when an edge pixel is being processed, but hardware algorithms cannot. These hardware algorithms require the edge be duplicated before transferring the image into SRAM. The edge duplication process (image padding) occurs in software.
    5. If there are back to back algorithms that require image padding, then there are these cases:

Case

Do

Hardware to Hardware

fix the image padding

Hardware to Software

remove image padding

Software to Hardware

pad the image

Software to Software

no changes

An example of how these costs come into play in a pipeline are shown in Figure 1 for the pipeline Median Filterà Edge Detection. The software and hardware boundaries are denoted as light gray boxes. Every time a hardware or software boundary is crossed a communication cost is incurred. The other costs, such as padding the image and reprogramming the hardware, are shown as dark gray boxes. From this figure, we can see that assigning a component implementation to its hardware or software implementation is not a simple "yes/no" assignment, since all of the extra costs need to be weighed as well as hardware/software algorithm speedup. Figure 2 shows a table of software and hardware runtimes based on image sizes and the crossover point for the individual Median Filter and Edge Detection components. We can see from these equations that Edge Detection has a better hardware speedup than Median Filter, which gives Edge Detection a smaller crossover point than Median Filter. The profiles of the pipeline implementations show that an all hardware implementation, shown in Figure 1d, has the fastest runtime for images greater than 1680 pixels. There is a distinct crossover point at 1680 pixels and all images smaller run fastest if processed in an all software implementation, as shown in Figure 1a. The runtimes of the pipeline implementations can be seen in Figure 3.

We have started preliminary work on predicting an image processing pipeline’s behavior using the component’s profiles. We have made some assumptions, such as the profile for using two components in series is additive. We have also profiled the costs, so that we can predict that behavior as well. With our example from above we were able to predict the behavior with up to 14% accuracy.

Choosing the proper implementation of a pipeline can have a direct effect of the performance of the pipeline. Overhead costs cannot be ignored. With known component profiles, the implementation of a pipeline can be predicted so that profiling the complete pipeline will not be necessary. We have shown a short example the shows how the Median Filter and Edge Detection components work together in an image processing pipeline and the extra costs incurred on the hardware/software boundaries.

 

Figure 1: Four HW/SW Assignments of Median Filter à Edge Detection

 

Algorithm

Software Runtime (ms)

Hardware Runtime (ms)

Crossover Point

Median Filter

0.083x + 169.72

0.0025x + 532.97

3600 pixels

Edge Detection

0.1502x + 65.061

0.0032x + 357.93

1680 pixels

Figure 2: Software/Hardware Runtimes as a Function of the Number of Pixels (x) and the Crossover Point for Median Filter and Edge Detection

Permutation

Runtime (ms)

Softwareà Software

0.2356x + 134.87

Softwareà Hardware

0.0906x + 358.73

Hardwareà Software

0.1519x + 253.69

Hardwareà Hardware

0.0036x + 459.22

Figure 3: Runtime as a Function of the Number of Pixels (x) for Each Implementation of the Median Filterà Edge Detection Pipeline

 

REFERENCES

[1] L.A. Smith King, Heather Quinn, Miriam Leeser, Demetris Galatopoullos, and Elias Manolakos, Runtime Execution of Reconfigurable Hardware in a Java Environment, (IEEE) International Conference on Computer Design (ICCD 2001), September 23-26, 2001, Austin, TX.