Fast Discrete Wavelet Transform on CUDA
Wavelet analysis, which is the successor of the Fourier analysis, is based on the idea that the same information, the same signal can be represented in different forms, depending on the purpose. To translate a signal into another form, a basis is first chosen - a set of linearly independent functions, and the signal is represented as a sum of these functions with some coefficients. In the Fourier analysis, the basis consists of periodic harmonic functions sin(k*x) and cos(k*x), corresponding to oscillations with a certain frequency throughout the domain of definition. Thus, having obtained the Fourier transform of a signal, we can determine its frequency spectrum and see which oscillations it consists of. If the signal characteristics change (for example, in time for sound signals or in space for images), then we can decompose it into components by using wavelet analysis, in which the basis consists of shifts and scales of a certain mother wavelet function. In such representation of a signal, we can track changes in the frequency spectrum, as well as study the local structure of the signal at a certain frequency, at a certain scale.
The representation of data in the form of wavelet coefficients is usually more compact than the original representation. From the perspective of the information theory, such a representation has less entropy. Moreover, such a representation also allows to obtain an approximation of the original signal with a given accuracy, by using only a fraction of the wavelet coefficients. Due to this, 2D Discrete Wavelet Transform is actively used in image compression (e.g. in popular formats JPEG2000 and DJVU). And due to the ability of transition to multi-scale representation of data, it is also used in image enhancement algorithms, such as noise reduction, filtering, fast scaling, and in artificial intelligence applications, such as feature extraction and segmentation, in compressed sensing applications during signal reconstruction.
The Discrete Wavelet Transform (DWT) of a signal is obtained by sequential applying of two filters (low-pass and high-pass) with decimation to eliminate redundancy. Such an algorithm is called the Mallat algorithm and can be used both for decomposition and for reconstruction. As a result, a signal is decomposed into two components: approximation (lower resolution version) and details. By applying this decomposition repeatedly to low-frequency coefficients, a multi-scale decomposition is obtained, which is called dyadic. This allows us to obtain (and to adjust) information on a wider frequency spectrum: at each decomposition step, the frequency spectrum is doubled, and the minimum resolution is halved.
Since images are two-dimensional, their DWT-transformation is performed both vertically and horizontally. The result is four fragments (subbands) with half the width and half the height, one of which is a smaller copy of the image (LL subband), and the three remaining subbands contain information about the details – horizontal (HL), vertical (LH) and diagonal (HH). At each subsequent step of decomposition, the LL subband is replaced by four smaller subbands, so the total number of subbands increases by three. Similarly, an algorithm is constructed for signals of a higher dimension (e.g. for video sequences as 3D signals or for time-varying computed tomography results as 4D signals).
Fig. 1. Subbands of 2D wavelet coefficients after the first and the second DWT of an image
Many applications of the DWT are time-critical. Therefore, computation of the transform is often boosted by using specialized processors or accelerators, such as FPGA, accelerators with Intel MIC architecture, GPU graphics processors. GPUs have an excellent price-performance ratio and performance to power consumption ratio. Moreover, high-level programming platforms OpenCL and NVIDIA CUDA have contributed to their widespread use in various calculations due to the smooth learning curve, detailed documentation, and an extensive set of examples.
How to implement fast parallel DWT processing?
As a rule, instead of the Mallat separable convolutional algorithm, a lifting scheme is used for computation of DWT, because it has a number of theoretical and practical advantages. Among the theoretical advantages: it is easier to understand and explain (it is not relied on the Fourier transform), and the inverse transform is easily obtained by performing lifting steps in the reverse order. Moreover, if we use integer arithmetic, the absence of loss of accuracy (perfect reversibility) is guaranteed. Practical advantages are a smaller number of arithmetic operations (up to a factor of two), the ability to transform signals of arbitrary length (not only a multiple of 2n), possibility of fully in-place calculation (no auxiliary memory is needed). In addition, the lifting scheme can be used to construct wavelets of the second generation, in which the basis does not necessarily consist of shifts and scales of a certain function.
A considerable amount of research was performed on implementation of the wavelet transform on various platforms, including modern GPUs. The first studies were devoted to the comparison of the separable convolution-based algorithm and the lifting scheme, then the attention of researchers was completely switched to the effective implementation of the lifting scheme. The main difficulty that must be overcome when implementing a lifting scheme on GPU is the need of data exchange between threads after each lifting step. On GPU, data exchange within the threads of a single thread block is performed via shared memory, the size of which is very limited, while data exchange between thread blocks is relatively slow due to global memory access and synchronization of all threads of the kernel. Therefore, in order to achieve the best performance, it is necessary to choose such size of thread blocks and such size of arrays in shared memory that would balance the time for threads synchronization and the time for redundant calculations that are inevitable when the source data is divided into blocks.
For parallel processing of a two-dimensional image on the GPU, different strategies were proposed that can be divided into three groups:
RC methods transform an image as a whole, and a transposition operation is often used to speed up the vertical transformation. When the original image is too large, implementations from the first group will suffer from a lack of shared memory. Moreover, the performance of this approach is limited due to the need to write intermediate results (after column transformation) to global memory.
BB methods transform an image by dividing it into smaller blocks, and the row-column method is applied to each of them separately. Note that the correct result is achieved only in the case when these blocks overlap, where the blocks are extended in each direction on a few pixels depending on the type of wavelet. Such implementations use a limited amount of shared memory, independent of the image size, and they access global memory less frequently compared to RC methods.
Finally, methods of the third group transform an image by using column slabs with sliding window within each slab. The window allows you to keep intermediate results in shared memory longer. As a result, such methods can be considered a hybrid of the first two approaches.
After optimizations and implementations of all three methods, the researchers showed that the BB method is the most efficient, so we've chosen it for our implementation. Within this approach, the original image is divided into blocks of width M and height N, which need to be expanded by four pixels (four for a 4-step lifting scheme of CDF 9/7 and two for a 2-step lifting scheme of CDF 5/3) in each direction. As a result, intersecting blocks of size (M + 8) × (N + 8) can be processed without separate conditions for the boundary pixels. Parameters M and N are chosen empirically to maximize the performance.
Wavelets CDF 9/7 and CDF 5/3
The Cohen-Daubechies-Feauveau (CDF) wavelets belong to the family of biorthogonal wavelets, which, unlike orthogonal wavelets, can be symmetric. They were made popular by Ingrid Daubechies, and two members of this wavelet family (CDF 9/7 and CDF 5/3) are now used in the JPEG2000 image compression. The CDF 9/7 wavelet has four vanishing moments, which means that the corresponding high-frequency wavelet coefficients of a cubic polynomial signal are zero. The CDF 5/3 wavelet has only two vanishing moments, but its main feature is the existence of an integer implementation, which has no loss of accuracy (arising with most of the other wavelets due to rounding of irrational numbers). Due to this property of CDF 5/3, JPEG2000 algorithm has lossless compression mode.
Hardware and software for DWT benchmarking
GPU DWT benchmarks
We've measured the DWT performance for CDF 9/7 and CDF 5/3 wavelets on 24-bit images (8 bits in each of 3 color channels) of different sizes (up to 8K Ultra HD). The number of DWT levels ranged from 1 to 5. The obtained results include only the processing time on GPU, because it is assumed that the original data is already in the GPU memory, and the result of the transformation will be written to the same memory.
From the tables and the figures you can see, that the two-level DWT is ~40% slower than the single-level DWT, while the five-level DWT is ~60% slower. It could be also seen, that the performance of DWT saturates on large images when there is enough work to load each multiprocessor of GPU.
Optimized DWT implementations of CDF 5/3 and CDF 9/7 are utilized in our JPEG2000 codec on CUDA.
Table 1. GPU DWT performance for 4K image (3840×2160, 24-bit)
Table 2. GPU DWT performance for 8K image (7680×4320, 24-bit)
Fig. 2. Fastvideo DWT performance benchmarks on GeForce GTX 1080 for images of different sizes
Fig. 3. Fastvideo DWT frame rate on GeForce GTX 1080 for images of different sizes