Performance of CUDA JPEG Encoding exceeds two times the bandwidth of PCI-Express 3.0 x16
Fastvideo has released a very fast CUDA JPEG encoder. On an NVIDIA GeForce GTX 980, the encoder can exceed 20 GB/s for images already loaded into GPU memory, which is about twice the bandwidth of PCI Express 3.0 x16. The CUDA JPEG encoder from Fastvideo was already the fastest on the market, and it is now three times faster than the previous version.
In 2011 Fastvideo pioneered the first fully parallel JPEG codec for NVIDIA GPUs. Since then, both NVIDIA hardware and Fastvideo software have made a lot of progress. As a result, we have achieved exceptionally high performance for JPEG compression on GPU. We now have an answer to the question "Which is faster: sending an uncompressed 4K image from CPU to GPU over PCI Express 3.0 x16, or compressing it to JPEG on the GPU?" JPEG encoding can now be two times faster than the transfer. This is a new reality and a new level for modern hardware and software.
Fast JPEG compression is a must in media, industrial, scientific, medical and other applications. Long-term realtime video recording from cameras with very high resolution or high frame rate is now quite a standard task. JPEG is the most common format for image storage. Bulk JPEG handling is important for the web, and it is currently possible to resize more than a million JPEG images per hour on just one GPU. With the current advances in JPEG encoding performance, the scope of JPEG usage will definitely expand, especially for realtime applications at 4K and 8K resolutions.
If we compare the time needed to send image data from PC RAM to GPU memory with the JPEG encoding time on the GPU, we clearly see that JPEG compression can be much faster than data transfer over the PCI Express 3.0 x16 bus. Sending a 24-bit 4K image with a resolution of 3840 x 2160 from CPU to GPU over PCI-E 3.0 x16 takes about 2.17 ms. JPEG encoding of the same image on the GPU, with a compression ratio of ~10:1 (JPEG quality 90%) and 4:2:0 color subsampling, takes about 1.13 ms on an NVIDIA GeForce GTX 980. This outstanding result comes from powerful NVIDIA hardware and from Fastvideo's highly optimized, massively parallel implementation of the JPEG algorithm. We now need less time to do JPEG compression on an NVIDIA GPU than to send the same uncompressed image over PCI-E 3.0 x16.
Fig.1: Timing for JPEG encoding vs. PCI Express 3.0 x16 data transfer for a 24-bit image with 4K resolution. Ultra HD: 3840 x 2160, JPEG quality 90% (compression ratio ~10:1), color subsampling 4:2:0. Time measurements were made with NVIDIA Visual Profiler. OS Windows-7, CUDA-6.5 (32-bit).
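The throughput figures follow directly from the timings quoted above. The short Python sketch below redoes that arithmetic, assuming 1 GB = 10^9 bytes and 3 bytes per pixel for a 24-bit image. The function and variable names are illustrative only and are not part of the Fastvideo SDK.

```python
# Figures quoted in the text for a 24-bit 3840 x 2160 image on a GeForce GTX 980.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 3840, 2160, 3
TRANSFER_MS = 2.17  # host-to-device copy over PCI-E 3.0 x16
ENCODE_MS = 1.13    # JPEG encoding on the GPU, quality 90%, 4:2:0


def throughput_gb_per_s(num_bytes, milliseconds):
    """Return throughput in GB/s, assuming 1 GB = 10**9 bytes."""
    return num_bytes / (milliseconds * 1e-3) / 1e9


image_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
pcie_gb_s = throughput_gb_per_s(image_bytes, TRANSFER_MS)
encode_gb_s = throughput_gb_per_s(image_bytes, ENCODE_MS)
speedup = TRANSFER_MS / ENCODE_MS

print(f"uncompressed image: {image_bytes / 1e6:.2f} MB")   # 24.88 MB
print(f"PCI-E transfer:     {pcie_gb_s:.2f} GB/s")          # 11.47 GB/s
print(f"JPEG encoding:      {encode_gb_s:.2f} GB/s")        # 22.02 GB/s
print(f"encode vs transfer: {speedup:.2f}x faster")         # 1.92x
```

With these inputs the encoder processes about 22 GB/s of source pixels against roughly 11.5 GB/s for the bus transfer, which is where the "more than 20 GByte per second" and "two times" figures come from.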
Running the full image processing pipeline on the GPU is a very promising idea. It avoids unnecessary data transfers over the PCI-E bus, and parallel algorithms for image and video processing significantly improve total performance and reduce latency. This idea is successfully implemented in the GPU Image Processing SDK from Fastvideo. Many cameras already work in realtime with this software while doing all image processing on the GPU.
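To give a rough sense of what is saved when the encoded image, rather than the raw one, leaves the GPU, here is a small Python estimate. It assumes that copy time scales linearly with payload size and that the bus delivers the same effective bandwidth in both directions. Neither assumption is a measurement from this article, and the helper name `copy_time_ms` is hypothetical.

```python
# Figures quoted in the article for a 24-bit 3840 x 2160 image.
RAW_BYTES = 3840 * 2160 * 3
TRANSFER_MS = 2.17        # measured copy time for the raw image
COMPRESSION_RATIO = 10.0  # ~10:1 at JPEG quality 90%


def copy_time_ms(num_bytes, ref_bytes=RAW_BYTES, ref_ms=TRANSFER_MS):
    """Estimate copy time by scaling the quoted figure linearly (assumed model)."""
    return ref_ms * num_bytes / ref_bytes


jpeg_bytes = RAW_BYTES / COMPRESSION_RATIO
raw_copy_ms = copy_time_ms(RAW_BYTES)    # processed image copied back uncompressed
jpeg_copy_ms = copy_time_ms(jpeg_bytes)  # only the JPEG stream copied back
saved_ms = raw_copy_ms - jpeg_copy_ms

print(f"raw copy:  {raw_copy_ms:.3f} ms")
print(f"JPEG copy: {jpeg_copy_ms:.3f} ms")
print(f"saved per frame: {saved_ms:.3f} ms")
```

Under these assumptions, copying back the JPEG stream costs about a tenth of the time needed for the raw image. Real transfers also carry a fixed per-copy latency that this linear model ignores.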
The JPEG codec from Fastvideo is available as part of the GPU Image Processing SDK for Windows-7/8/10 and Linux (both 32-bit and 64-bit), and it supports NVIDIA GPUs with Fermi, Kepler and Maxwell architectures. A demo version of the JPEG codec for NVIDIA GPUs is available from the Fastvideo website and runs under Windows-7/8/10. A trial version of the Fastvideo SDK is available upon request.
Fastvideo was founded in 2009 in Dubna, Russia. The company specializes in high speed camera design and in GPU image and video processing. Fastvideo's most powerful product is a high performance SDK for realtime image and video processing on NVIDIA GPUs.
Download page: https://www.fastcompression.com/download/download.htm
Here one can find the latest results for performance of CUDA JPEG Codec.