Fast JPEG codec for NVIDIA GPUs

We have created a fast JPEG codec based on NVIDIA CUDA technology. The CUDA JPEG codec developed by Fastvideo combines strict standards compliance with encoding and decoding speeds that exceed those of the fastest existing commercial solutions. It is a full, performance-oriented implementation of Baseline JPEG. We achieved ultra-fast JPEG compression and decompression on the GPU through a fully parallel implementation of the Baseline JPEG algorithm. Our CUDA JPEG codec is faster than the best commercial multithreaded JPEG codecs for multicore CPUs, and it is also faster than hardware-accelerated JPEG codecs.

Fast JPEG image compression features for CUDA JPEG codec
Why can JPEG on CUDA be so fast?

We have succeeded in parallelizing all stages of the JPEG algorithm, including entropy encoding and decoding. There was a widespread opinion that the RLE and Huffman algorithms could only be serial. In our solution they are fully parallel and are no longer bottlenecks. We no longer offload anything from the GPU to the CPU to make the JPEG codec faster: the entire codec runs on the GPU.

Many scientific papers on JPEG compression with CUDA focus on accelerating the baseline DCT module. The idea of parallel computation on CUDA leads to that task immediately, but it is only a small part of a complete CUDA-accelerated JPEG solution. Parallel computing can be applied to every stage of both the JPEG encoder and the JPEG decoder. Partitioning the image into a large number of 8×8 or 16×16 blocks is the key to speeding up a JPEG codec on the GPU. The most difficult part of the JPEG algorithm is the entropy codec, and we have implemented it on the GPU as well. Our solution runs on the GPU, and we have accelerated all constituent parts of the JPEG algorithm.

This is the main idea behind image processing speedup on CUDA: we have to create a CUDA-based version of each algorithm in the pipeline. All our software was implemented according to that approach.

We now need just 0.51 ms for Baseline JPEG encoding of a 24-bit color image at 4K resolution (3840 × 2160) with JPEG quality 90% and 4:2:0 subsampling, which corresponds to a compression ratio of about 10:1. We chose these encoding parameters because they correspond to so-called "visually lossless" compression.

As for accuracy, our CUDA implementation performs the Color Transform, 2D DCT and Quantization in floating point and rounds the result only afterwards, which improves accuracy compared with the conventional approach.
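To illustrate the two ideas above (independent 8×8 blocks, and floating-point arithmetic with a single rounding step at the end), here is a minimal serial sketch in plain Python. It is not Fastvideo's implementation: the function names, the flat quantization table and the test image are hypothetical, and on a GPU each block would be handled by its own group of threads rather than by a loop.

```python
import math

N = 8  # JPEG block size

def rgb_to_luma(r, g, b):
    """JFIF luma from RGB, kept in float (no intermediate rounding)."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def dct_2d(block):
    """Orthonormal 8x8 forward DCT-II computed in floating point."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

def encode_block(samples, qtable):
    """Level shift, DCT and quantize one block; round only once, at the end.

    Python's round() uses round-half-to-even; a real codec may use a
    different rounding rule (this choice is an assumption of the sketch).
    """
    shifted = [[p - 128.0 for p in row] for row in samples]
    coeffs = dct_2d(shifted)
    return [[round(coeffs[u][v] / qtable[u][v]) for v in range(N)]
            for u in range(N)]

def split_blocks(plane):
    """Yield independent 8x8 blocks; dimensions are assumed multiples of 8."""
    height, width = len(plane), len(plane[0])
    for by in range(0, height, N):
        for bx in range(0, width, N):
            yield [row[bx:bx + N] for row in plane[by:by + N]]

# Hypothetical inputs: a uniform gray 16x16 RGB image and a flat table.
flat_q = [[16] * N for _ in range(N)]
rgb = [[(144, 144, 144)] * 16 for _ in range(16)]
luma = [[rgb_to_luma(*px) for px in row] for row in rgb]

# Each block depends only on its own samples, so this loop is the part
# that maps naturally onto parallel GPU work.
quantized = [encode_block(b, flat_q) for b in split_blocks(luma)]
print(len(quantized), quantized[0][0][0])
```

Because no block reads data from any other block, the work scales with the number of blocks, which is why a 4K image with tens of thousands of blocks suits the GPU well. Chroma subsampling and entropy coding are deliberately left out of this sketch.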
These are the latest performance benchmarks for encoding 24-bit 2K and 4K images (JPEG compression on the GPU, without device I/O latency, single image mode, no batching, no streaming) on the NVIDIA GeForce GTX 1080 Ti and Quadro P6000:
These are the JPEG decoding performance benchmarks on the NVIDIA GeForce GTX 1080 Ti and Quadro P6000 (without device I/O latency, single image mode, no batching, no streaming):
The above results are much faster than the benchmarks of libjpeg-turbo and turbojpeg on the CPU. Even if we take host-to-device and device-to-host transfers into account, the performance of the CUDA JPEG codec is still much higher than that of libjpeg-turbo. More CUDA performance measurements are available here.

We have measured JPEG encoding performance on the NVIDIA GeForce RTX 2080 Ti for high-resolution images from modern machine vision cameras. JPEG compression (image resolution 9433 × 7000, 24-bit, subsampling 4:2:0, quality 90) can be done within 3.3 ms, which corresponds to a performance of ~60 GB/s.

The latest benchmarks on the NVIDIA GeForce RTX 4090

We have also measured JPEG encoding performance on the latest NVIDIA GeForce RTX 4090 for high-resolution images from modern machine vision cameras. JPEG compression (image resolution 5328 × 4508, 24-bit, subsampling 4:2:0, quality 90) can be done within 0.65 ms, which corresponds to a performance of ~117 GB/s. Assuming that the real bandwidth of the PCIe 4.0 x16 interface is around 24 GB/s, this is almost 5 times faster.

Options for Fast JPEG Codec

We have also included the Fast JPEG codec in our main product, the Fastvideo Image & Video Processing SDK. That SDK includes dark frame subtraction, shading correction, white balance, demosaicing, denoising, color correction, tone mapping, HDR, image filtering, 1D LUT, gamma, color management, 3D LUT, color grading, histogram, parade, resize, crop, rotate, remap, integral image, defringe, undistortion, sharpening, OpenGL or GLFW output, integration with FFmpeg, raw Bayer codec, J2K codec, MXF player, etc.

Here you can see some benchmarks for JPEG compression and decompression, debayering, resizing, denoising and JPEG2000 on the NVIDIA GeForce GTX 1080, Quadro P6000, Tesla V100, and the mobile Jetson Nano, TX2 and AGX Xavier.
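The throughput figures quoted for the RTX 2080 Ti and RTX 4090 can be checked with simple arithmetic. The sketch below assumes that throughput is measured against the uncompressed 24-bit input (3 bytes per pixel) and that 1 GB = 10^9 bytes; the function name is hypothetical.

```python
def throughput_gb_per_s(width, height, time_ms, bytes_per_pixel=3):
    """Uncompressed input bytes processed per second, in GB/s (1 GB = 1e9 B)."""
    total_bytes = width * height * bytes_per_pixel
    return total_bytes / (time_ms * 1e-3) / 1e9

# Resolutions and timings as quoted in the text (timings are rounded).
rtx2080ti = throughput_gb_per_s(9433, 7000, 3.3)
rtx4090 = throughput_gb_per_s(5328, 4508, 0.65)

print(f"RTX 2080 Ti: {rtx2080ti:.1f} GB/s")
print(f"RTX 4090:    {rtx4090:.1f} GB/s")
```

With the rounded timings from the text, this gives about 60 GB/s for the RTX 2080 Ti, in line with the quoted figure, and about 111 GB/s for the RTX 4090. The latter is slightly below the quoted ~117 GB/s, which may be based on an unrounded timing.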
Licensing for Fast JPEG Codec

We license the Fast JPEG Codec and other components of the Fastvideo Image & Video Processing SDK to software developers, camera manufacturers and resellers, internet providers, system integrators, etc. Our SDK is used in a wide range of imaging applications. A demo SDK, documentation, licensing information and a quotation are available upon request. We also offer custom software design according to an agreed specification. If you need a significant GPU speedup for your image processing application, don't hesitate to contact us.

More info about fast JPEG codec and Fastvideo SDK

Roadmap for further improvements of Fast JPEG Codec