Fast JPEG codec for NVIDIA GPUs

We have created a fast JPEG codec built on NVIDIA CUDA technology. The CUDA JPEG codec developed by Fastvideo combines strict standards compliance with encoding and decoding speeds well above those of the fastest existing commercial solutions. It is a full, performance-oriented implementation of Baseline JPEG. Compression and decompression are ultra fast because every stage of the Baseline JPEG algorithm runs in parallel on the GPU. Our CUDA JPEG codec is faster than the best commercial multithreaded JPEG codecs for multicore CPUs, and it is also faster than hardware-accelerated JPEG codecs.

Fast JPEG image compression features for CUDA JPEG codec

The implementation is 100% compliant with the JPEG Baseline standard

Baseline JPEG compression and decompression for grayscale (8-bit) and color (24-bit) images with arbitrary width and height

Optional 12-bit JPEG compression for grayscale and color images

Extremely fast lossy image encoding and decoding with variable compression ratio

Subsampling modes: 4:4:4, 4:2:2, 4:2:0

Minimum input image size is 1×1 for grayscale and color images with any subsampling

Maximum input image size is 16,000 × 16,000 or more (optional)

JPEG image quality in the range from 1 to 100

Read/edit/write any EXIF section

Optional parameters: quantization tables for Y and Cb/Cr

Data input: 8/24-bit or 12/36-bit images from RAM/HDD/RAID/SSD/GPU

Data output: final compressed/uncompressed 8/24-bit or 12/36-bit image in RAM/HDD/RAID/SSD/GPU

Standard input formats: PGM, YUV, PPM, BMP, JPG

Continuous data mode (input one image after another)

JPEG Encoding on GPU: Input data parsing, Color Transform, 2D DCT, Quantization, Zig-zag, AC/DC, DPCM, RLE, Huffman coding, Byte stuffing, JFIF formatting

JPEG Decoding on GPU: JFIF parsing, Restart marker search, Huffman decoding, Inverse RLE, Inverse DPCM, AC/DC, Inverse Zig-zag, Inverse Quantization, Inverse DCT, Inverse Color Transform, Output formatting

Optimized for the latest NVIDIA GPUs

Compatibility with FFmpeg to read/write MJPEG streams (FFmpeg is under LGPL v2.1)

Optional integration with OpenGL

Optional support for input from HD-SDI cards (Blackmagic, Bluefish, Deltacast, Imperx)

Compatible with Windows 7/8/10 and Linux (Ubuntu, CentOS, L4T)
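Three of the encoding stages named in the list above (Zig-zag, DPCM and RLE) can be illustrated with a short CPU-side sketch. This is plain Python written for this page, not the codec's GPU code; the function names `zigzag_order`, `dpcm_dc` and `rle_ac` are hypothetical, and the symbols follow the usual Baseline JPEG conventions: (15, 0) stands for a run of sixteen zeros and (0, 0) marks the end of a block.

```python
def zigzag_order(n=8):
    """Return the (row, col) pairs of an n x n block in JPEG zig-zag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        cells = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals are walked bottom-left to top-right, odd ones the other way.
        order.extend(reversed(cells) if s % 2 == 0 else cells)
    return order


def dpcm_dc(dc_values):
    """DPCM: replace each block's DC value by its difference from the previous block."""
    prev, out = 0, []
    for dc in dc_values:
        out.append(dc - prev)
        prev = dc
    return out


def rle_ac(ac):
    """Encode the 63 AC coefficients of one block as (zero_run, value) pairs."""
    last = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    out, run = [], 0
    for v in ac[:last + 1]:
        if v == 0:
            run += 1
            if run == 16:        # sixteen zeros in a row: emit the ZRL symbol
                out.append((15, 0))
                run = 0
        else:
            out.append((run, v))
            run = 0
    if last < len(ac) - 1:       # trailing zeros collapse into end-of-block
        out.append((0, 0))
    return out


# Usage: one quantized block with a few non-zero low-frequency coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0], block[1][1] = 50, 3, -2, 1
scan = [block[r][c] for r, c in zigzag_order()]
print(rle_ac(scan[1:]))          # AC symbols of the block
print(dpcm_dc([50, 52, 49]))     # DC differences across three blocks
```

The resulting (run, value) symbols and DC differences are what the Huffman stage then encodes.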

Why can JPEG on CUDA be so fast?

We have succeeded in parallelizing all stages of the JPEG algorithm, including entropy encoding and decoding. There was a widespread opinion that the RLE and Huffman algorithms could only be serial. In our solution RLE and Huffman coding are fully parallel and are no longer bottlenecks. We do not off-load anything from the GPU to the CPU to make the JPEG codec faster: the CUDA JPEG codec is extremely fast and runs completely on the GPU.
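One commonly described way to write variable-length codes in parallel is to compute an exclusive prefix sum over the code lengths, which gives every codeword its bit offset in the output, so that each codeword can then be written independently. The sketch below simulates that idea in plain Python. It is an illustration under that assumption only: the text above does not say how Fastvideo's entropy coder works, and the name `pack_parallel` and the sample codes are hypothetical.

```python
from itertools import accumulate


def pack_parallel(codes):
    """Pack (bits, length) codewords into one bit string using prefix-sum offsets."""
    lengths = [n for _, n in codes]
    # Exclusive prefix sum: bit position at which each codeword starts.
    offsets = [0] + list(accumulate(lengths))[:-1]
    out = [0] * sum(lengths)
    # Every iteration touches a disjoint slice of `out`, so on a GPU each
    # codeword could be handled by its own thread.
    for (bits, n), off in zip(codes, offsets):
        for k in range(n):
            out[off + k] = (bits >> (n - 1 - k)) & 1
    return "".join(map(str, out))


def pack_serial(codes):
    """Reference: append codewords one after another."""
    return "".join(format(bits, "0{}b".format(n)) for bits, n in codes)


# Usage: four made-up codewords of different lengths.
codes = [(0b10, 2), (0b0, 1), (0b111, 3), (0b1100, 4)]
print(pack_parallel(codes))
```

Both functions produce the same bitstream; only the prefix-sum version removes the dependency of each codeword on the ones before it. On the decoding side, restart markers in the stream give independent entry points that serve a similar purpose.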

There are many scientific papers about JPEG compression on CUDA in which the authors try to accelerate the baseline DCT module. The idea of parallel computation on CUDA leads to that task immediately, but it is just a small part of the whole solution for CUDA acceleration of the JPEG algorithm. Parallel computing can be applied to all stages of both the JPEG encoder and the JPEG decoder. Partitioning the image into a large number of 8×8 or 16×16 blocks is the key to speeding up a JPEG codec on the GPU. The most difficult part of the JPEG algorithm is the entropy codec, and we have implemented it on the GPU as well, so every constituent part of the JPEG algorithm is accelerated. This is the main idea behind image processing speedup on CUDA: we have to create a CUDA-based version of each algorithm in the pipeline. All our software is implemented according to that approach.
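To give a sense of how much independent work the block partitioning exposes, the following sketch counts the minimum coded units (MCUs) and 8×8 blocks of an image for each subsampling mode. It assumes the standard JPEG layout (8×8 MCUs for 4:4:4, 16×8 for 4:2:2, 16×16 for 4:2:0, with edge blocks padded up to a full MCU); the function name `count_blocks` is hypothetical.

```python
def count_blocks(width, height, subsampling="4:2:0"):
    """Return (number of MCUs, number of 8x8 blocks) for a 3-component image."""
    # MCU width, MCU height, and 8x8 blocks per MCU (Y blocks + Cb + Cr).
    layout = {
        "4:4:4": (8, 8, 1 + 1 + 1),
        "4:2:2": (16, 8, 2 + 1 + 1),
        "4:2:0": (16, 16, 4 + 1 + 1),
    }
    mcu_w, mcu_h, blocks_per_mcu = layout[subsampling]
    # Integer ceiling division: partial MCUs at the borders count as full ones.
    mcus = -(-width // mcu_w) * -(-height // mcu_h)
    return mcus, mcus * blocks_per_mcu


# Usage: a 4K frame with 4:2:0 subsampling.
print(count_blocks(3840, 2160, "4:2:0"))
```

A 4K frame with 4:2:0 subsampling yields 32,400 MCUs and 194,400 blocks, each of which can go through color transform, DCT and quantization independently of the others.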

Baseline JPEG encoding of a 24-bit 4K color image (3840 × 2160) at JPEG quality 90 with 4:2:0 subsampling now takes just 0.51 ms; these settings correspond to a compression ratio of about 10:1. We chose these encoding parameters because they correspond to so-called "visually lossless" compression.

Regarding the accuracy of the JPEG algorithm on CUDA: we implemented Color Transform, 2D DCT and Quantization in floating point and round the result only afterwards, which improves accuracy compared with the conventional approach.
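The idea of staying in floating point until a single final rounding can be sketched for one 8×8 block as follows. This is a direct, unoptimized Python version of the standard DCT-II formula followed by quantization, meant only as an illustration; the names `dct2_8x8` and `quantize` and the flat quantization table are assumptions, and a real GPU kernel would use a fast factorized DCT instead.

```python
import math

N = 8


def dct2_8x8(block):
    """Forward 2D DCT-II of an 8x8 block of level-shifted samples, in floating point."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out


def quantize(coeffs, qtable):
    """Divide in floating point and round once, at the very end."""
    # Python's round() resolves exact .5 ties to the nearest even integer.
    return [[round(coeffs[u][v] / qtable[u][v]) for v in range(N)]
            for u in range(N)]


# Usage: a flat block of value 16 has only a DC component (128 before quantization).
flat = [[16] * N for _ in range(N)]
qtable = [[16] * N for _ in range(N)]
quantized = quantize(dct2_8x8(flat), qtable)
print(quantized[0][0], quantized[0][1])
```

Because no intermediate value is rounded, the only rounding error is the one introduced by the final quantization step.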

These are the latest performance benchmarks for encoding 24-bit 2K and 4K images on the NVIDIA GeForce GTX 1080 Ti and Quadro P6000 (JPEG compression on the GPU, without device I/O latency, single-image mode, no batching, no streaming):

Full HD (2K, 1920 × 1080) ~ 35 GByte/s (0.17 ms)
4K (3840 × 2160) ~ 46 GByte/s (0.51 ms)

These are the JPEG decoding performance benchmarks on the NVIDIA GeForce GTX 1080 Ti and Quadro P6000 (without device I/O latency, single-image mode, no batching, no streaming):

Full HD (2K, 1920 × 1080) ~ 5.3 GByte/s (1.2 ms)
4K (3840 × 2160) ~ 11.2 GByte/s (2.12 ms)

The above results are much faster than the benchmarks of libjpeg-turbo and TurboJPEG on CPU. Even when host-to-device and device-to-host transfers are taken into account, the CUDA JPEG codec still performs much better than libjpeg-turbo. More CUDA performance measurements can be found here.

We have measured JPEG encoding performance on the NVIDIA GeForce RTX 2080 Ti for high-resolution images from modern machine vision cameras. JPEG compression of a 9433 × 7000 image (24-bit, subsampling 4:2:0, quality 90) takes 3.3 ms, which corresponds to a throughput of ~60 GB/s.
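The throughput figure follows from the uncompressed image size divided by the encoding time. A minimal check of that arithmetic, assuming throughput is measured against the raw 24-bit input (3 bytes per pixel) and that GB means 10^9 bytes; the function name `throughput_gb_per_s` is hypothetical:

```python
def throughput_gb_per_s(width, height, bytes_per_pixel, time_ms):
    """Uncompressed bytes processed per second, in units of 10^9 bytes."""
    raw_bytes = width * height * bytes_per_pixel
    return raw_bytes / (time_ms / 1000.0) / 1e9


# Usage: the 9433 x 7000, 24-bit image encoded in 3.3 ms.
rate = throughput_gb_per_s(9433, 7000, 3, 3.3)
print(round(rate, 1))
```

The raw image is 198,093,000 bytes, so 3.3 ms gives about 60 GB/s. The published times are rounded, so recomputing other figures on this page the same way gives values close to, but not always exactly equal to, the quoted rates.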

The latest benchmarks on the NVIDIA GeForce RTX 4090

We have also measured JPEG encoding performance on the latest NVIDIA GeForce RTX 4090 for high-resolution images from modern machine vision cameras. JPEG compression of a 5328 × 4508 image (24-bit, subsampling 4:2:0, quality 90) takes 0.65 ms, which corresponds to a throughput of ~117 GB/s. This is more than three times the maximum bandwidth of the PCIe 4.0 interface.
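The comparison with PCIe 4.0 can be reproduced from the interface parameters. The sketch below assumes an x16 link, counts one direction only, and ignores protocol overhead beyond the 128b/130b line encoding; the lane count and the function name `pcie_bandwidth_gb_per_s` are assumptions, since the text does not state the link width.

```python
def pcie_bandwidth_gb_per_s(gigatransfers_per_s, lanes, payload_bits=128, line_bits=130):
    """Theoretical one-direction bandwidth of a PCIe link in 10^9 bytes per second."""
    bits_per_s_per_lane = gigatransfers_per_s * 1e9 * payload_bits / line_bits
    return bits_per_s_per_lane / 8 * lanes / 1e9


# Usage: PCIe 4.0 runs at 16 GT/s per lane; assume a x16 slot.
pcie4_x16 = pcie_bandwidth_gb_per_s(16, 16)
ratio = 117 / pcie4_x16
print(round(pcie4_x16, 1), round(ratio, 1))
```

Under these assumptions the link tops out at about 31.5 GB/s, so the quoted ~117 GB/s encoding rate is roughly 3.7 times what the bus could deliver, which is why the figure only applies to data already resident on the GPU.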

Options for Fast JPEG Codec

We have also included Fast JPEG codec in our main product, Fastvideo Image & Video Processing SDK. That SDK includes dark frame subtraction, shading correction, white balance, demosaicing, denoising, color correction, tone mapping, HDR, image filtering, 1D LUT, gamma, color management, 3D LUT, color grading, histogram, parade, resize, crop, rotate, remap, integral image, defringe, undistortion, sharpening, OpenGL or GLFW output, integration with FFmpeg, raw Bayer codec, J2K codec, MXF player, etc.

Here you can see benchmarks for JPEG compression and decompression, debayering, resizing, denoising and JPEG2000 on the NVIDIA GeForce GTX 1080, Quadro P6000 and Tesla V100, and on the mobile Jetson Nano, TX2 and AGX Xavier.

Licensing for Fast JPEG Codec

We license Fast JPEG Codec and other components of Fastvideo Image & Video Processing SDK to software developers, camera manufacturers and resellers, internet providers, system integrators, etc. Our SDK is used in a wide range of imaging applications. A demo SDK, documentation, licensing information and a quotation are available upon request. We also offer custom software design according to an agreed specification. If you need a significant GPU speedup for your image processing application, don't hesitate to contact us.

More info about fast JPEG codec and Fastvideo SDK

Roadmap for further improvements of Fast JPEG Codec

Fast JPEG codec integration into GPU RAW Processor software - done
Fast VCR software for XIMEA cameras (realtime raw image processing with output JPEG encoding and integrated camera control) - done
Further optimizations for Fast JPEG codec - in progress
Minimum memory usage on GPU - in progress