CUDA Denoiser filter for camera software

Denoising is widely used in camera applications, especially in low-light conditions. We have developed several CUDA-accelerated denoising kernels that run on existing NVIDIA hardware under Windows, Linux, and ARM. Both the image and video denoisers achieve high performance on CUDA; measured timings are listed in the benchmarks below.

CUDA Denoiser Library Features

  • Input format: 8/10/12/14/16-bit per channel input data array from CPU or GPU memory
  • Output format: 24/48-bit output data array in CPU or GPU memory
  • Denoising with 16/32-bit accuracy
  • High speed denoiser without AI
  • Denoiser algorithms
    • Wavelet denoiser (RAW and RGB): CDF 5/3 and CDF 9/7 with hard, soft, or garrote thresholding
    • Bilateral denoiser
    • NLM denoiser
  • Compatibility with Fast VCR software for machine vision cameras
  • Timing and performance measurements
  • OS: Windows 10/11, Linux Ubuntu, and L4T (Jetson)
  • Compatibility with NVIDIA GPUs (Jetson, GeForce, Quadro, Tesla), compute capability >= 5.0, CUDA 12.3
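
The three thresholding rules named for the wavelet denoiser have standard textbook definitions, sketched below in plain Python for a single wavelet coefficient. This is an illustration only: the function names are hypothetical, and the assumption that the library applies exactly these formulas on the GPU is ours, not something stated on this page.

```python
import math

def hard_threshold(x: float, t: float) -> float:
    """Keep the coefficient unchanged if its magnitude exceeds t, else zero it."""
    return x if abs(x) > t else 0.0

def soft_threshold(x: float, t: float) -> float:
    """Shrink the magnitude by t; coefficients at or below t become zero."""
    return math.copysign(max(abs(x) - t, 0.0), x)

def garrote_threshold(x: float, t: float) -> float:
    """Non-negative garrote: x - t^2 / x above the threshold, zero otherwise."""
    return x - (t * t) / x if abs(x) > t else 0.0
```

For a coefficient of 100 and a threshold of 80, hard thresholding keeps 100, soft thresholding returns 20, and garrote returns 36. Garrote sits between the other two: it suppresses small coefficients like soft thresholding but approaches the unmodified value for large ones.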

Benchmarks for CUDA Denoiser

Image resolution: 4112×2176 (8.9 MPix), 16-bit per channel, RGB

Test description: all data in GPU memory, timing includes CUDA computations only

2D Wavelet transform: CDF 9/7
Number of DWT resolutions: up to 7
DWT thresholds for YCbCr: 80;150;150
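
To make the wavelet step concrete, here is a minimal sketch of one level of a 1-D lifting transform. It uses the reversible integer CDF 5/3 filter rather than the CDF 9/7 filter used in the benchmark, because 5/3 is shorter to write down. The function names and the mirrored boundary handling are assumptions for illustration and do not describe the library's CUDA kernels.

```python
def cdf53_forward(x):
    """One level of the integer CDF 5/3 lifting transform (even-length input)."""
    n = len(x)
    assert n >= 2 and n % 2 == 0
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Predict: detail = odd sample minus the average of its even neighbours
    # (the right neighbour is mirrored at the end of the signal).
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
         for i in range(half)]
    # Update: approximation = even sample plus a quarter of the adjacent
    # details (the left detail is mirrored at the start of the signal).
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    return s, d

def cdf53_inverse(s, d):
    """Undo cdf53_forward by running the lifting steps in reverse order."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x
```

A wavelet denoiser typically applies such a transform along rows and columns over several resolution levels, thresholds the detail coefficients `d`, and then runs the inverse transform. On a constant signal all details are zero, so thresholding leaves flat regions untouched.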

NLM denoiser parameter ranges: blur window 3×3 or larger, search window 3×3 or larger, strength 1-3000
The NLM denoiser accepts input data with 4:4:4 or 4:2:0 subsampling
NLM can also use independent denoising parameters for the Y and Cb/Cr channels in both 4:2:0 and 4:4:4 subsampling modes
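
The difference between the two subsampling modes can be expressed as a sample count. The helper below is a hypothetical illustration (not part of the library) that counts the Y, Cb, and Cr samples in a frame, assuming the usual convention that 4:2:0 halves chroma resolution in both directions.

```python
def sample_count(width: int, height: int, mode: str) -> int:
    """Total number of Y, Cb and Cr samples for one frame in the given mode."""
    luma = width * height
    if mode == "4:4:4":
        chroma = luma  # full-resolution Cb and Cr planes
    elif mode == "4:2:0":
        # Chroma halved horizontally and vertically, rounded up for odd sizes.
        chroma = ((width + 1) // 2) * ((height + 1) // 2)
    else:
        raise ValueError(f"unsupported subsampling mode: {mode}")
    return luma + 2 * chroma
```

For the 4112×2176 benchmark frame this gives 26,843,136 samples in 4:4:4 mode and 13,421,568 in 4:2:0 mode, so a 4:2:0 denoiser has half as many samples to process.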

NLM denoiser benchmark parameters: blur window 3×3, search window 5×5, strength 500
Bilateral denoiser benchmark parameters: window 3×3, sigmaColor 5, sigmaSpace 500
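
As a reference for what the two sigma parameters control, here is a minimal CPU sketch of a textbook bilateral filter on a single-channel image stored as a list of rows. The parameter names mirror sigmaColor and sigmaSpace above; the Gaussian weight formulas and the clamped borders are assumptions, and the library's CUDA implementation may define them differently.

```python
import math

def bilateral_filter(img, radius=1, sigma_color=5.0, sigma_space=500.0):
    """Textbook bilateral filter; radius=1 corresponds to a 3x3 window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            center = img[y][x]
            acc = norm = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp coordinates so border pixels reuse edge values.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    v = img[yy][xx]
                    # Spatial weight falls off with distance from the centre.
                    w_space = math.exp(-(dx * dx + dy * dy) / (2 * sigma_space ** 2))
                    # Colour weight falls off with intensity difference.
                    w_color = math.exp(-((v - center) ** 2) / (2 * sigma_color ** 2))
                    acc += w_space * w_color * v
                    norm += w_space * w_color
            out[y][x] = acc / norm
    return out
```

With a small sigma_color, pixels across a strong edge receive almost no weight, so edges are preserved while small intensity variations are averaged out. The centre pixel always has weight 1, so the normalisation never divides by zero.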

Software: OS Windows-10, CUDA-12.3
Hardware: NVIDIA GeForce RTX 4090

  • RAW DWT denoiser - 1.8 ms (4.9 GPix/s)
  • DWT denoiser (YCbCr, 4:4:4) - 3.05 ms (2.9 GPix/s)
  • NLM denoiser (RGB) - 1.44 ms (6.2 GPix/s)
  • NLM denoiser (YCbCr, 4:2:0) - 0.93 ms (9.5 GPix/s)
  • NLM denoiser (YCbCr, 4:4:4) - 1.64 ms (5.4 GPix/s)
  • Bilateral denoiser (RGB) - 1.21 ms (7.3 GPix/s)
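
The throughput figures in parentheses follow from the frame size and the kernel time. The helper below is a hypothetical illustration of that arithmetic; the published values are rounded, so small differences are expected.

```python
def throughput_gpix_per_s(width: int, height: int, time_ms: float) -> float:
    """Gigapixels processed per second for one frame of the given size."""
    pixels = width * height
    return pixels / (time_ms * 1e-3) / 1e9

def frames_per_s(time_ms: float) -> float:
    """Frames per second if the kernel time were the only cost."""
    return 1000.0 / time_ms
```

For example, 1.8 ms for the 4112×2176 frame corresponds to about 4.97 GPix/s, in line with the 4.9 GPix/s listed for the RAW DWT denoiser. Because the timings cover CUDA computations only, the frame rate derived this way is an upper bound for a full pipeline.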

These timings are comparable to the processing time of our best MG debayer algorithm, which takes around 1.05 ms (8.5 GPix/s) for the same image on the same GPU. Previously, our denoisers were much slower than demosaicing algorithms.

We designed this software as part of our GPU Image & Video Processing SDK. Our customers can now use these CUDA-accelerated denoisers in their applications as part of their image processing pipeline.


To test our CUDA denoiser filters, please download the Fast VCR software, which works not only with machine vision cameras in real time but also with RAW images stored on SSD. This lets you evaluate image quality and performance on real data.

CUDA denoiser roadmap

  • Acceleration of Bilateral denoiser - in progress
  • Temporal denoiser filter on CUDA - in progress
