Image & Video Processing SDK for Tegra X1 GPU

NVIDIA Tegra X1 is the latest mobile processor with the high-performance Maxwell GPU architecture, and it brings very high CPU and GPU performance to mobile computing. PC-class imaging applications that require high performance, low energy consumption and large amounts of memory can now be developed for mobile devices with Tegra X1. We have ported our high-performance image and video processing SDK to Tegra, and we can now offer very fast solutions for realtime imaging and video applications on the NVIDIA Tegra X1 GPU.


Tegra X1 benchmarks for image and video processing on mobile GPU

We measured kernel times for the most important components of the Fastvideo GPU image and video processing SDK. We used images with 2K and 4K resolutions and obtained the averaged results listed below. These are only a small subset of the features available in the SDK.

Tegra X1 performance for images with 2K resolution (1920×1080)

  • HQLI Debayer (8-bit, RGGB) – 0.7 ms
  • HQLI Debayer (16-bit, RGGB) – 1.0 ms
  • DFPD Debayer (8-bit, RGGB) – 2.7 ms
  • DFPD Debayer (16-bit, RGGB) – 2.5 ms
  • MG Debayer (16-bit, RGGB) – 7.8 ms
  • JPEG encoding (8-bit, quality 90%) – 1.5 ms
  • JPEG encoding (24-bit, quality 90%, 4:2:0) – 2.4 ms
  • JPEG encoding (24-bit, quality 90%, 4:4:4) – 3.7 ms
  • Combined HQLI Debayer + JPEG compression (quality 90%, 4:2:0) – 3.3 ms
  • Combined DFPD Debayer + JPEG compression (quality 90%, 4:4:4) – 6.3 ms
  • JPEG decoding (8-bit, quality 90%) – 3.1 ms
  • JPEG decoding (24-bit, quality 90%, 4:2:0) – 6.2 ms
  • JPEG decoding (24-bit, quality 90%, 4:4:4) – 8.3 ms
  • 1D LUT (8-bit, for each color component) – 0.8 ms
  • 24-bit image sharpening (sigma=0.4) – 0.9 ms
  • 24-bit image crop from 1920×1080 to 960×540 – 0.24 ms
  • 24-bit image crop from 1920×1080 to 1919×1079 – 0.8 ms
  • 24-bit image resize (algorithm Lanczos3) from 1920×1080 to 960×540 – 5.6 ms
  • 24-bit image resize (algorithm Lanczos3) from 1920×1080 to 1919×1079 – 10.6 ms
  • Denoiser (8-bit, wavelet 5/3, 7 dwt resolutions) – 4.9 ms
  • Denoiser (24-bit, wavelet 5/3, 7 dwt resolutions) – 13.5 ms
  • Denoiser (8-bit, wavelet 9/7, 7 dwt resolutions) – 5.5 ms
  • Denoiser (24-bit, wavelet 9/7, 7 dwt resolutions) – 15.2 ms
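
To compare kernel times across resolutions, it can help to convert them into throughput. The following is a minimal Python sketch that does this for three timings taken from the 2K list above. The function name `throughput_mpix_per_s` is our own illustration and is not part of the SDK.

```python
def throughput_mpix_per_s(width, height, kernel_ms):
    """Megapixels processed per second, given one kernel time in ms."""
    pixels = width * height
    seconds = kernel_ms * 1e-3
    return pixels / seconds / 1e6


# Kernel times in ms, copied from the 2K (1920x1080) list above.
timings_2k_ms = {
    "HQLI Debayer (8-bit, RGGB)": 0.7,
    "JPEG encoding (24-bit, quality 90%, 4:2:0)": 2.4,
    "Resize Lanczos3 to 960x540": 5.6,
}

for name, ms in timings_2k_ms.items():
    mpix = throughput_mpix_per_s(1920, 1080, ms)
    print(f"{name}: {mpix:.0f} MPix/s")
```

Throughput here is counted in input pixels per second, so it says nothing about the output size of a resize or crop.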

Tegra X1 performance for images with 4K resolution (3840×2160)

  • HQLI Debayer (8-bit, RGGB) – 2.7 ms
  • HQLI Debayer (16-bit, RGGB) – 3.9 ms
  • DFPD Debayer (8-bit, RGGB) – 11.1 ms
  • DFPD Debayer (16-bit, RGGB) – 9.6 ms
  • MG Debayer (16-bit, RGGB) – 32.0 ms
  • JPEG encoding (8-bit, quality 90%) – 6.3 ms
  • JPEG encoding (24-bit, quality 90%, 4:2:0) – 9.4 ms
  • JPEG encoding (24-bit, quality 90%, 4:4:4) – 14.9 ms
  • Combined HQLI Debayer + JPEG compression (quality 90%, 4:2:0) – 11.8 ms
  • Combined DFPD Debayer + JPEG compression (quality 90%, 4:4:4) – 26 ms
  • JPEG decoding (8-bit, quality 90%) – 11.9 ms
  • JPEG decoding (24-bit, quality 90%, 4:2:0) – 27.1 ms
  • JPEG decoding (24-bit, quality 90%, 4:4:4) – 32.6 ms
  • 1D LUT (8-bit, for each color component) – 2.7 ms
  • 24-bit image sharpening (sigma=0.4) – 3.3 ms
  • 24-bit image crop from 3840×2160 to 1920×1080 – 0.8 ms
  • 24-bit image crop from 3840×2160 to 3839×2159 – 2.9 ms
  • 24-bit image resize (algorithm Lanczos3) from 3840×2160 to 1920×1080 – 20.4 ms
  • 24-bit image resize (algorithm Lanczos3) from 3840×2160 to 3839×2159 – 39.5 ms
  • Denoiser (8-bit, wavelet 5/3, 7 dwt resolutions) – 20 ms
  • Denoiser (24-bit, wavelet 5/3, 7 dwt resolutions) – 55 ms
  • Denoiser (8-bit, wavelet 9/7, 7 dwt resolutions) – 21 ms
  • Denoiser (24-bit, wavelet 9/7, 7 dwt resolutions) – 59 ms
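
As a rough way to use these numbers, the kernel times of the individual stages can be added up to estimate a per-frame budget for a pipeline. The sketch below is in Python and assumes a hypothetical 4K pipeline (debayer, LUT, sharpening, JPEG encoding) whose stages run one after another without overlap. The stage selection and the helper `pipeline_budget` are illustrative assumptions, not a configuration taken from the SDK.

```python
# Kernel times in ms, copied from the 4K (3840x2160) list above.
stage_ms = {
    "HQLI Debayer (8-bit, RGGB)": 2.7,
    "1D LUT (8-bit, for each color component)": 2.7,
    "Sharpening (24-bit, sigma=0.4)": 3.3,
    "JPEG encoding (24-bit, quality 90%, 4:2:0)": 9.4,
}


def pipeline_budget(stages):
    """Return (total time in ms, frames per second) for sequential stages."""
    total_ms = sum(stages.values())
    fps = 1000.0 / total_ms
    return total_ms, fps


total_ms, fps = pipeline_budget(stage_ms)
print(f"total: {total_ms:.1f} ms per frame, about {fps:.0f} fps")
```

The list above gives some support for adding the times up: HQLI Debayer plus 4:2:0 JPEG encoding sums to 12.1 ms from the separate entries, while the combined measurement is 11.8 ms. Host-to-device and device-to-host transfers are not included in this estimate.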

Tegra X1 performance for 12-bit per pixel images with resolution 4032×2192

  • JPEG encoding (gray, 12-bit, quality 90%) – 11.2 ms
  • JPEG encoding (color, 12-bit, quality 90%, 4:2:0) – 17.9 ms
  • JPEG encoding (color, 12-bit, quality 90%, 4:4:4) – 28.6 ms

Tegra X1 performance for 5120×3840, 12-bit (CMOSIS CMV20000 image sensor)

  • DFPD Debayer (16-bit, RGGB) – 25 ms
  • JPEG encoding (12-bit, quality 90%, 4:2:0) – 27 ms

None of the Tegra X1 benchmarks above include host-to-device or device-to-host transfer times. In general, this is the right approach for evaluating the total time of fairly complicated image processing pipelines on Tegra X1. Still, there is the question of how much delay CPU/GPU communication adds. According to our measurements, data exchange between CPU and GPU on Tegra X1 runs at about 10 GB/s, so the transfer time will be no more than a couple of milliseconds.
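
The transfer estimate can be checked with simple arithmetic: frame size in bytes divided by bandwidth. The Python sketch below uses the ~10 GB/s figure quoted above and assumes 1 GB = 10^9 bytes and tightly packed pixels with no row padding. The function name `transfer_ms` is our own.

```python
def transfer_ms(width, height, bytes_per_pixel, bandwidth_gb_s=10.0):
    """One-way transfer time in ms for a tightly packed frame.

    Assumes 1 GB = 1e9 bytes and no row padding.
    """
    size_bytes = width * height * bytes_per_pixel
    return size_bytes / (bandwidth_gb_s * 1e9) * 1e3


frames = [
    ("2K, 24-bit RGB", 1920, 1080, 3),
    ("4K, 8-bit raw", 3840, 2160, 1),
    ("4K, 24-bit RGB", 3840, 2160, 3),
]

for name, w, h, bpp in frames:
    print(f"{name}: {transfer_ms(w, h, bpp):.2f} ms")
```

Under these assumptions the one-way transfer takes roughly 0.6 ms for a 24-bit 2K frame, 0.8 ms for an 8-bit 4K raw frame and 2.5 ms for a 24-bit 4K frame.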
