Jetson Nano Benchmarks on Fastvideo SDK

Author: Fyodor Serzhenko

Embedded imaging applications can definitely benefit from the latest release of NVIDIA Jetson Nano hardware. NVIDIA Jetson Nano is a small, powerful computer with an embedded GPU that lets you run multiple neural networks in parallel for applications like image classification, object detection, segmentation, and speech processing.

We've tested the Image & Video Processing SDK from Fastvideo on the NVIDIA Jetson Nano Developer Kit, and here we present benchmark results for the software modules that are specific to camera applications.

 

Fig.1. Jetson Nano Module

NVIDIA Jetson Nano hardware: Quad Core, 4GB RAM, GPU

  • 128-core Maxwell GPU (for display and compute)
  • Quad-core ARM A57 @ 1.43 GHz (main CPU)
  • 4 GB LPDDR4 (rated at 25.6 GB/s)
  • Gigabit Ethernet
  • 4x USB 3.0, USB 2.0 Micro-B (the Micro USB port can be used both for 5V power input and for data)
  • HDMI 2.0 & eDP 1.4 (4K monitor support, HDMI or DisplayPort)
  • Support for MIPI CSI-2 and PCIe Gen2 high-speed I/O
  • DC barrel jack for 5V power input
  • microSD storage
  • Dimensions: 100 mm × 80 mm × 29 mm (carrier board is included)

It's interesting to note that, according to the CUDA Device Query application, the tested Jetson Nano module reports its name as "NVIDIA Tegra X1" with CUDA Capability 5.3. So it resembles the Jetson TX1, but with half the CUDA cores.

Video Encoding and Decoding Options

  • Video Encode 4K @ 30 fps, 4x for 1080p @ 30 fps, 9x for 720p @ 30 fps (H.264/H.265)
  • Video Decode 4K @ 60 fps, 2x for 4K @ 30 fps, 8x for 1080p @ 30 fps, 18x for 720p @ 30 fps (H.264/H.265)
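
The stream counts above follow from pixel throughput: the encoder and decoder each handle a fixed number of pixels per second, which can be split across several lower-resolution streams. A minimal sketch of that arithmetic is below; the function and variable names are ours, and 720p is assumed to mean 1280×720.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a given resolution and frame rate."""
    return width * height * fps

uhd_30 = pixel_rate(3840, 2160, 30)   # encoder limit: 4K @ 30 fps
uhd_60 = pixel_rate(3840, 2160, 60)   # decoder limit: 4K @ 60 fps
fhd_30 = pixel_rate(1920, 1080, 30)
hd_30 = pixel_rate(1280, 720, 30)     # assumption: 720p is 1280x720

encode_streams_1080p = uhd_30 // fhd_30
encode_streams_720p = uhd_30 // hd_30
decode_streams_4k30 = uhd_60 // uhd_30
decode_streams_1080p = uhd_60 // fhd_30
decode_streams_720p = uhd_60 // hd_30

print(encode_streams_1080p, encode_streams_720p)                        # 4 9
print(decode_streams_4k30, decode_streams_1080p, decode_streams_720p)  # 2 8 18
```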

 

Fig.2. Jetson Nano Developer Kit

Hardware and software for benchmarking

  • CPU/GPU NVIDIA Jetson Nano Developer Kit
  • OS L4T (Ubuntu 18.04)
  • JetPack 4.2 with CUDA Toolkit 10.2
  • Fastvideo SDK 0.17.1

Jetson Nano Power Consumption and Power Management

The Jetson Nano uses Dynamic Voltage and Frequency Scaling (DVFS). This power management technique is used in most modern computer hardware to maximize power savings: the voltage and clock frequency of a component are raised or lowered depending on operating conditions.

The Jetson Nano Developer Kit is configured to accept power via the Micro USB connector. Some Micro USB power supplies are designed to output slightly more than 5V to account for voltage loss across the cable. The critical point is that the Jetson Nano module requires a minimum of 4.75V to operate. It's recommended to use a power supply capable of delivering 5V at the J28 Micro-USB connector.

There are other power supply options for the Jetson Nano. If the total load is expected to exceed 2A, e.g., due to peripherals attached to the carrier board or to demanding computational tasks, you have to lock (jumper) the J48 Power Select pins to disable power supply via Micro USB and enable 5V-4A via the J25 power jack. Another option is to supply 5V-6A via the J41 expansion header (two 5V pins can be used to power the developer kit at 3A each). If more than 5V (for example, 12V) is supplied over J25, the Nano will not work. The Jetson Nano Developer Kit is equipped with a passive heatsink, to which a fan can be mounted.

 

Fig.3. Top View of Jetson Nano Developer Kit

 

In general, total power usage comprises the carrier board, the Jetson Nano module, and peripherals, and it is determined by the particular use case. The carrier board consumes between 0.5W (at 2A) and 1.25W (at 4A) with no peripherals attached.

The Jetson Nano module is designed to optimize power efficiency, and it supports two software-defined power modes. The default mode provides a 10W power budget for the module, and the other a 5W budget. These power modes constrain the module to near its 10W or 5W budget by capping the GPU and CPU frequencies and the number of online CPU cores.

Individual parts of the CORE power domain, such as video encode (V4L2) and video decode (V4L2), are not covered by these budgets. This is why the power modes constrain the Jetson Nano module to near a power budget rather than to an exact one. Your particular use case determines the module’s actual power consumption.

According to our tests with Fastvideo SDK, normal operation of the Jetson Nano Developer Kit in 10W mode requires more power than USB can offer (5V and 2A). A USB-powered Jetson Nano can't work continuously under heavy workload at default clocks (no jetson_clocks applied): it hung 30-60 seconds after the workload began. This seems to be due to the power consumed by the carrier board and peripheral devices. A USB-powered Jetson Nano works perfectly in 5W mode, but with lower performance.

For the Jetson Nano benchmark measurements we used an external power supply with 5V and 4A. This is more than a standard Micro USB power adapter can provide (5V and 2A), but it's necessary to get high performance. As far as we understand, one could get even better performance by supplying more power to the Jetson Nano.

To manage the clock speeds and the amount of power consumed by the NVIDIA Jetson Nano, we use nvpmodel -m0 and jetson_clocks, which gives maximum performance.
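
The reasoning about supplies can be made concrete with a rough power budget. The sketch below uses only the figures quoted in this article (5V at 2A or 4A, the 10W and 5W module budgets, and the carrier board draw of 0.5W to 1.25W). The helper name is ours, and since the power modes hold the module only near its budget, the results are estimates rather than measurements.

```python
def supply_headroom_w(volts, amps, module_budget_w, carrier_w):
    """Power left for peripherals after the module budget and carrier board draw."""
    return volts * amps - module_budget_w - carrier_w

# Micro USB supply (5V, 2A) in 10W mode: nothing is left, the budget is exceeded.
usb_10w = supply_headroom_w(5.0, 2.0, module_budget_w=10.0, carrier_w=0.5)
# Micro USB supply (5V, 2A) in 5W mode.
usb_5w = supply_headroom_w(5.0, 2.0, module_budget_w=5.0, carrier_w=0.5)
# Barrel jack supply (5V, 4A) in 10W mode.
jack_10w = supply_headroom_w(5.0, 4.0, module_budget_w=10.0, carrier_w=1.25)

print(usb_10w, usb_5w, jack_10w)  # -0.5 4.5 8.75
```

A negative headroom for the USB supply in 10W mode is consistent with the hangs observed under heavy workload.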
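
For scripted benchmarking, the two commands can be wrapped in a small helper. This is a minimal sketch, not part of Fastvideo SDK: the helper name and the dry-run behavior are our own, and we assume both tools need root privileges and are on the PATH of the Jetson.

```python
import shutil
import subprocess

# The commands named in the text: select power mode 0, then pin the clocks.
MAX_PERFORMANCE_COMMANDS = [
    ["sudo", "nvpmodel", "-m", "0"],
    ["sudo", "jetson_clocks"],
]

def apply_max_performance(dry_run=True):
    """Return the commands; run them only if dry_run is False and nvpmodel is found."""
    if not dry_run and shutil.which("nvpmodel"):
        for cmd in MAX_PERFORMANCE_COMMANDS:
            subprocess.run(cmd, check=True)
    return MAX_PERFORMANCE_COMMANDS

# Dry run: only show what would be executed.
for cmd in apply_max_performance():
    print(" ".join(cmd))
```

Calling apply_max_performance(dry_run=False) on the device would execute both commands before a benchmark run.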

Jetson Nano Benchmark Performance for Camera Applications

For Jetson Nano we've benchmarked the following image processing kernels, which are conventional for camera applications: white balance, demosaic, color correction, LUT, resize, gamma, JPEG / JPEG2000 / H.264 encoding, etc. This is not the full set of Fastvideo SDK features, just an example of what can be achieved with Jetson Nano.

We've measured the GPU kernel time for each image processing module to understand how fast it runs on Jetson Nano. This makes it possible to estimate the total time for any chosen set of modules from Fastvideo SDK. Since the performance of some modules depends on image content, you can request Fastvideo SDK for NVIDIA Jetson Nano (or for any other NVIDIA GPU) for evaluation and carry out your own testing.

CUDA initialization and GPU memory buffer allocation are not included in the benchmarks. Usually we do that just once, before the measurements, so it doesn't affect GPU performance.

For testing we used a 2K raw image (1920×1080, 8-bit) and a 4K raw image (3840×2160, 8-bit), though all computations were carried out with 16-bit precision. Before JPEG compression we converted the 16-bit data to 8 bits per channel to comply with the JPEG standard. JPEG2000 compression benchmarks were measured for 24-bit images with 4:4:4 subsampling.

In the tables, the rows marked in gray are those included in the simplest image processing pipeline of a camera application for 2K and 4K resolutions. That pipeline consists of Host to Device Transfer, White Balance, HQLI Debayer, Color Correction, Gamma, JPEG compression, and Device to Host Transfer. The last row of each table shows the total GPU kernel time in ms, the performance in MB/s, and the achieved FPS for the pipeline.

Table 1. Jetson Nano performance benchmarks for 2K raw image processing (1920×1080, 8-bit)

| Algorithm and parameters                          | Kernel time, ms | Performance, MB/s | Frames per second |
|---------------------------------------------------|-----------------|-------------------|-------------------|
| White Balance                                     | 0.6             | 6,500             | 1,660             |
| HQLI Debayer                                      | 1.8             | 2,200             | 550               |
| DFPD Debayer                                      | 4.7             | 850               | 212               |
| MG Debayer                                        | 12.7            | 315               | 78                |
| Color Correction with 3×4 matrix                  | 1.7             | 7,000             | 588               |
| Resize from 2K to 960×540                         | 10.0            | 600               | 100               |
| Resize from 2K to 1919×1079                       | 19.8            | 303               | 50                |
| Gamma (1920×1080)                                 | 1.4             | 8,500             | 710               |
| JPEG Encoding (1920×1080, 90%, 4:2:0)             | 4.3             | 1,400             | 230               |
| JPEG Encoding (1920×1080, 90%, 4:4:4)             | 6.8             | 880               | 147               |
| JPEG2000 Encoding (lossy, 32×32, single mode)     | 81              | 74                | 12                |
| JPEG2000 Encoding (lossless, 32×32, single mode)  | 190             | 31                | 5                 |
| Total for camera application                      | 9.8             | 204               | 102               |
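
The total for the 2K pipeline can be reproduced by summing the kernel times of its stages and converting the result to frames per second. In the sketch below, the stage values are taken from Table 1. Using the 4:2:0 JPEG row is our assumption, chosen because it reproduces the published total of 9.8 ms.

```python
# Kernel times in ms from Table 1 for the stages of the simplest camera pipeline.
# Assumption: the JPEG stage is the 4:2:0 row.
pipeline_ms = {
    "White Balance": 0.6,
    "HQLI Debayer": 1.8,
    "Color Correction": 1.7,
    "Gamma": 1.4,
    "JPEG Encoding (4:2:0)": 4.3,
}

def fps_from_ms(kernel_ms):
    """Frames per second for a given time per frame in milliseconds."""
    return 1000.0 / kernel_ms

total_ms = round(sum(pipeline_ms.values()), 1)
fps = int(fps_from_ms(total_ms))

print(total_ms, fps)  # 9.8 102
```

The same fps_from_ms helper applies to any other set of modules chosen from the table.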

 

In a real-life camera application, it is possible to eliminate the Host to Device copy by utilizing Jetson Zero-Copy. In that case, the image from a camera is written via DMA directly to a pinned buffer in system memory. The pinned buffer is accessible to both the CPU and the GPU. Alternatively, the Device to Host copy can be hidden by overlapping data transfer and computations in a multi-threaded application. Jetson Nano can do concurrent copy and kernel execution with 1 copy engine.

We can see that the simplest image processing pipeline for a 2K image on NVIDIA Jetson Nano can reach 100 fps. If we utilize hardware-based H.264 encoding (instead of Fastvideo CUDA-based Motion JPEG encoding) for the same pipeline, we would get lower performance due to the limitations of the H.264 encoder at 2K resolution.
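
The effect of these two options on per-frame time can be illustrated with a simple steady-state model. The sketch below is an approximation under stated assumptions: the kernel time is the 2K pipeline total from Table 1, while the transfer times are hypothetical placeholders, because transfers were not part of the published benchmarks. With a single copy engine, the copies queue behind each other but can run alongside the kernels.

```python
def frame_time_serial(h2d_ms, kernels_ms, d2h_ms):
    """No overlap: every frame pays for both copies and the kernels."""
    return h2d_ms + kernels_ms + d2h_ms

def frame_time_overlapped(h2d_ms, kernels_ms, d2h_ms):
    """One copy engine: copies are serialized but hidden behind kernel execution."""
    return max(kernels_ms, h2d_ms + d2h_ms)

kernels_ms = 9.8            # total for the 2K pipeline, from Table 1
h2d_ms, d2h_ms = 1.0, 0.5   # hypothetical transfer times, not measured

serial = frame_time_serial(h2d_ms, kernels_ms, d2h_ms)
overlapped = frame_time_overlapped(h2d_ms, kernels_ms, d2h_ms)
# Zero-copy removes the Host to Device copy entirely.
zero_copy = frame_time_serial(0.0, kernels_ms, d2h_ms)

print(round(serial, 1), round(zero_copy, 1), round(overlapped, 1))  # 11.3 10.3 9.8
```

Under these placeholder numbers the overlapped pipeline is limited by kernel time alone, which is the case the benchmark totals describe.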

Table 2. Jetson Nano performance benchmarks for 4K raw image processing (3840×2160, 8-bit)

| Algorithm and parameters                          | Kernel time, ms | Performance, MB/s | Frames per second |
|---------------------------------------------------|-----------------|-------------------|-------------------|
| White Balance                                     | 2.2             | 7,200             | 455               |
| HQLI Debayer                                      | 7.1             | 2,250             | 141               |
| DFPD Debayer                                      | 18.2            | 880               | 55                |
| MG Debayer                                        | 50.3            | 318               | 20                |
| Color Correction with 3×4 matrix                  | 6.9             | 7,000             | 145               |
| Resize from 4K to 1920×1080                       | 39.4            | 610               | 25                |
| Resize from 4K to 3839×2159                       | 77.9            | 308               | 12                |
| Gamma (3840×2160)                                 | 5.7             | 8,400             | 175               |
| JPEG Encoding (3840×2160, 90%, 4:2:0)             | 17.1            | 1,400             | 58                |
| JPEG Encoding (3840×2160, 90%, 4:4:4)             | 27.3            | 880               | 36                |
| JPEG2000 Encoding (lossy, 32×32, single mode)     | 309             | 77                | 3                 |
| JPEG2000 Encoding (lossless, 32×32, single mode)  | 620             | 38                | 1.6               |
| Total for camera application                      | 32.1            | 248               | 31                |

 

The same image processing pipeline for a 4K RAW image on NVIDIA Jetson Nano can reach 30 fps. If we utilize hardware-based H.264 encoding (instead of Fastvideo JPEG or MJPEG on GPU), we still get no more than 30 fps, which is the maximum of the H.264 encoder at 4K resolution, but GPU occupancy in that case would be lower.

We can see that Jetson Nano has sufficient performance for image processing in camera applications. For resolutions up to 4K we can convert RAW to RGB with JPEG or H.264 compression in real time.

Camera Streaming with Jetson Nano

Since we can implement a full image processing pipeline on NVIDIA Jetson Nano together with hardware-based H.264 or H.265 encoding, it is an excellent platform for camera streaming from drones, UAVs, etc.

Software and hardware for testing

Processing pipeline on the NVIDIA Jetson GPU: acquisition from the camera -> black level -> white balance -> demosaic -> H.264/H.265 encoding -> RTSP streaming via Ethernet or Wi-Fi.

Desktop GPU station processing: receiving the stream -> render on a screen (we don't use GStreamer at all).

Streaming benchmarks for Jetson Nano

  • 60 FPS
  • Jetson processing pipeline: 11 ms
  • Network: 1 ms average
  • Desktop processing: 6 ms
  • Glass-to-glass (G2G) latency: 70-100 ms (the rest of the time is acquisition, data transfer, and screen response)
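
These figures can be combined into a simple latency budget. The sketch below uses only the numbers from the list above; the variable names are ours. It checks that the Jetson pipeline fits into one frame period at 60 FPS and shows how much of the glass-to-glass latency is left for acquisition, data transfer, and screen response.

```python
frame_period_ms = 1000.0 / 60          # about 16.7 ms per frame at 60 FPS
jetson_ms, network_ms, desktop_ms = 11.0, 1.0, 6.0
g2g_range_ms = (70.0, 100.0)

# Time accounted for by processing and the network.
measured_ms = jetson_ms + network_ms + desktop_ms
# What remains for acquisition, data transfer, and screen response.
remaining_ms = tuple(g2g - measured_ms for g2g in g2g_range_ms)
# The Jetson pipeline must finish within one frame period to sustain 60 FPS.
fits_frame_period = jetson_ms < frame_period_ms

print(measured_ms, remaining_ms, fits_frame_period)  # 18.0 (52.0, 82.0) True
```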

Here we've published just a small part of the Jetson Nano benchmarks that we've obtained with Fastvideo SDK. We suggest testing the SDK with your own image processing pipeline. You can send us a request for an evaluation version of Fastvideo Image Processing SDK for Jetson Nano, TK1, TX1, TX2 or NX/AGX Xavier to carry out testing with your images and your pipeline. Just fill in the Contact Form to get the SDK for your Jetson.
