Jetson Nano Benchmarks on Fastvideo SDK
Embedded imaging applications can definitely benefit from the latest release of NVIDIA Jetson Nano hardware. NVIDIA Jetson Nano is a small, powerful computer with an embedded GPU that lets you run multiple neural networks in parallel for applications such as image classification, object detection, segmentation and speech processing.
We've tested the Fastvideo Image & Video Processing SDK on NVIDIA Jetson Nano, and here we present benchmarking results for the software modules that are specific to camera applications.
Fig.1. Jetson Nano Module
NVIDIA Jetson Nano hardware: quad-core CPU, 4 GB RAM, integrated GPU
According to the CUDA deviceQuery sample, the tested Jetson Nano module reports itself as "NVIDIA Tegra X1" with CUDA Capability 5.3. In that respect it resembles Jetson TX1, but with half as many CUDA cores.
Video Encoding and Decoding Options (NVIDIA NVENC and NVDEC)
Fig.2. Jetson Nano Developer Kit
Hardware and software for benchmarking
Jetson Nano Power Consumption and Power Management
Jetson Nano hardware uses Dynamic Voltage and Frequency Scaling (DVFS). This power management technique, used in most modern computer hardware to maximize power savings, raises or lowers a component's supply voltage and clock frequency depending on workload and external conditions.
By default, the Jetson Nano Developer Kit is configured to accept power via the Micro-USB connector (J28). Some Micro-USB power supplies are designed to output slightly more than 5 V to compensate for voltage loss across the cable. This matters because the Jetson Nano module requires at least 4.75 V to operate, so it's recommended to use a power supply capable of delivering 5 V at the J28 Micro-USB connector.
There are other power supply options for Jetson Nano. If the total load is expected to exceed 2 A, e.g., due to peripherals attached to the carrier board or to high-performance computational tasks, you have to place a jumper on the J48 Power Select pins to disable power via Micro-USB and enable 5 V / 4 A input via the J25 power jack. Another option is to supply 5 V / 6 A via the J41 expansion header (two 5 V pins can be used to power the developer kit at 3 A each). The Jetson Nano Developer Kit is equipped with a passive heatsink, to which a fan can be mounted.
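As a rough illustration of why DVFS saves power, the standard first-order model for dynamic (switching) power in CMOS logic is given below. It is a textbook approximation, not a Jetson-specific formula from NVIDIA; here α is the activity factor, C the switched capacitance, V the supply voltage and f the clock frequency.

```latex
P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} f,
\qquad V \propto f \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^{3}
```

Because lower clock frequencies permit lower supply voltages, and voltage enters squared, scaling both together cuts dynamic power much faster than scaling frequency alone.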
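The cable-loss concern can be made concrete with Ohm's law: the voltage seen at J28 is the supply voltage minus the current times the cable resistance. The sketch below is a hypothetical helper, and the supply voltages and cable resistances used in it are illustrative assumptions rather than measured values; only the 4.75 V minimum comes from the text.

```python
"""Estimate the voltage reaching the Jetson Nano after cable loss.

Hypothetical sketch: supply voltages and cable resistances passed in are
illustrative assumptions, not measured values.
"""

MIN_MODULE_VOLTAGE = 4.75  # minimum input voltage for the Jetson Nano module (V)


def voltage_at_board(supply_v: float, current_a: float, cable_ohm: float) -> float:
    """Voltage at the J28 connector after the I*R drop in the cable.

    cable_ohm is the total round-trip resistance (both conductors).
    """
    return supply_v - current_a * cable_ohm


def is_supply_adequate(supply_v: float, current_a: float, cable_ohm: float) -> bool:
    """True if the board still sees at least the module's minimum voltage."""
    return voltage_at_board(supply_v, current_a, cable_ohm) >= MIN_MODULE_VOLTAGE


if __name__ == "__main__":
    # Hypothetical: a plain 5.0 V adapter vs. one that outputs 5.25 V,
    # both driving 2 A through a 0.2 ohm round-trip cable.
    for supply in (5.0, 5.25):
        v = voltage_at_board(supply, 2.0, 0.2)
        print(f"{supply} V supply -> {v:.2f} V at board, "
              f"adequate: {is_supply_adequate(supply, 2.0, 0.2)}")
```

This is why slightly over-voltage supplies exist: the extra headroom absorbs the drop that a thin or long cable causes under a 2 A load.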
Fig.3. Top View of Jetson Nano Developer Kit
In general, total power usage is the sum of the carrier board, the Jetson Nano module and the peripherals, and it depends on the particular use case. With no peripherals attached, the carrier board consumes between 0.5 W (at 2 A) and 1.25 W (at 4 A).
The Jetson Nano module is designed for power efficiency and supports two software-defined power modes: the default mode provides a 10 W power budget for the module, and the other a 5 W budget. These modes keep the module near its 10 W or 5 W budget by capping the GPU and CPU frequencies and the number of online CPU cores.
Individual parts of the CORE power domain, such as video encode (NVENC) and video decode (NVDEC), are not covered by these budgets. That is why the power modes keep the Jetson Nano module near its power budget rather than exactly at it. Your particular use case determines the module's actual power consumption.
In our tests with Fastvideo SDK, normal operation of the Jetson Nano Developer Kit in 10 W mode required more power than USB can supply (5 V at 2 A). A USB-powered Jetson Nano could not run continuously under heavy workload even at default clocks (without jetson_clocks applied): it hung 30 to 60 seconds after the workload began. This seems to be caused by the additional power drawn by the carrier board and peripheral devices. A USB-powered Jetson Nano works perfectly in 5 W mode, but with lower performance.
For the Jetson Nano benchmark measurements we used an external 5 V, 4 A power supply. This is more than a standard Micro-USB power adapter (5 V, 2 A) can provide, but it's necessary to get high performance. As we understand, one could get even better performance by supplying more power to the Jetson Nano.
To get maximum performance, we select the 10 W power mode with nvpmodel -m0 and then run jetson_clocks to lock the clocks at their maximum frequencies for that mode.
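The USB failure in 10 W mode follows from simple arithmetic: P = V × I gives 10 W for Micro-USB, which is already consumed by the module budget before the carrier board is counted. The sketch below compares the supply options from the text against a lower-bound load estimate; the helper names are hypothetical, and the peripheral load is an assumed input.

```python
"""Rough power-budget check for the Jetson Nano Developer Kit supply options.

Sketch under stated assumptions: supply ratings and the carrier-board figure
come from the text; the peripheral load is a hypothetical input.
"""

# Supply options from the text: name -> (volts, amps)
SUPPLY_OPTIONS = {
    "Micro-USB (J28)": (5.0, 2.0),
    "Power jack (J25)": (5.0, 4.0),
    "Expansion header (J41)": (5.0, 6.0),
}

CARRIER_BOARD_W = 0.5  # lower end of the quoted carrier-board consumption (W)


def available_watts(volts: float, amps: float) -> float:
    """Maximum power the supply can deliver, P = V * I."""
    return volts * amps


def required_watts(module_budget_w: float, peripherals_w: float = 0.0) -> float:
    """Lower bound on total draw: module budget + carrier board + peripherals.

    NVENC/NVDEC and other CORE-domain blocks are outside the module budget,
    so the real draw can be higher than this estimate.
    """
    return module_budget_w + CARRIER_BOARD_W + peripherals_w


def report(module_budget_w: float, peripherals_w: float = 0.0) -> dict:
    """Map each supply option to True if it covers the estimated load."""
    need = required_watts(module_budget_w, peripherals_w)
    return {name: available_watts(v, a) >= need
            for name, (v, a) in SUPPLY_OPTIONS.items()}


if __name__ == "__main__":
    for mode_w in (5.0, 10.0):
        print(f"{mode_w:.0f} W mode:", report(mode_w))
```

Even this optimistic estimate puts the 10 W mode above what Micro-USB can deliver, while the 5 W mode fits, which matches the behavior observed in our tests.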
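These two steps can be scripted. The sketch below assumes a JetPack image where nvpmodel and jetson_clocks are on the PATH and sudo is available; the wrapper functions are hypothetical helpers, and by default they only return the commands instead of running them.

```python
"""Put a Jetson Nano into its maximum-performance configuration.

Minimal sketch wrapping the nvpmodel and jetson_clocks tools that ship with
JetPack. Running the commands requires root on the device; dry_run=True
only returns them for inspection.
"""
import subprocess
from typing import List

# On Jetson Nano, power mode 0 is the 10 W (MAXN) mode and mode 1 is 5 W.
MAX_PERF_MODE = 0


def max_performance_commands(mode: int = MAX_PERF_MODE) -> List[List[str]]:
    """Commands to select a power mode, lock the clocks and confirm the mode."""
    return [
        ["sudo", "nvpmodel", "-m", str(mode)],  # select the power mode
        ["sudo", "jetson_clocks"],              # pin clocks at the mode's maximum
        ["sudo", "nvpmodel", "-q"],             # query and print the active mode
    ]


def apply_max_performance(dry_run: bool = True) -> List[List[str]]:
    """Run the commands in order, or just return them when dry_run is True."""
    cmds = max_performance_commands()
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)  # stop at the first failing command
    return cmds


if __name__ == "__main__":
    for cmd in apply_max_performance(dry_run=True):
        print(" ".join(cmd))
```

The order matters: jetson_clocks locks clocks to the maximum allowed by the currently selected power mode, so the mode is set first.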
Jetson Nano Benchmark Performance for Camera Applications
On Jetson Nano we've benchmarked the following image processing kernels, which are conventional for camera applications: white balance, demosaic, color correction, LUT, resize, gamma, JPEG / JPEG2000 / H.264 encoding, etc. This is not the full set of Fastvideo SDK features, just an example of what can be achieved with Jetson Nano.
We've measured the GPU kernel time of each image processing module to understand how fast it runs on Jetson Nano. Summing these times lets you estimate the total time for any chosen set of Fastvideo SDK modules. Since the performance of some modules depends on image content, you can request an evaluation version of Fastvideo SDK for NVIDIA Jetson Nano (or for any other NVIDIA GPU) and carry out your own testing.
CUDA initialization and GPU memory buffer allocations are not included in the benchmarks. These are normally done just once, before processing starts, so they don't affect per-frame GPU performance.
For testing we used a 2K raw image (1920×1080, 8-bit) and a 4K raw image (3840×2160, 8-bit), although all computations were carried out with 16-bit precision. Before JPEG compression we converted the 16-bit data to 8 bits per channel to comply with the JPEG standard. JPEG2000 compression benchmarks were measured for 24-bit images with 4:4:4 subsampling.
In the tables, rows highlighted in gray belong to the simplest image processing pipeline of a camera application at 2K and 4K resolutions. That pipeline consists of Host to Device Transfer, White Balance, HQLI Debayer, Color Correction, Gamma, JPEG compression and Device to Host Transfer. The last row of each table shows the total GPU kernel time in ms, the performance in MB/s and the achieved FPS for that pipeline.
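The measurement approach of keeping one-time setup out of the timed region can be sketched as a small harness. This version uses host wall-clock time and adds a few warm-up calls, which is a common practice assumed here rather than something stated in the text. Timing actual GPU kernels would instead use CUDA events (cudaEventRecord / cudaEventElapsedTime), or at least a device synchronization before reading the host timer.

```python
"""Timing harness that keeps one-time setup out of the measurement.

Sketch of the methodology using host wall-clock time. The setup/run
callables are placeholders for real initialization and per-frame work.
"""
import time
from statistics import median
from typing import Callable, List


def benchmark(setup: Callable[[], object],
              run: Callable[[object], None],
              iterations: int = 100,
              warmup: int = 5) -> float:
    """Return the median time per call of run(state), in milliseconds.

    setup() is called once and excluded from timing, like CUDA
    initialization and buffer allocation. Warm-up calls are also excluded.
    """
    state = setup()                      # one-time initialization (not timed)
    for _ in range(warmup):              # warm-up calls (not timed)
        run(state)
    samples: List[float] = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        run(state)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return median(samples)


if __name__ == "__main__":
    # Hypothetical workload: allocate a 2K 8-bit buffer once, touch it per frame.
    ms = benchmark(lambda: bytearray(1920 * 1080),
                   lambda buf: sum(buf[::4096]))
    print(f"median per-frame time: {ms:.3f} ms")
```

The median is used instead of the mean so that occasional scheduling hiccups do not skew the per-frame figure.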
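The bit-depth handling around a 16-bit pipeline can be illustrated on plain sample values. The exact conversions used inside Fastvideo SDK are not described here, so bit replication (8 to 16 bits) and rounded scaling (16 to 8 bits) are assumed, commonly used choices; both map full scale to full scale and round-trip 8-bit input exactly.

```python
"""Bit-depth conversion around a 16-bit processing pipeline.

Sketch only: bit replication and rounded scaling are assumed conversion
methods, not necessarily the ones used inside Fastvideo SDK.
"""
from typing import List


def to_16bit(raw8: List[int]) -> List[int]:
    """Expand 8-bit samples to 16 bits so that 255 maps to 65535 (v * 257)."""
    return [(v << 8) | v for v in raw8]


def to_8bit(data16: List[int]) -> List[int]:
    """Scale 16-bit samples back to 8 bits with rounding (v / 257)."""
    return [(v + 128) // 257 for v in data16]


if __name__ == "__main__":
    row = [0, 1, 128, 255]
    wide = to_16bit(row)
    print(wide, to_8bit(wide))
```

Processing in the wider format and quantizing only once, just before JPEG encoding, avoids accumulating rounding errors across white balance, demosaic, color correction and gamma.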
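The summary row of each table can be derived from per-stage times. The sketch below assumes the stages run back to back, that MB/s is computed against the raw input frame size, and that 1 MB = 10^6 bytes; the stage times in the example are placeholders, not our measured results, and the function name is hypothetical.

```python
"""Summarize a pipeline from per-stage GPU kernel times.

Assumptions: stages run sequentially, throughput refers to the raw input
frame, 1 MB = 10**6 bytes. Example stage times are placeholders.
"""
from typing import Dict, Tuple

# Stage order of the simplest camera pipeline described in the text.
PIPELINE = ["Host to Device", "White Balance", "HQLI Debayer",
            "Color Correction", "Gamma", "JPEG", "Device to Host"]


def pipeline_summary(stage_ms: Dict[str, float],
                     width: int, height: int,
                     bytes_per_pixel: int = 1) -> Tuple[float, float, float]:
    """Return (total_ms, MB/s, fps) for the stages listed in PIPELINE."""
    total_ms = sum(stage_ms[name] for name in PIPELINE)
    frame_bytes = width * height * bytes_per_pixel  # 8-bit raw input
    seconds = total_ms / 1000.0
    return total_ms, frame_bytes / seconds / 1e6, 1.0 / seconds


if __name__ == "__main__":
    placeholder = {name: 1.0 for name in PIPELINE}  # NOT measured data
    total, mbps, fps = pipeline_summary(placeholder, 1920, 1080)
    print(f"{total:.1f} ms, {mbps:.1f} MB/s, {fps:.1f} fps")
```

For example, a total of 10 ms per 2K frame corresponds to 100 fps and, for an 8-bit 1920×1080 input, about 207 MB/s.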
Table 1. Jetson Nano performance benchmarks for 2K raw image processing (1920×1080, 8-bit)
In a real camera application, the Host to Device copy can be eliminated by using zero-copy on Jetson. In that case, the camera image is written via DMA directly into a pinned buffer in system memory, which is accessible to both the CPU and the GPU. Alternatively, the Device to Host copy can be hidden by overlapping data transfers with computations in a multi-threaded application: Jetson Nano supports concurrent copy and kernel execution with one copy engine.
For the simplest image processing pipeline on a 2K image, NVIDIA Jetson Nano reaches 100 fps. If we use hardware H.264 encoding via NVENC (instead of Fastvideo CUDA-based Motion JPEG encoding) in the same pipeline, we could get 120 fps, which is the limit of the H.264 encoder (NVENC) at 2K resolution.
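The benefit of zero-copy and overlap can be estimated with an idealized model: sequentially, a frame costs the sum of both copies and the kernels; pipelined with a single copy engine, both copy directions share that engine, so the steady-state cost is roughly the slower of the copy engine and the compute engine. This is a simplifying assumption, not a measurement, and the stage times below are hypothetical.

```python
"""Idealized per-frame time with and without copy/compute overlap.

Simplified model (an assumption): with one copy engine, Host-to-Device and
Device-to-Host copies serialize on that engine, while kernels run
concurrently with copies of other frames. Zero-copy removes the
Host-to-Device transfer.
"""


def frame_time_ms(h2d_ms: float, kernels_ms: float, d2h_ms: float,
                  overlap: bool = False, zero_copy: bool = False) -> float:
    """Steady-state time per frame in milliseconds."""
    if zero_copy:
        h2d_ms = 0.0  # camera writes via DMA straight into a pinned buffer
    if not overlap:
        return h2d_ms + kernels_ms + d2h_ms  # everything back to back
    # Pipelined: the busier of the copy engine and the compute engine dominates.
    return max(h2d_ms + d2h_ms, kernels_ms)


if __name__ == "__main__":
    # Hypothetical stage times in ms, chosen only to illustrate the model.
    for overlap, zc in [(False, False), (False, True), (True, True)]:
        t = frame_time_ms(3.0, 5.0, 3.0, overlap=overlap, zero_copy=zc)
        print(f"overlap={overlap}, zero_copy={zc}: {t:.1f} ms -> {1000 / t:.0f} fps")
```

Real gains will be smaller than this model suggests, since the CPU, GPU and copy engine all share the same memory bandwidth on Jetson.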
Table 2. Jetson Nano performance benchmarks for 4K raw image processing (3840×2160, 8-bit)
The same pipeline for a 4K raw image on NVIDIA Jetson Nano achieves 30 fps. Using hardware H.264 encoding via NVENC (instead of Fastvideo JPEG or MJPEG on the GPU) still gives no more than 30 fps, which is the maximum for the H.264 encoder (NVENC) at 4K resolution, but GPU occupancy is lower in that case.
These results show that Jetson Nano has sufficient performance for image processing in camera applications: for resolutions up to 4K, it can convert RAW to RGB with JPEG or H.264 compression in real time.
Here we've published just a small part of the Jetson Nano benchmarks we obtained with Fastvideo SDK, and we suggest testing the SDK with your own image processing pipeline. You can request an evaluation version of Fastvideo Image Processing SDK for Jetson Nano, TK1, TX1, TX2 or AGX Xavier to test it with your images and your pipeline. Just fill in the Contact Form below to get the SDK for your Jetson.
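When NVENC replaces GPU-based JPEG encoding, the sustained frame rate is bounded by the slowest stage, so the result is the lower of the GPU-side rate and the encoder cap. The caps below are the H.264 NVENC limits quoted in the text (120 fps at 2K, 30 fps at 4K); the GPU-side rates in the example and the helper names are hypothetical.

```python
"""Effective frame rate when a hardware encoder ends the pipeline.

Sketch: the sustained rate is bounded by the slowest stage. Encoder caps
are the NVENC H.264 limits quoted in the text; GPU-side rates are
placeholders.
"""

NVENC_H264_CAP_FPS = {"2K": 120.0, "4K": 30.0}


def effective_fps(gpu_pipeline_fps: float, resolution: str) -> float:
    """Frame rate limited by whichever is slower: GPU processing or NVENC."""
    return min(gpu_pipeline_fps, NVENC_H264_CAP_FPS[resolution])


def frame_budget_ms(fps: float) -> float:
    """Time available per frame to sustain the given rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Placeholder GPU-side rates, for illustration only.
    for res, gpu_fps in [("2K", 150.0), ("4K", 45.0)]:
        fps = effective_fps(gpu_fps, res)
        print(f"{res}: {fps:.0f} fps, {frame_budget_ms(fps):.1f} ms per frame")
```

Once the encoder is the bottleneck, any GPU time left within the per-frame budget (about 33.3 ms at 30 fps) is free for other work, which is why GPU occupancy drops with NVENC at 4K.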