Benchmark comparison for Jetson TX2, Xavier NX/AGX, AGX Orin

Author: Fyodor Serzhenko

NVIDIA has released a series of Jetson hardware modules for embedded applications. NVIDIA® Jetson is the world's leading embedded platform for image processing and DL/AI tasks. Its high-performance, low-power computing for deep learning and computer vision makes it the ideal platform for mobile compute-intensive projects.

We've developed an Image & Video Processing SDK for NVIDIA Jetson hardware. Here we present performance benchmarks for the available Jetson modules, using the image processing pipeline of a basic camera application as a representative workload.

Hardware features for Jetson TX2, Xavier NX/AGX, AGX Orin

Here is a brief comparison of Jetson hardware features, which shows the progress and variety of NVIDIA's mobile solutions. These modules are aimed at different markets and tasks.

Table 1. Hardware comparison for Jetson modules

Hardware feature \ Jetson module	Jetson TX2/TX2i	Jetson Xavier NX	Jetson AGX Xavier	Jetson AGX Orin
CPU (ARM)	4-core ARM Cortex-A57 @ 2 GHz, 2-core Denver2 @ 2 GHz	6-core ARM Carmel v8.2	8-core ARM Carmel v8.2 @ 2.26 GHz	12-core ARM Cortex-A78AE
GPU	256-core Pascal @ 1.3 GHz	384-core Volta	512-core Volta @ 1.37 GHz	2048-core Ampere
Memory	8 GB 128-bit LPDDR4, 58.3 GB/s	16 GB 128-bit LPDDR4, 51.2 GB/s	16 GB 256-bit LPDDR4, 137 GB/s	64 GB 256-bit LPDDR5, 205 GB/s
Storage	32 GB eMMC	16 GB eMMC	32 GB eMMC	64 GB eMMC
Tensor cores	--	48	64	64
Video encoding	1x 4K60 (H.265); 3x 4K30 (H.265); 4x 1080p60 (H.265)	2x 4K30 (H.265); 6x 1080p60 (H.265)	4x 4K60 (H.265); 16x 1080p60 (H.265); 32x 1080p30 (H.265)	2x 4K60, 4x 4K30, 8x 1080p60, 16x 1080p30 (H.265); H.264, AV1
Video decoding	2x 4K60 (H.265); 7x 1080p60 (H.265); 14x 1080p30 (H.265)	2x 4K60 (H.265); 12x 1080p60 (H.265); 16x 1080p30 (H.265)	2x 8K30 (H.265); 6x 4K60 (H.265); 26x 1080p60 (H.265); 72x 1080p30 (H.265)	1x 8K30, 3x 4K60, 7x 4K30, 11x 1080p60, 22x 1080p30 (H.265); H.264, VP9, AV1
PCI-Express lanes	5 lanes PCIe Gen 2	1 x1 (PCIe Gen 3) + 1 x4 (PCIe Gen 4)	16 lanes PCIe Gen 4	Up to 2x8, 1x4, 2x1 PCIe Gen 4
Power	7.5W / 15W	10W / 15W	10W / 15W / 30W	15W - 60W

In camera applications, we can usually hide Host-to-Device transfers by implementing GPU Zero Copy or by overlapping GPU copy/compute. Device-to-Host transfers can be hidden via copy/compute overlap.
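
A toy timing model can illustrate why overlapping copies with compute hides the transfers. This is only a sketch under simplifying assumptions: the per-frame stage times are hypothetical placeholders rather than measurements, the function names are made up for illustration, and the model assumes the three stages can run concurrently on different frames without slowing each other down.

```python
def serial_frame_time(h2d_ms, compute_ms, d2h_ms):
    """Per-frame time when upload, processing and download run back to back."""
    return h2d_ms + compute_ms + d2h_ms


def overlapped_frame_time(h2d_ms, compute_ms, d2h_ms):
    """Steady-state per-frame time when the three stages work concurrently
    on different frames: the slowest stage sets the pace."""
    return max(h2d_ms, compute_ms, d2h_ms)


# Hypothetical per-frame stage times in milliseconds (placeholders, not measured).
h2d_ms, compute_ms, d2h_ms = 0.5, 1.5, 0.25

serial = serial_frame_time(h2d_ms, compute_ms, d2h_ms)
overlapped = overlapped_frame_time(h2d_ms, compute_ms, d2h_ms)

print(f"serial:     {serial:.2f} ms per frame")
print(f"overlapped: {overlapped:.2f} ms per frame")
```

In this idealized model the transfers cost nothing as long as compute is the slowest stage. Zero copy would correspond to dropping the Host-to-Device stage altogether.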

Hardware and software for benchmarking

CPU/GPU: NVIDIA Jetson TX2, Xavier NX/AGX, AGX Orin
OS: L4T (Ubuntu 18.04)
CUDA Toolkit 10.2 for Jetson TX2, Xavier NX and AGX Xavier
CUDA Toolkit 11.4 for Jetson AGX Orin
Fastvideo SDK 0.18.2

NVIDIA Jetson Comparison: TX2 vs Xavier NX vs AGX Xavier vs AGX Orin

For these NVIDIA Jetson modules, we've benchmarked the following standard image processing tasks, which are typical of camera applications: white balance, demosaic (debayer), color correction, resize, JPEG encoding, etc. This is not the full set of Fastvideo SDK features, just an example of the performance you can expect from each Jetson. You can also choose a particular debayer algorithm and output compression (JPEG or JPEG2000) for your pipeline.

Table 2. GPU kernel times for 2K image processing (1920×1080, 16 bits per channel, milliseconds)

Algorithm and parameters / Jetson model	Jetson TX2/TX2i	Jetson Xavier NX	Jetson AGX Xavier	Jetson AGX Orin
White Balance	0.24	0.19	0.08	0.06
L7 Debayer (window 7×7)	0.87	0.61	0.40	0.30
DFPD Debayer (window 11×11)	2.06	1.08	0.95	0.70
MG Debayer (window 23×23)	5.9	2.73	2.2	1.7
Color Correction with 3×4 matrix	0.81	0.55	0.25	0.20
Resize from 2K to 960×540	4.3	2.21	1.5	1.0
Resize from 2K to 1919×1079	8.2	4.34	2.4	1.8
Gamma (1920×1080)	0.84	0.42	0.2	0.1
JPEG compression (1920×1080, 90%, 4:2:0)	1.7	1.09	0.62	0.4
JPEG compression (1920×1080, 90%, 4:4:4)	2.6	1.5	0.75	0.6
Total for simple camera pipeline (ms)	4.8	2.85	1.53	1.06

Total kernel times are calculated from the values in the gray rows of the table, which form a set of image processing modules typical of a camera application. These are estimates of GPU processing time, derived from optimized runs of each image processing module from Fastvideo SDK. Each module was run in multithreaded mode to get these benchmarks. You can download the SDK and check the performance of each module with the optimization parameters "-thread 4 -async". If you can process your data in more than 4 threads, you may get better performance thanks to a higher GPU load.

In general, performance depends on image size: bigger images give better performance until it saturates at a level that depends on the Jetson hardware.
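
The totals in Table 2 can be cross-checked with a few lines of Python. Which rows are highlighted in gray cannot be seen in plain text, so the stage selection below (White Balance, L7 Debayer, Color Correction, Gamma, JPEG 4:2:0) is an assumption. It reproduces the published totals to within 0.02 ms for Xavier NX, AGX Xavier and AGX Orin, while for TX2 the same rows sum to 4.46 ms rather than the published 4.8 ms.

```python
# Kernel times in ms copied from Table 2.
# Column order: TX2, Xavier NX, AGX Xavier, AGX Orin.
MODULES = ("TX2", "Xavier NX", "AGX Xavier", "AGX Orin")

# Assumed set of highlighted rows (not confirmed by the source).
KERNEL_MS = {
    "White Balance": (0.24, 0.19, 0.08, 0.06),
    "L7 Debayer": (0.87, 0.61, 0.40, 0.30),
    "Color Correction": (0.81, 0.55, 0.25, 0.20),
    "Gamma": (0.84, 0.42, 0.2, 0.1),
    "JPEG 4:2:0": (1.7, 1.09, 0.62, 0.4),
}

# "Total for simple camera pipeline" row of Table 2.
PUBLISHED_TOTAL_MS = (4.8, 2.85, 1.53, 1.06)


def pipeline_totals(stages):
    """Sum the kernel times of the given stages for every Jetson module."""
    return tuple(
        round(sum(KERNEL_MS[stage][i] for stage in stages), 2)
        for i in range(len(MODULES))
    )


totals = pipeline_totals(KERNEL_MS)
for name, total, published in zip(MODULES, totals, PUBLISHED_TOTAL_MS):
    print(f"{name:<11} summed {total:.2f} ms, published {published:.2f} ms")
```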

Each Jetson module was run in its maximum-performance mode:

MAX-N mode for Jetson AGX Xavier and AGX Orin
15W for Jetson Xavier NX and Jetson TX2

Here we've compared just a basic set of image processing modules from Fastvideo SDK so that Jetson developers can evaluate the expected performance before building their imaging applications. Image processing from RAW to RGB or from RAW to JPEG is a standard task, and developers can now look up the expected performance of their chosen pipeline in the table above. We haven't tested the Jetson H.264 and H.265 encoders and decoders in that pipeline. Since the H.264 and H.265 encoders are implemented in hardware, encoding can run in parallel with CUDA code, so we should be able to get even better performance.
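
As a rough illustration of how the table can be used for a chosen pipeline, the sketch below adds up the AGX Orin kernel times from Table 2 for a variant with the DFPD debayer and 4:4:4 JPEG, then converts the sum into a frame rate. The function name is hypothetical, and the result should be read as an optimistic upper bound: it assumes the kernel times simply add up and ignores transfers, camera I/O and any other overhead.

```python
FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080  # frame size used in Table 2


def max_fps(stage_times_ms):
    """Upper bound on frames per second if only the listed kernels cost time."""
    return 1000.0 / sum(stage_times_ms)


# AGX Orin column of Table 2, kernel times in ms.
orin_pipeline_ms = {
    "White Balance": 0.06,
    "DFPD Debayer": 0.70,
    "Color Correction": 0.20,
    "Gamma": 0.1,
    "JPEG 4:4:4": 0.6,
}

total_ms = sum(orin_pipeline_ms.values())
fps = max_fps(orin_pipeline_ms.values())
megapixels_per_s = fps * FRAME_WIDTH * FRAME_HEIGHT / 1e6

print(f"total kernel time: {total_ms:.2f} ms")
print(f"upper bound: {fps:.0f} fps, {megapixels_per_s:.0f} MPix/s")
```

Swapping in another column of the table, or another debayer or JPEG row, gives the corresponding estimate for a different module or pipeline.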

We've done the same kernel time measurements for NVIDIA GeForce and Quadro GPUs. Here you can get the document with the benchmarks.

Software for Jetson performance comparison

We've released the software for a GPU-based camera application on GitHub, where both binaries and source code for our gpu camera sample project are available for download. It's implemented for Windows 10, Linux Ubuntu 18.04 and L4T. Apart from a full image processing pipeline on GPU for still images from SSD and for live camera output, there are options for streaming and for glass-to-glass (G2G) measurements to evaluate the real latency of camera systems on Jetson. The software currently works with machine vision cameras from XIMEA, Basler, Balluff, Imperx, Lucid Vision Labs, Daheng Imaging, Mindvision, etc. Support for new cameras is in progress.

We've also implemented support for MIPI CSI cameras in the gpu camera sample project. Images from these cameras can be processed on the GPU instead of the hardware-based NVIDIA ISP (libargus). The reasons for doing so are better image quality thanks to the 16-bit workflow and a more sophisticated image processing pipeline.

Recently we've implemented high-performance FastVCR software for XIMEA cameras. You can download that application from our site (both for Windows and for Jetson). It's ready to work with USB3 and PCIe XIMEA cameras, and it can also work with RAW/PGM/TIFF images from SSD, which can be useful for performance and quality evaluation. Apart from a GPU-based ISP, that software includes real-time image sensor control, many image/video encoding options, RTSP streaming, a CLI version, a Glass-to-Glass test, etc. We offer custom software design to adapt that software to your camera. Please send us your specification if you have such a task.

To check the performance of Fastvideo SDK on a laptop/desktop/server GPU without any programming, you can download the Fast CinemaDNG Processor software with a GUI for Windows. That software has a Performance Benchmarks window, where you can see the timing for each stage of image processing. This is a more sophisticated method of performance testing, because the image processing pipeline in that software can be quite advanced, and you can test any module you need. You can also run tests on images with different resolutions to see how much the performance depends on image size, content and other parameters.