Fast FFmpeg J2K decoder on NVIDIA GPU
FFmpeg is great software that offers a huge number of options for image and video processing and for handling multimedia files and streams. It supports many formats, codecs and filters for various tasks, which is why it is so widespread. Many applications are based on FFmpeg, and their flexibility and performance are really impressive. FFmpeg itself is a command-line application that is also capable of video transcoding and video post-production. The name FFmpeg comes from the MPEG video standards group, combined with "FF", which stands for "fast forward".
To get started with FFmpeg, the user needs to download the software from ffmpeg.org or zeranoe.com. To build a custom solution, the user has to download the source code of the latest version from Git and build FFmpeg with all the necessary options.
How can FFmpeg decode J2K?
To start, let's answer a very simple question: which JPEG2000 codec does FFmpeg use by default? Surprisingly, it is not the OpenJPEG codec. FFmpeg has its own J2K codec. The FFmpeg documentation says the following: "The native jpeg2000 encoder is lossy by default, the -q:v option can be used to set the encoding quality. Lossless encoding can be selected with -pred 1".
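The quoted options can be illustrated with a minimal sketch that only assembles the argument list for the native encoder. The function name and the file names are hypothetical; the encoder name `jpeg2000` and the `-q:v` and `-pred 1` options are the ones from the FFmpeg documentation quoted above.

```python
def native_j2k_encode_args(src, dst, lossless=False, quality=None):
    """Build an ffmpeg argument list for the native jpeg2000 encoder.

    src and dst are hypothetical file names chosen by the caller.
    """
    args = ["ffmpeg", "-i", src, "-c:v", "jpeg2000"]
    if lossless:
        # Lossless mode, as described in the FFmpeg documentation
        args += ["-pred", "1"]
    elif quality is not None:
        # Lossy mode (the default) with an explicit quality setting
        args += ["-q:v", str(quality)]
    args.append(dst)
    return args


# Example: print the two variants instead of running them
print(" ".join(native_j2k_encode_args("input.png", "lossy.jp2", quality=5)))
print(" ".join(native_j2k_encode_args("input.png", "lossless.jp2", lossless=True)))
```

The list can be passed to `subprocess.run` on a machine where ffmpeg is installed.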
The native codec is not the best choice, so we can use the OpenJPEG library (libopenjpeg) as the default FFmpeg codec for J2K encoding and decoding on CPU. OpenJPEG is a reliable and sophisticated solution with a wide set of features from the JPEG2000 standard. It is a very interesting product, but it works on CPU only. Since the J2K algorithm has very high computational complexity, OpenJPEG remains slow even after its recent optimization and multithreading improvements. Here you can see J2K benchmarks on CPU and GPU for the OpenJPEG, Jasper, Kakadu, J2k-Codec, CUJ2K and Fastvideo codecs, covering encoding and decoding performance for images with 2K and 4K resolutions (both lossy and lossless algorithms).
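As a rough sketch of how OpenJPEG can be selected explicitly for decoding: in FFmpeg, a `-c:v` option placed before `-i` picks the decoder for that input. This assumes an FFmpeg build configured with `--enable-libopenjpeg`; the function name and file names are hypothetical.

```python
def openjpeg_decode_args(src, dst_pattern):
    """Build an ffmpeg argument list that forces the libopenjpeg decoder.

    Assumes FFmpeg was built with --enable-libopenjpeg.
    """
    return [
        "ffmpeg",
        "-c:v", "libopenjpeg",  # input option: decoder for the next input
        "-i", src,
        dst_pattern,            # e.g. an image sequence pattern
    ]


print(" ".join(openjpeg_decode_args("input.mxf", "frame_%04d.png")))
```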
How does FFmpeg work internally?
FFmpeg is built around the idea of consecutive software modules that are applied to your data. Since most FFmpeg codecs and filters work on CPU, both the input and the output of each processing module reside in CPU memory, though FFmpeg can now also work with the GPU-based NVENC encoder and NVDEC decoder on NVIDIA GPUs. These NVIDIA codecs support H.264, H.265 and more.
To create a conventional FFmpeg codec for fast J2K encoding or decoding on GPU, we took into account the architecture of FFmpeg applications and FFmpeg codecs. We implemented an FFmpeg J2K decoder that works on GPU with a batch of images in multithreaded mode to achieve maximum performance. Externally it looks like a conventional decoder with internal multithreading. This J2K decoder can now be used in FFmpeg and included in an FFmpeg processing workflow as part of any complex task.
This FFmpeg decoder is fully based on the Fastvideo J2K decoder, which is implemented on NVIDIA GPU. It can be used in many FFmpeg applications in the standard way: the user just needs to build FFmpeg with the J2K library.
How to build FFmpeg with Fastvideo J2K decoder
1. Download the FFmpeg source for Ubuntu. Version 4.2.2 was used for testing on Ubuntu 18.04.
You can retrieve the source code through Git by using the following command:
To get the Fastvideo SDK, please send your request via the form at the bottom of this page.
2. Install the NVENC headers with install_nvenc.sh. The NVIDIA driver version used is 440.33.01.
3. Copy the fastvideo_sdk folder (including its /inc and /lib folders) to the root of the FFmpeg source folder.
6. Configure FFmpeg with the minimum options listed below. The end user can extend this list. The CUDA path is the default one for version 10.1.
8. make install
9. Update LD_LIBRARY_PATH for FFmpeg and Fastvideo libraries
10. Copy the /video folder to the /bin folder of FFmpeg and run run.snowman.sh to test.
Fastvideo J2K decoder parameters: threads and batch
The Fastvideo J2K decoder on GPU for FFmpeg has two additional parameters that influence performance: -threads and -batch.
The "threads" parameter is an FFmpeg parameter. It defines the number of concurrent CPU threads used for processing. This option is available for the J2K decoder with frame-level multithreading.
The "batch" parameter defines the number of frames processed in parallel by one decoder. FFmpeg supports batch mode only for a single decoder, not for multiple decoders. This is not enough for the Fastvideo J2K decoder to reach its best performance.
To work around this limitation, the Fastvideo J2K decoder for FFmpeg uses an internal client-server architecture. The client is an FFmpeg worker thread that takes the bytestream from FFmpeg and sends it to a decoder. The number of real J2K decoders is the number of FFmpeg worker threads divided by the batch size. Therefore, the number of worker threads has to be divisible by the batch size.
To get the best performance, the batch size has to be at least 4 and the number of worker threads has to be at least 8. With these minimum values we get two real J2K decoders. Increasing the batch size and the number of J2K decoders also increases GPU memory usage.
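The relation between threads, batch size and the number of real decoders can be checked with a few lines of arithmetic. This is only an illustration of the rule stated above, not code from the decoder itself; the function name is hypothetical.

```python
def real_decoder_count(threads, batch):
    """Number of real J2K decoders = worker threads / batch size.

    Raises ValueError when the thread count is not divisible by the
    batch size, which the text above requires.
    """
    if threads <= 0 or batch <= 0:
        raise ValueError("threads and batch must be positive")
    if threads % batch != 0:
        raise ValueError("threads must be divisible by batch")
    return threads // batch


print(real_decoder_count(8, 4))   # 2
print(real_decoder_count(4, 2))   # 2
print(real_decoder_count(16, 4))  # 4
```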
Fast J2K transcoding with FFmpeg and NVENC from MXF to MP4
These are command-line examples of how we could decode the snowman.mxf file from the current folder and create an MP4 video file with H.264 or H.265 encoding in the same pipeline, with different sets of parameters:
The test video file can be downloaded from http://dcp3d.ru/download/SNOWMAN-DCP3D.rar
Basically, we read and parse frames from the snowman.mxf file, decode them on the GPU with id = 0 (batch = 2, four CPU threads), encode that stream to MP4 at 5 Mbit/s and save it to an *.mp4 file in the current folder.
The usable batch size and number of threads depend on the amount of free GPU memory. If the chosen parameters are too large, the user gets a warning suggesting a smaller batch size or a GPU with more memory.
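The pipeline described here can be sketched as an argument list builder. Several things in it are assumptions: the name under which the Fastvideo decoder is registered in FFmpeg is not given in the text, so it is passed in by the caller; placing `-threads` and `-batch` before `-i` as input options is assumed; and GPU selection is left out because its option name is not given either. The encoder names `h264_nvenc` and `hevc_nvenc` and the `-b:v` option are standard FFmpeg.

```python
def j2k_to_mp4_args(decoder, src, dst, threads=4, batch=2,
                    encoder="h264_nvenc", bitrate="5M"):
    """Build an ffmpeg argument list for J2K (MXF) to MP4 transcoding.

    decoder is the caller-supplied name of the J2K decoder (hypothetical
    here); use "hevc_nvenc" as encoder for H.265 output.
    """
    if threads % batch != 0:
        raise ValueError("threads must be divisible by batch")
    return [
        "ffmpeg",
        "-threads", str(threads),  # assumed to be an input option
        "-batch", str(batch),      # assumed to be an input option
        "-c:v", decoder,           # decoder name is an assumption
        "-i", src,
        "-c:v", encoder,
        "-b:v", bitrate,
        dst,
    ]


# "YOUR_J2K_DECODER" is a placeholder, not a real decoder name
print(" ".join(j2k_to_mp4_args("YOUR_J2K_DECODER", "snowman.mxf", "snowman.mp4")))
```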
Simple benchmark for FFmpeg J2K transcoding to H.264 on GPU
The task of transcoding J2K to H.264 is quite common, but it is not possible to get realtime performance with the OpenJPEG codec in FFmpeg. The Fastvideo J2K decoder together with NVIDIA NVENC can solve the full J2K transcoding task on GPU, and it is much faster than realtime. The resulting performance depends on many factors, so here we just indicate a standard case:
Transcoding from 200 Mbit/s J2K to 15 Mbit/s H.264 on an NVIDIA Quadro RTX 4000 can be done for two 1080p/60 streams (4:2:2, 10-bit) at the same time, or for four 1080p/30 streams.
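A short calculation shows why the two stream configurations are equivalent in terms of load: both amount to the same aggregate frame rate. The sketch below uses only the figures quoted above and assumes 1080p means 1920x1080; the function names are hypothetical.

```python
WIDTH, HEIGHT = 1920, 1080  # assumed 1080p frame size


def aggregate_fps(streams, fps_per_stream):
    """Total number of frames per second across all streams."""
    return streams * fps_per_stream


def pixel_rate(streams, fps_per_stream, width=WIDTH, height=HEIGHT):
    """Total number of pixels per second across all streams."""
    return aggregate_fps(streams, fps_per_stream) * width * height


def bitrate_reduction(source_mbit, target_mbit):
    """Ratio between source and target bitrates."""
    return source_mbit / target_mbit


print(aggregate_fps(2, 60), aggregate_fps(4, 30))  # 120 120
print(pixel_rate(2, 60))                           # 248832000
print(round(bitrate_reduction(200, 15), 1))        # 13.3
```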
Fast J2K decoding with FFmpeg from MXF to RGB or YUV frames
The Fastvideo J2K decoder supports multiple output formats: NV12, P010, YUV444, YUV444P10, RGB24 and RGB48. Formats NV12, P010, YUV444 and YUV444P10 are native for NVENC. A decoded frame can be placed in a host or a device buffer. A device buffer is used for NVENC, to avoid additional device-to-host and host-to-device copies. A host buffer is used for integration with other FFmpeg codecs. Formats NV12, P010, YUV444 and YUV444P10 support both buffer types. Formats RGB24 and RGB48 support only the host buffer type.
NV12 is the native NVENC format. In contrast to classic planar YUV420, it stores the U and V samples interleaved in a single UV plane. P010 is the 16-bit-per-element version of NV12.
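The format and buffer rules from this paragraph can be written down as a small lookup table. This is only a sketch of the rules as stated in the text, not the decoder's API; the names `SUPPORTED_BUFFERS` and `pick_buffer` are hypothetical.

```python
# Buffer types supported by each output format, as listed in the text
SUPPORTED_BUFFERS = {
    "NV12": {"host", "device"},
    "P010": {"host", "device"},
    "YUV444": {"host", "device"},
    "YUV444P10": {"host", "device"},
    "RGB24": {"host"},
    "RGB48": {"host"},
}


def pick_buffer(fmt, feeds_nvenc):
    """Choose a buffer type for a decoded frame.

    A device buffer is chosen only when the frame goes straight to NVENC
    and the format supports it; otherwise the host buffer is used.
    """
    buffers = SUPPORTED_BUFFERS[fmt]  # KeyError for unknown formats
    if feeds_nvenc and "device" in buffers:
        return "device"
    return "host"


print(pick_buffer("NV12", feeds_nvenc=True))    # device
print(pick_buffer("NV12", feeds_nvenc=False))   # host
print(pick_buffer("RGB24", feeds_nvenc=False))  # host
```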
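The difference between the two layouts is easiest to see on a tiny frame. The sketch below converts a planar 8-bit YUV 4:2:0 frame (Y plane, then U plane, then V plane) into NV12 (Y plane, then one plane of alternating U and V bytes). It is a plain CPU illustration with a hypothetical function name, unrelated to how the GPU decoder does it.

```python
def i420_to_nv12(frame, width, height):
    """Convert a planar 8-bit YUV 4:2:0 frame to the NV12 layout."""
    if width % 2 or height % 2:
        raise ValueError("width and height must be even")
    y_size = width * height
    c_size = (width // 2) * (height // 2)
    if len(frame) != y_size + 2 * c_size:
        raise ValueError("unexpected frame size")
    y = frame[:y_size]
    u = frame[y_size:y_size + c_size]
    v = frame[y_size + c_size:]
    uv = bytearray(2 * c_size)
    uv[0::2] = u  # U samples at even positions
    uv[1::2] = v  # V samples at odd positions
    return bytes(y) + bytes(uv)


# 4x2 frame: 8 luma bytes, then 2 U bytes, then 2 V bytes
planar = bytes(range(8)) + b"\x10\x11" + b"\x20\x21"
nv12 = i420_to_nv12(planar, 4, 2)
print(nv12[8:].hex())  # 10201121
```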
This is an example of how we could decode the snowman.mxf file from the current folder and create a series of RGB or YUV images:
We may need such a solution if the final video encoding is going to be done on CPU. If we compare CPU-based H.264 or H.265 encoding performance with J2K decoding on GPU, the J2K decoding is much faster, so we can decode multiple streams on GPU and then encode them on CPU. Usually we need one CPU thread per stream for encoding, so a multicore CPU is a must here. This is effectively a live transcoding task where we combine GPU and CPU to build a high-performance solution.
If we don't have the output format you need, please let us know and we will add it. We do both J2K decoding and format conversion on GPU to improve total performance, which is very important when processing multiple streams.
Fast J2K decoding with FFmpeg for MXF Player
If you have ever tried to play MXF files with VLC player, you probably know the result. Unfortunately, MXF video with J2K frames is too demanding for CPU-based VLC software, and you can hardly reach even 1 fps playback, which is not acceptable.
A DCP package contains J2K frames, and it is quite a difficult task to offer smooth preview of that content on CPU. Now you can decode J2K frames on GPU and show the results via ffplay or any other player connected to FFmpeg output.
Apart from J2K decoding on GPU, we will soon release an FFmpeg-based J2K encoder on GPU that can be used to create DCP with very high performance. It will be much faster than OpenJPEG.