Low-latency software for remote collaborative post production

Author: Fyodor Serzhenko

Fastvideo is a team of professionals in GPU image processing, realtime camera applications, digital cinema and high performance imaging solutions. We have been working with production companies for a long time, and we have recently implemented low-latency software for remote collaborative post production.

Today, with restrictions on in-person collaboration, delays in shipping and limitations on travel, a single point of ingest and delivery for an entire production becomes vitally important. The main goal is to offer all services both on-premises and remotely. We believe that in the near future we will see virtual and distributed post production finishing.

When you are shooting a movie on a tight schedule and need to accelerate your post production workflow, a remote collaborative approach is the right solution. You don't need to have all professionals on-site: with a remote approach you can collaborate in realtime wherever your teammates are located. The industry trend towards remote production solutions is clear, and it is not only due to the coronavirus. Accelerating post via remote operation is viable, and companies strive to remove the limitations of the conventional workflow: professionals can now choose the place and time to work remotely on post production.

Nowadays there are quite a lot of software solutions that offer reliable remote access via local networks or the public internet. Still, most of them were built without professional tasks like colour grading, VFX and compositing in mind. In post production we need professional hardware that can display 10-bit or 12-bit footage. Skype, Zoom and many other video conferencing solutions are not capable of that, so we have implemented software to solve this problem.

Business goals of remote collaborative post production

You will share content in realtime for collaborative workflows in post production
Lossless or visually lossless encoding guarantees high image quality and exact colour reproduction
Reduced travel and rent costs for the team due to remote colour grading and reviewing
Remote work lets you choose the best professionals for the production
Your team will work on multiple projects (time saving and multi-tasking)

Goals from a technical viewpoint

Low latency software
Fast and reliable data transmission over internal or public network
Fast acquisition and processing of SD/HD-SDI and 3G-SDI streams (unpacking, packing, transforms)
Realtime J2K encoding and decoding (lossy or lossless)
High image quality
Precise colour reproduction
Maximum bit depth (10-bit or 12-bit per channel)

Task to be solved

The post production industry needs a low-latency, high quality video encode/decode solution for remote work, built on the following pipeline:

Capture baseband video streams via HD-SDI or 3G-SDI frame grabber (Blackmagic DeckLink 8K Pro, AJA Kona 4 or Kona 5)
Live encoding with J2K codec that supports 10-bit YUV 4:2:2 and 10/12-bit 4:4:4 RGB
Send the encoded material via TCP/UDP packets to a receiver/decoder - point-to-point transmission over ethernet or public internet
Decode from stream at source colorspace/bit-depth/resolution/subsampling - Rec.709/Rec.2020, 10-bit 4:2:2 YUV or 10/12-bit 4:4:4 RGB
Send stream to baseband video playout device (Blackmagic/AJA frame grabber) to display 10-bit YUV 4:2:2 or 10/12-bit 4:4:4 RGB material on external display
Latency requirements: sub 300 ms
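The transmission step of the pipeline above can be illustrated with a minimal sketch of how one encoded frame might be split into UDP-sized packets and put back together on the receiving side. This is not the protocol used by the Fastvideo software: the 12-byte header layout, the 1200-byte payload size (chosen as an assumption to stay below a typical 1500-byte MTU with room for VPN overhead) and the function names are all hypothetical.

```python
import struct

# Hypothetical header: frame id, chunk index, chunk count
# (three unsigned 32-bit integers in network byte order).
HEADER = struct.Struct("!III")
MAX_PAYLOAD = 1200  # assumed payload size per datagram, in bytes


def split_frame(frame_id, codestream, max_payload=MAX_PAYLOAD):
    """Split one encoded frame into packets, each with a small header."""
    count = max(1, -(-len(codestream) // max_payload))  # ceiling division
    packets = []
    for index in range(count):
        chunk = codestream[index * max_payload:(index + 1) * max_payload]
        packets.append(HEADER.pack(frame_id, index, count) + chunk)
    return packets


def reassemble(packets):
    """Rebuild a frame from packets received in any order.

    Returns (frame_id, codestream), or None if a chunk is missing.
    """
    chunks = {}
    frame_id = count = None
    for packet in packets:
        fid, index, total = HEADER.unpack_from(packet)
        if frame_id is None:
            frame_id, count = fid, total
        elif (fid, total) != (frame_id, count):
            raise ValueError("packets belong to different frames")
        chunks[index] = packet[HEADER.size:]
    if count is None or any(i not in chunks for i in range(count)):
        return None
    return frame_id, b"".join(chunks[i] for i in range(count))
```

With UDP, a lost packet leaves the frame incomplete (the sketch returns None), and the receiver has to decide whether to wait, request a resend or drop the frame. With TCP, the same header could serve as simple message framing, at the cost of retransmission delays.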

Basic hardware layout: Video Source (Baseband Video) -> Capture device (DeckLink) -> SDI unpacking on GPU -> J2K Encoder on GPU -> Facility Firewall (IPsec VPN) -> Public Internet -> Remote Firewall (IPsec VPN) -> J2K Decoder on GPU -> SDI packing on GPU -> Output device (DeckLink) -> Video Display (Baseband Video)
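To make the "SDI unpacking" and "SDI packing" stages more concrete, here is a minimal CPU reference sketch for the v210 pixel format (10-bit YUV 4:2:2), where six pixels are stored in four little-endian 32-bit words. It only illustrates the publicly documented v210 layout; it is not the GPU implementation, and the function names are hypothetical.

```python
import struct


def unpack_v210_group(block):
    """Unpack one 16-byte v210 group into 6 luma and 3 + 3 chroma samples."""
    if len(block) != 16:
        raise ValueError("a v210 group is exactly 16 bytes")
    words = struct.unpack("<4I", block)
    # Each word holds three 10-bit samples in bits 0-9, 10-19 and 20-29.
    s = [(w >> shift) & 0x3FF for w in words for shift in (0, 10, 20)]
    # Sample order: Cb0 Y0 Cr0 | Y1 Cb1 Y2 | Cr1 Y3 Cb2 | Y4 Cr2 Y5
    y = [s[1], s[3], s[5], s[7], s[9], s[11]]
    cb = [s[0], s[4], s[8]]
    cr = [s[2], s[6], s[10]]
    return y, cb, cr


def pack_v210_group(y, cb, cr):
    """Pack 6 luma and 3 + 3 chroma 10-bit samples into 16 bytes of v210."""
    s = [cb[0], y[0], cr[0], y[1], cb[1], y[2],
         cr[1], y[3], cb[2], y[4], cr[2], y[5]]
    words = [s[i] | (s[i + 1] << 10) | (s[i + 2] << 20)
             for i in range(0, 12, 3)]
    return struct.pack("<4I", *words)
```

On the GPU the same bit manipulation can be applied to every 16-byte group independently, which is presumably why this stage parallelizes well.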

Hardware/software/parameters

HD-SDI or 3G-SDI frame grabbers: Blackmagic DeckLink 8K Pro, AJA Kona 4, AJA Kona 5
NVIDIA GPU: GeForce RTX 2070, Quadro RTX 4000 or better
OS: Windows 10 or Linux (Ubuntu/CentOS)
Frame Size: 1920×1080 (Full HD)
Frame Rates: 23.976, 24, 25, 29.97, 30 fps
Bit-depth: 8/10/12 (encode - ingest), 8/10/12 (decode - display)
Pixel formats: RGB or RGBA, v210, R12L
Frame compression: lossy or lossless
Colour Spaces for 8/10-bit YUV or 8/10/12-bit RGB: Rec.709, DCI-P3, P3-D65, Rec.2020 (optional)
Audio: 2-channel PCM or more
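The parameters above also show why compression is needed before sending video over the public internet. The following sketch estimates the uncompressed data rate of the active picture for the v210 and R12L pixel formats. It assumes the usual v210 row layout (6 pixels per 16-byte group, rows padded to a multiple of 128 bytes) and tightly packed R12L rows (8 pixels in 36 bytes); blanking and audio are ignored, and the function names are hypothetical.

```python
def v210_row_bytes(width):
    """Bytes per v210 row: 16 bytes per 6 pixels, padded to 128 bytes."""
    groups = -(-width // 6)               # ceiling division
    return -(-(groups * 16) // 128) * 128


def r12l_row_bytes(width):
    """Bytes per R12L row: 12-bit RGB, 8 pixels packed into 36 bytes."""
    return -(-width // 8) * 36


def mbit_per_second(row_bytes, height, fps):
    """Uncompressed video data rate in megabits per second."""
    return row_bytes * height * fps * 8 / 1e6


if __name__ == "__main__":
    for name, row in (("v210", v210_row_bytes(1920)),
                      ("R12L", r12l_row_bytes(1920))):
        for fps in (24, 30):
            rate = mbit_per_second(row, 1080, fps)
            print(f"{name} 1920x1080 @ {fps} fps: {rate:.1f} Mbit/s")
```

For 1920×1080 at 30 fps this gives about 1.3 Gbit/s for v210 and about 2.2 Gbit/s for R12L, which is far more than a typical public internet connection can carry.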

How to encode/decode J2K images fast?

CPU-based J2K codecs are quite slow. For example, FFmpeg-based software solutions work with the J2K codec from libavcodec (mj2k) or with OpenJPEG, and both are far from fast. Just test that software to check the latency and the performance. This is not surprising, since the J2K algorithm has very high computational complexity. Even with multiple threads or processes on the CPU, the performance of the J2K solution from libavcodec is still insufficient. This is a problem even for 8-bit frames at 2K resolution, and for 4K images (12-bit, 60 fps) the performance is much worse.

The reason why FFmpeg and other software are not fast at that task is clear: they work on the CPU and they are not optimized for high performance. Benchmark comparisons of J2K encoding and decoding for the OpenJPEG, Jasper, Kakadu, J2K-Codec, CUJ2K and Fastvideo codecs can be used to check the performance for images with 2K and 4K resolutions (J2K lossy/lossless algorithms).

Maximum performance for J2K encoding and decoding in streaming applications is achieved in multithreaded batch mode. This is a must to ensure massively parallel processing of the JPEG2000 algorithm. Batch processing means that we need to collect several images before processing starts, which is bad for latency. Batch with multithreading improves the performance, but the latency gets worse. This is a trade-off between performance and latency for J2K encoding and decoding. For example, in a remote colour grading application we need minimum latency, so we process each J2K frame separately, without batch and without multithreading. In most other cases it is better to choose an acceptable latency and get the best performance with batch and multithreading.
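The latency cost of batching can be estimated with simple arithmetic. In a live stream, frames arrive one frame interval apart, so the first frame of a batch has to wait until the rest of the batch has been captured. The sketch below models only this collection delay, under the assumption that encoding starts when the batch is full. Encoding, network and decoding time are not included, so the result is a lower bound on the added latency. The function names are hypothetical.

```python
def frame_interval_ms(fps):
    """Time between two consecutive frames, in milliseconds."""
    return 1000.0 / fps


def batch_collection_delay_ms(batch_size, fps):
    """Time the first frame of a batch waits until the batch is full."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (batch_size - 1) * frame_interval_ms(fps)


if __name__ == "__main__":
    budget_ms = 300  # latency requirement stated for the pipeline
    for batch_size in (1, 2, 4, 8):
        delay = batch_collection_delay_ms(batch_size, 24)
        share = 100.0 * delay / budget_ms
        print(f"batch {batch_size}: {delay:.1f} ms waiting "
              f"({share:.0f}% of the {budget_ms} ms budget)")
```

At 24 fps a batch of 4 frames already adds 125 ms of waiting before encoding starts, which is a large part of a 300 ms budget. A batch of 1 adds none, which corresponds to processing each frame separately.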