Fast image loader for Deep Learning Frameworks
The Fastvideo image loader for Deep Learning Frameworks is designed to build a processing pipeline that can be integrated into deep learning training and inference applications. It is a very fast loader that can significantly speed up the loading of image series before training.
This library accelerates all preprocessing stages of input data for deep learning applications. The main idea is to offload all preprocessing to the GPU in order to achieve better performance for training and inference. It can be considered an analog of DALI, though we don't do anything on the CPU: in our case everything is done on the GPU.
How we can do that
To offer fast image loading, we need to standardize the input data. In general, input images can be stored in various image formats, and they can use different types of compression as well. That's why we need to prepare our data set. We need to do that just once, and after that we work with the new data set on each training iteration.
We can use CPU-based ImageMagick or any other software that supports many different image formats. We open each image from the original set and save it in JPEG format. Please note that for fast JPEG decompression, we need to add a sufficient number of restart markers to each JPEG image at the encoding stage. Restart markers split the compressed stream into intervals that can be decoded independently, which is what makes parallel decoding on the GPU possible. As a result, we get an image set that consists of JPEGs with built-in restart markers.
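Once the data set is converted, it can be useful to confirm that the files really contain restart markers. The following is a minimal sketch of such a check, not part of the Fastvideo loader: the function name restart_info is hypothetical, and the parser assumes a baseline JPEG whose header segments are not padded with fill bytes. It reads the restart interval from the DRI segment (marker 0xFFDD) and counts the RST0..RST7 markers (0xFFD0..0xFFD7) in the first scan.

```python
import struct


def restart_info(jpeg_bytes: bytes) -> tuple[int, int]:
    """Return (restart interval in MCUs, number of RSTn markers in the first scan).

    Hypothetical helper for checking a prepared data set. An interval of 0
    means the file has no DRI segment, i.e. no restart markers were requested.
    """
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")

    pos = 2
    interval = 0
    # Walk the header segments until Start Of Scan (0xFFDA).
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = jpeg_bytes[pos + 1]
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        if marker == 0xDD:  # DRI: Define Restart Interval
            (interval,) = struct.unpack(">H", jpeg_bytes[pos + 4:pos + 6])
        pos += 2 + length  # the length field counts itself but not the marker
        if marker == 0xDA:  # SOS: entropy-coded data starts here
            break
    else:
        raise ValueError("no SOS marker found")

    # Inside entropy-coded data, 0xFF is followed by 0x00 (byte stuffing),
    # by 0xFF (fill byte), or by a real marker.
    count = 0
    while pos + 1 < len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            pos += 1
            continue
        nxt = jpeg_bytes[pos + 1]
        if nxt == 0xFF:  # fill byte: re-examine the next position
            pos += 1
            continue
        if 0xD0 <= nxt <= 0xD7:  # RST0..RST7
            count += 1
        elif nxt != 0x00:  # EOI or another segment ends the scan
            break
        pos += 2
    return interval, count


# Synthetic stream (not a decodable image): SOI, DRI with interval 8,
# an empty SOS header, then data containing RST0 and RST1, then EOI.
demo = bytes.fromhex("ffd8" "ffdd00040008" "ffda0002" "1234ff0056ffd078ffd19affd9")
print(restart_info(demo))  # (8, 2)
```

For the encoding step itself, one option besides ImageMagick is libjpeg's cjpeg tool, which has a -restart option for emitting restart markers. Which interval counts as "sufficient" depends on the decoder and the image size, so it is best chosen by measurement.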
Now we can start working with the new set. We can load JPEGs to the GPU very fast because they are compressed and therefore have small file sizes. After that we can decode them quickly on the GPU. We can also add random resize and color transforms (image augmentation), which are frequently used to give the neural network more variety during training.
To optimize the solution, it's a good idea to load not just one image but a batch of images at a time, which reduces the loading time per image, and to use batch decoding to speed up decompression. We can also run several threads or processes on each GPU to achieve better GPU occupancy. In a multithreaded application on a single Tesla V100 GPU, we can process around 1,000 JPEG images per second this way at a resolution of around 1 MPix (though with a more complicated pipeline).
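The batching idea can be illustrated on the host side with a short sketch. This is not the Fastvideo implementation: iter_batches and read_file are hypothetical names, and the code only reads the compressed files from disk in parallel and groups them into batches. The GPU decoding step is left out, because it depends on the decoder library in use.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, Iterator


def read_file(path: Path) -> bytes:
    """Read one compressed JPEG file into memory without decoding it."""
    return path.read_bytes()


def iter_batches(
    paths: Iterable[Path], batch_size: int, workers: int = 4
) -> Iterator[list[bytes]]:
    """Yield lists of compressed JPEG byte strings, batch_size files at a time.

    Files are read by a pool of threads, so disk reads overlap. Results keep
    the order of `paths`; the last batch may be smaller than batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batch: list[bytes] = []
        for data in pool.map(read_file, paths):
            batch.append(data)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


# Typical use (the directory name and the decode step are placeholders):
#
#   for batch in iter_batches(sorted(Path("dataset").glob("*.jpg")), batch_size=32):
#       ...  # copy the batch to the GPU and run a batch JPEG decoder on it
```

Threads are sufficient here because file reads release Python's global interpreter lock. A production loader would also overlap reading with GPU work, which this sketch does not attempt.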
Quite often we don't use data sets with 1 MPix images for training, because a lower resolution can be enough to accomplish the task. That's why we need resolutions like 480p (which means 640x480) or less. Compared with a 1 MPix image, such an image holds about 3 times less data, so we could possibly reach a data loading performance of up to 2,500 - 3,000 images per second on a single Tesla V100 GPU. This is our goal, and we will try to find a solution that reaches it.
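The estimate above can be reproduced with a few lines of arithmetic. The sketch below assumes that throughput scales inversely with the number of pixels per image, which is an optimistic simplification: fixed per-image costs such as file access and kernel launches do not shrink with resolution, so the result is better read as an upper bound. The function name scaled_rate is illustrative.

```python
def scaled_rate(base_rate: float, base_pixels: int, new_pixels: int) -> float:
    """Scale an images-per-second rate inversely with the pixel count per image."""
    return base_rate * base_pixels / new_pixels


PIXELS_1MPIX = 1_000_000
PIXELS_480P = 640 * 480  # 307,200 pixels

ratio = PIXELS_1MPIX / PIXELS_480P
estimate = scaled_rate(1_000, PIXELS_1MPIX, PIXELS_480P)

print(f"pixel ratio: {ratio:.2f}")                          # pixel ratio: 3.26
print(f"linear-scaling estimate: {estimate:.0f} images/s")  # linear-scaling estimate: 3255 images/s
```

The target of 2,500 - 3,000 images per second sits somewhat below this linear estimate, which leaves room for the overheads that do not scale with image size.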
At the moment, according to NVIDIA, their DALI system can offer image loading for neural networks with a performance of up to 23,000 images per second on DGX-2, which has 16 Tesla V100 GPUs. This means that DALI delivers around 1,500 images per second per Tesla V100 GPU. This is the benchmark that we are going to improve on.
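The per-GPU figure follows from dividing the quoted total by the number of GPUs. The short calculation below also relates it to the 2,500 - 3,000 images per second target. It is only a rough comparison: it assumes perfectly even scaling across the 16 GPUs, and the image resolution behind the DALI figure is not stated here.

```python
DALI_TOTAL_RATE = 23_000  # images per second on DGX-2, as quoted above
DGX2_GPU_COUNT = 16       # Tesla V100 GPUs in a DGX-2


def speedup(target_rate: float, baseline_rate: float) -> float:
    """Return how many times faster target_rate is than baseline_rate."""
    return target_rate / baseline_rate


per_gpu = DALI_TOTAL_RATE / DGX2_GPU_COUNT
print(f"DALI per GPU: {per_gpu} images/s")  # DALI per GPU: 1437.5 images/s

for target in (2_500, 3_000):
    # Prints 1.74x for 2500 and 2.09x for 3000.
    print(f"{target} images/s is {speedup(target, per_gpu):.2f}x the per-GPU baseline")
```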