Welcome back! We hope you had a nice end-of-year break and that you're as excited as we are to kick off this new season of getting dizzy at the pace of AI breakthroughs.
Grab your favourite drink and let's dive into this year's first set of news. As usual, feel free to give our new blog post a read before jumping in. This time, we're starting a series on the fragmented space of model compression, beginning with an introductory post to give you some context.
Can’t stop diffusin’
Diffusion models provide a powerful and flexible framework for modelling complex data distributions, which quickly made them the go-to for image generation. Because they operate sequentially, they are also inherently slow and have become the subject of several studies attempting to speed them up and unlock their full potential. We covered one such technique (DeepCache) in a previous edition, and we're now looking at another tool that takes a different approach to the issue. Instead of optimising the sequential denoising itself, StreamDiffusion batches the denoising steps and adds several techniques that improve concurrent input processing to enable real-time diffusion.
Why would you care? StreamDiffusion outperforms the original AutoPipeline from Diffusers by up to 59.6x while reducing GPU power usage by 2x on average. You can also use it on top of SDXL Turbo to get real-time output if you're already relying on that model. Finally, StreamDiffusion is compatible with HuggingFace's diffusers library and can be integrated into your pipeline with a few lines of code. For example:
import torch
from diffusers import AutoencoderTiny, StableDiffusionPipeline
from diffusers.utils import load_image
from streamdiffusion import StreamDiffusion
from streamdiffusion.image_utils import postprocess_image

# Load the pipeline and wrap it in StreamDiffusion
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/sd-turbo").to(device=torch.device("cuda"), dtype=torch.float16)
stream = StreamDiffusion(pipe, t_index_list=[32, 45], torch_dtype=torch.float16)

# Enable acceleration: tiny VAE and memory-efficient attention
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd").to(device=pipe.device, dtype=pipe.dtype)
pipe.enable_xformers_memory_efficient_attention()

# Prepare the stream and inputs
prompt = "An astronaut riding a unicorn"
stream.prepare(prompt)
init_image = load_image("your_example_image").resize((512, 512))

# Warm up the stream batch (at least len(t_index_list) iterations)
for _ in range(2):
    stream(init_image)

# Run the stream infinitely
while True:
    x_output = stream(init_image)
    postprocess_image(x_output, output_type="pil")[0].show()
    input_response = input("Press Enter to continue or type 'stop' to exit: ")
    if input_response == "stop":
        break
How does it work? The core idea behind StreamDiffusion is to make the denoising process more efficient. The first technique, Stream Batching, batches denoising operations and processes them in parallel: instead of fully denoising one image before accepting the next input, each element of the batch holds a different input at a different denoising step, so the batch size is determined by the number of denoising steps. With stream batching, processing time no longer grows linearly with the number of steps, and the trade-off shifts from balancing processing time against generation quality to balancing VRAM capacity against generation quality.
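To make the idea concrete, here is a minimal sketch of a stream-batching loop. This is not StreamDiffusion's actual implementation: encode, unet_step, and decode are placeholder functions standing in for the VAE and U-Net calls, and scheduler details are omitted.

import torch

# Placeholder stand-ins for the real VAE encode/decode and a single batched U-Net denoising step
def encode(frame: torch.Tensor) -> torch.Tensor:
    return frame  # VAE encode to latent space (placeholder)

def unet_step(latents: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    return latents * 0.9  # one denoising step for every batch element (placeholder)

def decode(latent: torch.Tensor) -> torch.Tensor:
    return latent  # VAE decode back to image space (placeholder)

num_steps = 4                              # number of denoising steps == stream batch size
timesteps = torch.arange(num_steps)        # each slot sits at a different timestep
batch = torch.zeros(num_steps, 4, 64, 64)  # rolling batch of in-flight latents

def stream_batch_step(new_frame: torch.Tensor) -> torch.Tensor:
    """Advance every in-flight latent by one step with a single batched call."""
    global batch
    batch[0] = encode(new_frame)                 # newest input enters at the noisiest slot
    batch = unet_step(batch, timesteps)          # one batched U-Net call instead of num_steps calls
    finished = batch[-1].clone()                 # oldest latent is now fully denoised
    batch = torch.roll(batch, shifts=1, dims=0)  # shift every latent one step deeper
    return decode(finished)

for _ in range(8):  # after warm-up, one finished image comes out per incoming frame
    image = stream_batch_step(torch.rand(4, 64, 64))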
Within each batch, the number of denoising-related computations is further reduced using Residual Classifier-Free Guidance (RCFG), a novel approach that addresses the limitations of Classifier-Free Guidance (CFG) and allows generating high-quality images with fewer computations.
Classifier-Free Guidance? - If you've played around with diffusion models, you might have come across this parameter, which determines how closely the model sticks to your prompt. In classifier-guided diffusion, a separately trained classifier model steers the denoising process towards the conditioning information (class labels or text prompts). Classifier-free guidance (CFG) removes that external classifier: a single model produces both a conditional and an unconditional noise prediction, and the two are combined to control how strongly the output follows the prompt.
CFG requires running an additional U-Net inference for each input latent variable to compute the negative conditioning residual noise responsible for the mean-reversion process during denoising, making it scale inefficiently. RCFG addresses this issue by explicitly mapping the latent input to a noise distribution and recursively generating the next step's noise using previous residual noise. This reduces the computational cost by requiring a limited number of U-Net computations, generally equal to the number of denoising steps.
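To see where the savings come from, here is a hedged and heavily simplified sketch contrasting the per-step cost of standard CFG with the idea behind RCFG. The unet function is a placeholder, and the residual update in rcfg_step is a rough stand-in for the paper's recursive formulation rather than its exact equations.

import torch

def unet(latents: torch.Tensor, text_embedding: torch.Tensor) -> torch.Tensor:
    return latents - 0.1 * text_embedding.mean()  # placeholder noise prediction

guidance_scale = 7.5

# Standard CFG: two U-Net evaluations per denoising step
def cfg_step(latents, cond_emb, uncond_emb):
    noise_cond = unet(latents, cond_emb)
    noise_uncond = unet(latents, uncond_emb)  # the extra call CFG pays for at every step
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# RCFG (simplified): reuse a virtual residual noise instead of a second U-Net call
def rcfg_step(latents, cond_emb, prev_residual):
    noise_cond = unet(latents, cond_emb)  # the only U-Net call this step
    guided = prev_residual + guidance_scale * (noise_cond - prev_residual)
    new_residual = noise_cond  # crude stand-in for the paper's recursive noise update
    return guided, new_residual

latents = torch.randn(1, 4, 64, 64)
cond_emb, uncond_emb = torch.randn(77, 768), torch.zeros(77, 768)
residual = unet(latents, uncond_emb)  # computed once, then reused across steps
for _ in range(4):
    guided, residual = rcfg_step(latents, cond_emb, residual)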
Using an Input-output Queue, non-neural network operations like image resizing, tensor conversion, and normalisation are also parallelized, allowing the system to handle the difference in processing frequencies between human input and model throughput. The system retrieves the most recently processed tensor at each frame and forwards it to the VAE encoder, triggering the image generation sequence. Output tensors from the VAE decoder are fed into an output queue, and subsequent post-processing steps occur before the fully processed image data is delivered.
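As a rough illustration of how queues decouple the two sides, here is a hedged sketch built on plain Python queues and threads, not on the library's own code: input frames arrive at their own pace and the model always consumes the most recent tensor available.

import queue
import threading
import time

input_queue: queue.Queue = queue.Queue(maxsize=4)
output_queue: queue.Queue = queue.Queue()

def preprocess(frame):
    return frame  # placeholder: resizing, tensor conversion, normalisation

def diffusion_step(tensor):
    time.sleep(0.05)  # placeholder for the VAE encode -> U-Net -> VAE decode pass
    return tensor

def producer():
    """Human/camera side: push frames whenever they arrive."""
    for frame in range(20):
        try:
            input_queue.put_nowait(preprocess(frame))
        except queue.Full:
            pass  # drop frames the model cannot keep up with
        time.sleep(0.01)

def consumer():
    """Model side: always work on the newest tensor in the queue."""
    while True:
        try:
            latest = input_queue.get(timeout=1.0)
        except queue.Empty:
            break  # no more input, stop the stream
        while not input_queue.empty():
            latest = input_queue.get_nowait()  # skip stale frames, keep the newest
        output_queue.put(diffusion_step(latest))

p, c = threading.Thread(target=producer), threading.Thread(target=consumer)
p.start(); c.start(); p.join(); c.join()
print(f"processed {output_queue.qsize()} of 20 input frames")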
Sometimes, minimal changes occur between images. To handle such scenarios efficiently, a Stochastic Similarity Filter (SSF) is used to conserve computational resources in instances of high inter-frame similarity. This technique computes the similarity between the current input image and a reference frame and uses it to determine the probability of skipping subsequent processing steps, such as VAE encoding, U-Net inference, and VAE decoding. The skipping probability is governed by a threshold value, and whenever an input image does get processed, it is saved as the new reference image.
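Here is a rough sketch of that logic. The similarity metric (cosine) and the probability mapping below are chosen for illustration and should not be read as the paper's exact formulation.

import torch

threshold = 0.98  # hypothetical similarity threshold
reference = None  # last frame that went through the full pipeline

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()

def maybe_generate(frame: torch.Tensor, run_pipeline):
    """Skip the expensive VAE/U-Net work with a probability tied to inter-frame similarity."""
    global reference
    if reference is not None:
        similarity = cosine_similarity(frame, reference)
        # The closer the new frame is to the reference, the more likely we skip it
        skip_probability = max(0.0, (similarity - threshold) / (1.0 - threshold))
        if torch.rand(1).item() < skip_probability:
            return None  # caller reuses the previous output
    reference = frame  # a processed frame becomes the new reference
    return run_pipeline(frame)

frames = [torch.rand(3, 512, 512) for _ in range(5)]
outputs = [maybe_generate(f, run_pipeline=lambda x: x) for f in frames]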
Inference speed is further improved by introducing several pre-computations. Specifically, the prompt embedding used for conditional denoising is pre-computed and cached so it can be quickly accessed to compute the Key and Value pairs within the U-Net for each frame. These KV pairs are also stored for reuse and only updated when the input prompt changes. The noise and noise coefficients for each denoising step are also pre-sampled and cached, so that each timestep always sees the same noise across frames even though different denoising steps use distinct noise.
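A hedged sketch of the caching idea, using hypothetical names (encode_prompt, PrecomputedState) rather than StreamDiffusion's own API:

import torch

class PrecomputedState:
    """Hypothetical cache of everything that stays constant between frames."""

    def __init__(self, encode_prompt, prompt: str, num_steps: int, latent_shape):
        # The prompt embedding is computed once and reused for every frame;
        # the U-Net's cross-attention Key/Value pairs derive from it and only
        # need refreshing when the prompt changes.
        self.prompt_embedding = encode_prompt(prompt)
        # Noise and noise coefficients are pre-sampled per denoising step, so
        # the same timestep always sees the same noise across frames.
        self.step_noise = [torch.randn(latent_shape) for _ in range(num_steps)]

    def update_prompt(self, encode_prompt, prompt: str):
        self.prompt_embedding = encode_prompt(prompt)  # only re-encode on change

# Illustrative usage with a dummy text encoder
state = PrecomputedState(lambda p: torch.randn(77, 768), "An astronaut riding a unicorn",
                         num_steps=2, latent_shape=(1, 4, 64, 64))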
Finally, Compute Optimizations are introduced through the use of TensorRT to optimise the performance of the U-Net and VAE engines. TensorRT performs various optimizations such as layer fusion, precision calibration, and kernel autotuning to increase the throughput and efficiency of deep learning applications. A Tiny AutoEncoder is also used: a streamlined and efficient version of the traditional Stable Diffusion AutoEncoder that rapidly converts latent inputs into full-size images and performs decoding with significantly reduced computational demands.
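The TensorRT integration is handled by the project itself, so there is little to wire up by hand; as a rough stand-in illustrating the same idea of ahead-of-time graph optimisation, the hedged sketch below compiles the U-Net and the VAE decoder with torch.compile. This is an assumption about what a comparable setup could look like, not the project's exact API.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/sd-turbo").to(
    device=torch.device("cuda"), dtype=torch.float16
)

# Ahead-of-time compilation fuses layers and autotunes kernels (the role TensorRT
# plays in StreamDiffusion), so each frame only pays for execution.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead")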
Check out the repo to get started on using it.
The Lab
Switch smart, predict fast - Neural network deployment on embedded devices is challenging given the network's high computing demands. Typical approaches to facilitate inference on such devices include partitioning the network across several devices for parallel computing or offloading to a cloud server with more resources. While effective, these approaches involve either lower accuracy or increased latency. AgileNN makes targeted use of both techniques to achieve efficient inference on extremely weak devices with minimal latency. It uses explainable AI to compress and offload only the less important features, without expensive NN computations, while important features are retained on the local device and processed by a lightweight NN. Feature importance is derived by analysing the skewness of the distribution of features contributing to NN inference, and the predictions of the local and remote NNs are combined for the final output. AgileNN reduces NN inference latency by up to 6x and local energy consumption by more than 8x, while consuming 1.2x less memory and 5x less storage space. Importantly, because AgileNN works directly on the features of the input data, it adapts its processing to the input distribution, making it highly adaptable to any pipeline.
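As a purely conceptual sketch of importance-based split inference, here is what the idea could look like in code; the importance scoring and the compression below are illustrative placeholders, not AgileNN's actual method.

import torch
import torch.nn as nn

num_features, num_classes, k_local = 64, 10, 8

local_nn = nn.Linear(k_local, num_classes)                   # lightweight on-device head
remote_nn = nn.Linear(num_features - k_local, num_classes)   # larger server-side model

def importance_scores(features: torch.Tensor) -> torch.Tensor:
    # AgileNN learns a skewed importance distribution with explainable-AI tools;
    # feature magnitude is used here only as an illustrative proxy.
    return features.abs()

def split_inference(features: torch.Tensor) -> torch.Tensor:
    order = torch.argsort(importance_scores(features), descending=True)
    important, rest = features[order[:k_local]], features[order[k_local:]]
    compressed = torch.quantize_per_tensor(rest, scale=0.1, zero_point=0, dtype=torch.quint8)
    local_logits = local_nn(important)                  # stays on the weak device
    remote_logits = remote_nn(compressed.dequantize())  # offloaded after cheap compression
    return local_logits + remote_logits                 # combine both predictions

prediction = split_inference(torch.randn(num_features)).argmax()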
Not just a GPU's world - Deploying Transformer-based LLMs at scale poses several challenges (memory, latency, computational, energy, etc.), and much of the research has focused on GPU inference despite GPUs not being well suited to latency-sensitive workloads. Conversely, recent efforts to deploy on specialised hardware have widely targeted FPGAs, specifically spatial architectures with the potential to surpass GPU efficiency in small-batch, low-latency scenarios through model-specific optimizations. Implementing a spatial architecture for LLM inference faces several challenges, among which is the lack of standard LLM building blocks in hardware accelerators. This paper proposes LLM-specific kernel implementations of linear and non-linear operators for efficient computing on FPGAs, and builds on these contributions to propose a high-performance hardware accelerator for LLM inference. Using their system, the authors achieve speedups of 16.1x, 2.2x, and 1.1x for BERT and GPT generative inference stages compared to prior FPGA and GPU-based accelerators, as well as a 1.9x speed-up and 5.7x better energy efficiency than an A100 GPU in the decode stage. The research provides an avenue for exploring non-GPU-based transformer acceleration, potentially unlocking greater performance at a smaller cost.
More cutting-edge research:
ElasticTrainer: Speeding Up On-Device Training with Runtime Elastic Tensor Selection: This paper introduces ElasticTrainer, a new technique that enables fully elastic runtime adaptation of on-device training, achieving up to 3.5x more training speedup and 2x-3x energy reduction without noticeable accuracy loss, by dynamically selecting the trainable NN portion based on current training needs.
High Efficiency Inference Accelerating Algorithm for NOMA-based Mobile Edge Computing: This paper proposes a novel model offloading algorithm to accelerate edge inference across multiple devices in a network system. The algorithm considers both energy consumption and inference latency of the device and edge server to determine the optimal model split, sub-channel, and transmission power allocation strategies that achieve the best trade-off between latency and efficient energy usage.
The Pulse
Not NVIDIA's CUDA - Moore Threads, a company founded by a former NVIDIA executive and now China's fastest-growing domestic GPU supplier, has released a new single-GPU accelerator (MTT S4000) compatible with the CUDA framework through the company's self-developed translation tool. Designed for high memory capacity, it is suitable for large language models and can handle up to 96 video decoding streams at 1080p resolution. With its support of the CUDA ecosystem, the card allows for zero-cost migration of CUDA code to their platform, potentially making it a direct alternative to NVIDIA's own accelerators.
Ferrets eat apples too - Apple Inc. and Cornell University researchers recently released Ferret, an open-source multimodal large language model that can analyse image regions, detect and outline useful elements of the scene, and use them to answer queries about the image. Beyond the technical viability of the model, this news further cements Apple's willingness to use open source as an asset when competing with other giants in the AI space, as it already did recently with its own ML framework, which we covered in a previous edition.
Supercompeting - The EU has announced a plan to support homegrown competitive AI startups by providing them with access to supercomputers for model training. The program is expected to be fully implemented by 2024 and counts France's Mistral AI as an early beneficiary. It also includes several centres to support the development of dedicated AI algorithms and provide guidance on using the provided resources. While the program's added value to participants remains to be seen, it demonstrates the union's desire to make supercomputing an integral part of its AI strategy, with possible consequences for the state of European AI research and industry trends.
And that’s all for this edition, we hope you enjoyed reading through!
The Unify Dev Team.