While Gemini is making headlines with its controversial demo and improved performance over GPT-4, both industry and academia are doubling down on their efforts to find the next frontier development in AI, and it’s as exciting as ever!
As usual, before you dive in, check out our latest blog post, this time on The Database: our fully open-source effort to document the deployment space and your new one-stop shop for keeping up to date with the latest deployment tools, libraries and platforms. Read more through the link below!
So, do you still redraw the whole thing each time?
It would be an understatement to say that AI image generation has witnessed explosive growth in the past few years, mostly fueled by the advent of diffusion-based generation, which iteratively builds coherent visual content starting from noise. While diffusion models have proven effective, the process remains inherently slow: it requires multiple denoising steps performed sequentially, which limits parallelization options. DeepCache is a novel technique that aims to address this inefficiency by dynamically compressing diffusion models at runtime, without the need for retraining.
Why would you care? DeepCache boasts a hefty 2.3x speed-up on Stable Diffusion v1.5 and 4.1x on LDM-4-G, both with minimal quality loss, outperforming existing pruning and distillation techniques that require retraining. Best of all, using it is straightforward. (Did I mention it’s also available as a ComfyUI plug-in?)
import torch
from DeepCache import StableDiffusionPipeline

# Drop-in replacement for diffusers' StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16
).to("cuda:0")

prompt = "a photo of an astronaut on a moon"
deepcache_output = pipe(
    prompt,
    # cache_interval: run the full U-Net every 5 steps and reuse cached features in between
    # cache_layer_id / cache_block_id: which U-Net layer/block the features are cached at
    cache_interval=5, cache_layer_id=0, cache_block_id=0,
    uniform=True,  # pow=1.4, center=15,  # only for uniform=False
    output_type='pt', return_dict=True
).images
How does it work? The approach is built on a fundamental observation: the high-level features extracted by the denoiser at successive denoising steps display strong temporal consistency. In simpler terms, each subsequent step contributes little to the previously extracted features. Think of it like drawing: because you incrementally add to your sketch, it looks similar from one step to the next, but increasingly drifts from that appearance as time passes (we deliberately avoid implying any notion of improvement there, in memory of all the stick drawings that started with good intentions). Because these high-level features change so slowly, they can be cached and reused, avoiding the systematic regeneration of near-identical feature maps and, consequently, reducing resource consumption.
On a technical level, DeepCache leverages the consecutive downsampling-upsampling structure of the U-Net used as the denoiser. Essentially, U-Net is an architecture (originally designed for image segmentation) made of a contracting path that learns increasingly abstract, high-level features, and an expanding path that learns to localise them back in the image. At each step of the expanding path, a convolution is applied to the concatenation of (a) the output of the mirroring layer from the contracting path (the skip connection), and (b) the upsampled output of the previous layer of the expanding path. Exploiting this structure, the high-level features flowing through the deep layers of the U-Net can be cached, while only the low-level features from the shallow layers and their skip connections are recomputed at each subsequent denoising step.
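To make the mechanism concrete, here is a minimal PyTorch sketch of the caching idea (a toy network written for illustration, not the actual DeepCache implementation or API): on "full" steps the whole U-Net runs and the deep, high-level feature is stored; on "cached" steps only the shallow layers and the skip connection are recomputed and the stored feature is reused.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.down_shallow = nn.Conv2d(3, ch, 3, padding=1)          # low-level features (cheap)
        self.down_deep = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # high-level features (expensive)
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)
        self.up_deep = nn.ConvTranspose2d(ch, ch, 2, stride=2)      # deep part of the expanding path
        self.up_shallow = nn.Conv2d(2 * ch, 3, 3, padding=1)        # consumes the skip connection

    def forward(self, x, cached_deep=None):
        skip = self.down_shallow(x)                    # always recomputed
        if cached_deep is None:                        # full step: run the deep branch
            deep = self.up_deep(self.mid(self.down_deep(skip)))
        else:                                          # cached step: reuse high-level features
            deep = cached_deep
        out = self.up_shallow(torch.cat([skip, deep], dim=1))
        return out, deep

unet = TinyUNet()
x = torch.randn(1, 3, 64, 64)        # stand-in for the noisy latent
cache, cache_interval = None, 5
for step in range(50):               # mimics an iterative denoising loop
    full_step = step % cache_interval == 0            # uniform caching schedule
    x, deep = unet(x, cached_deep=None if full_step else cache)
    if full_step:
        cache = deep                                    # refresh the cached high-level feature
print(x.shape)                        # torch.Size([1, 3, 64, 64])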
Further, because the temporal consistency mentioned above doesn’t apply strictly to successive steps but generally holds N steps further (albeit with decreasing consistency the further you go), cached features can be reused across multiple steps to speed up generation even more. Importantly, because DeepCache is complementary to existing techniques, it has the potential to further accelerate current SOTA models such as SDXL Turbo, which already manages to generate images in only a few steps!
Check out the repo to get started with it.
The Lab
Choose what you remember and do it well. From the authors of the Flash Attention paper, Mamba is a new architecture boasting 5x higher inference throughput than Transformers of the same size, and better language-modelling performance than Transformers twice its size! Mamba addresses Transformers’ computational inefficiency on long sequences by:
Introducing selectivity into structured state space models (SSMs, which scale well with sequence length but struggle with information-dense data like text) by making their parameters input-dependent. This allows the model to filter out irrelevant information and tackle the typical trade-off between computational efficiency (smaller state) and context capacity (larger state); a toy sketch of this selective recurrence is shown below.
Using a hardware-aware algorithm that leverages properties of modern accelerators, materialising the state only in the more efficient levels of the memory hierarchy and applying kernel fusion to reduce the number of memory IOs, leading to significant speed-ups.
Simplifying the model architecture by combining elements from typical SSM blocks with the MLP blocks of modern neural networks.
With better scaling, hardware-aware efficiency and a simpler structure, Mamba promises to further accelerate training and inference and to enable applications currently restricted by compute requirements.
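To ground the selectivity point, here is a deliberately naive PyTorch sketch of a selective state-space recurrence (illustrative only: the real Mamba block uses a different parameterisation and a hardware-aware parallel scan rather than this Python loop). The key idea shown is that the transition, input and output parameters are all computed from the current token, so the model can choose what to write into, and read from, its fixed-size state.

import torch
import torch.nn as nn

class ToySelectiveSSM(nn.Module):
    def __init__(self, d_model=32, d_state=16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_model, d_state))  # fixed negative "decay" matrix
        self.to_delta = nn.Linear(d_model, d_model)            # input-dependent step size
        self.to_B = nn.Linear(d_model, d_state)                # input-dependent input matrix
        self.to_C = nn.Linear(d_model, d_state)                # input-dependent output matrix

    def forward(self, x):                                      # x: (batch, length, d_model)
        b, l, d = x.shape
        h = x.new_zeros(b, d, self.A.shape[1])                 # hidden state of constant size
        ys = []
        for t in range(l):
            xt = x[:, t]                                                 # (b, d)
            delta = torch.nn.functional.softplus(self.to_delta(xt))     # positive step sizes
            A_bar = torch.exp(delta.unsqueeze(-1) * self.A)             # how much to remember
            B_bar = delta.unsqueeze(-1) * self.to_B(xt).unsqueeze(1)    # how much to write
            h = A_bar * h + B_bar * xt.unsqueeze(-1)                    # selective state update
            y = (h * self.to_C(xt).unsqueeze(1)).sum(-1)                # read out the state
            ys.append(y)
        return torch.stack(ys, dim=1)                          # (b, l, d_model)

ssm = ToySelectiveSSM()
out = ssm(torch.randn(2, 10, 32))   # memory use is constant in sequence length
print(out.shape)                    # torch.Size([2, 10, 32])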
Multi-In Multi-Out. Large neural networks keep reaching SOTA performance but come with a computational cost that grows with the number of inputs to process. To reduce the per-input cost, researchers from IBM Research and ETH Zurich propose Multiple-Input-Multiple-Output Neural Networks (MIMONets), which process several inputs concurrently. First, each input is bound to a randomly generated key. The resulting key-value pairs are superimposed and the mixture is passed through the network. Because the bound inputs are quasi-orthogonal (almost linearly independent), each is processed in its own protected subspace (channel) with reasonable approximation, all within a single function call. Once processed, the individual results are retrieved from the superimposed output with minimal added noise, and the network’s outputs are matched to the corresponding inputs via their keys. Building on MIMONets, MIMOConv and MIMOFormer respectively achieve 2-4x speed-ups over WideResNet CNNs and handle 2-4 inputs at once in a Transformer, both with minimal accuracy loss. Further, MIMONets mitigate the noise introduced by superimposing more inputs by allowing dynamic selection of how many channels to superimpose, making the throughput-accuracy trade-off configurable to the user’s target accuracy.
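Here is a small numerical sketch of the binding-and-superposition trick (a simplified vector-symbolic variant for illustration, using elementwise binding with random sign keys rather than the paper’s exact scheme, and with no network in the middle): several inputs are each tied to a random key, collapsed into one vector, and then approximately recovered with their keys, with the other channels showing up only as quasi-orthogonal noise.

import torch

torch.manual_seed(0)
d, n_inputs = 4096, 4                        # high dimension keeps the keys quasi-orthogonal
inputs = torch.randn(n_inputs, d)
keys = torch.sign(torch.randn(n_inputs, d))  # one random +/-1 key per channel

# Bind each input to its key, then superimpose everything into a single vector
superposition = (keys * inputs).sum(dim=0)   # (d,): one combined input, one forward pass

# Unbinding with a key recovers its own input plus crosstalk from the other channels
for i in range(n_inputs):
    recovered = keys[i] * superposition
    own = torch.nn.functional.cosine_similarity(recovered, inputs[i], dim=0)
    other = torch.nn.functional.cosine_similarity(recovered, inputs[(i + 1) % n_inputs], dim=0)
    print(f"channel {i}: similarity to own input {own:.2f}, to another input {other:.2f}")

# Own-input similarity concentrates around 1/sqrt(n_inputs) (~0.5 here), while similarity
# to the other inputs stays near 0: each channel effectively lives in its own protected subspace.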
The Pulse
Back’n’buying. Not long after Sam Altman’s return to office, OpenAI agreed to buy chips from a startup backed by its CEO. While the deal has drawn scrutiny given the identity of some of the startup’s investors, OpenAI’s investment strongly signals its willingness to secure the supply necessary for exploring new development avenues. Rain, the startup at the other end of this contract, is working on neuromorphic processing units (NPUs) that use physical artificial neurons to perform computations, inspired by the structure and function of the human brain.
Knock Knock. Who’s there? Competition! AMD recently put an end to the speculation regarding the relative performance of the MI300X hardware it announced a while back. During its keynote last week, AMD showcased the MI300X as the new fastest accelerator on the market, beating NVIDIA’s H100 by 10-20% on inference while being on par in training time. While 2024 is expected to bring more competing hardware, including NVIDIA’s H200, this remains a strong reminder that innovation can shake even markets where, as has been the case so far, supply chain dynamics create lasting imbalances.
And that’s all for this edition. We hope you enjoyed reading through!
The Unify Dev Team