What would you do with your own AGI? It might seem like a question you’d hear coming out of a movie theatre, but that future may not be so sci-fi after all, at least that’s how Meta’s CEO makes it look after recently announcing a significant investment in artificial general intelligence (AGI) that may end up being open sourced. You can bet everyone’s freaking out at the prospect of such tech being released to the public, but if we’re soon getting our own Jarvis, we might as well welcome the perks.
This week we’re resuming our blog post series on model compression, focusing on quantization. If you haven’t read the overview post yet, we highly recommend giving it a quick read before diving into this week’s!
Departments of Experts
Mixture of Experts (MoE) has emerged as a powerful and adaptable paradigm, particularly for scaling language modelling capabilities, but despite its computational efficiency, the classical routing procedure of MoE remains suboptimal. ExFlow provides a systematic approach to optimise expert routing and further accelerate MoE-based models.
Why would you care? ExFlow can be applied to pre-trained models to achieve up to 67% reduction in cross-GPU routing latency and an overall throughput improvement of up to 120% over existing MoE optimisations, without compromising on model accuracy. This makes ExFlow particularly useful for deploying larger-scale MoE models in resource-constrained set-ups, simply by optimising how experts are allocated.
How does it work? At a high level, MoEs operate by distributing tasks among specialised experts within a broader model architecture, dynamically routing inputs to the most suitable expert based on the input context. Because accommodating all experts on a single GPU is generally not an option due to the large feed-forward network (FFN) required for each expert, the approach typically relies on parallelism strategies to alleviate memory requirements. In practice, this means spreading the experts across multiple GPUs and, when running inference, having each GPU scatter its inputs to experts on other GPUs before gathering the results back after computation. These all-to-all operations (where each GPU sends data to and receives data from every other GPU) introduce significant overhead.
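To make that overhead a bit more concrete, here is a minimal, single-process sketch (our own toy code, not ExFlow's) of top-2 gating in an MoE layer: every (token, expert) pair that lands on a different GPU than the token itself would have to travel through the all-to-all. All the names and sizes (`num_experts`, `experts_per_gpu`, the round-robin token placement) are illustrative assumptions.

```python
# Toy sketch of top-k MoE gating and the cross-GPU traffic it implies.
# Single process: "GPUs" are just index ranges over the expert list.
import torch

num_tokens, hidden_dim = 16, 64
num_experts, top_k = 8, 2
num_gpus = 4
experts_per_gpu = num_experts // num_gpus   # experts 0-1 on GPU 0, 2-3 on GPU 1, ...

tokens = torch.randn(num_tokens, hidden_dim)
router = torch.nn.Linear(hidden_dim, num_experts)

# Router scores -> pick the top-k experts per token.
logits = router(tokens)                                  # (tokens, experts)
topk_scores, topk_experts = logits.topk(top_k, dim=-1)   # (tokens, k)

# Assume each token "lives" on the GPU that holds its attention shard.
token_gpu = torch.arange(num_tokens) % num_gpus          # round-robin placement
expert_gpu = topk_experts // experts_per_gpu             # GPU owning each chosen expert

# Every (token, expert) pair on different GPUs is one all-to-all transfer.
cross_gpu = (expert_gpu != token_gpu.unsqueeze(-1)).float().mean()
print(f"fraction of expert calls crossing GPUs: {cross_gpu.item():.2f}")
```

With random routing and 4 GPUs, roughly three quarters of the expert calls cross a GPU boundary, which is exactly the traffic ExFlow tries to shrink by placing experts more cleverly.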
ExFlow exploits insights drawn from analysing the expert selection patterns of a typical MoE. The experimental analysis reveals that expert selection is not random and that routing decisions in earlier layers largely affect routing decisions in subsequent layers. ExFlow therefore accelerates a given model by quantifying this inter-layer expert affinity and optimising expert placement to minimise routing overhead. The design is based on:
Ensuring Token Coherence: When making predictions, LLMs iteratively append generated tokens to the input tokens to gradually build context for subsequent generations. Because each GPU in an MoE holds a different token context, outputs normally need to be gathered back onto the GPU that owns each input so the final attention operation stays consistent with that input’s initial context. Ensuring token coherence (where all GPUs share the same contexts) removes this output-gathering constraint, so the final operations can be performed on any GPU regardless of where the input originated. To achieve this, ExFlow applies an AllGather operation (where each GPU broadcasts its context to the other GPUs) both at the start and end of an inference step, ensuring that context remains coherent across GPUs at every step.
Modelling Expert Affinity: Based on a sample of test tokens, a heatmap of expert selection across consecutive layers of a target pre-trained model is built by calculating the conditional probability of an expert being selected given (a) the expert chosen in the previous layer and (b) the context tokens processed so far. The optimisation goal is then to identify, for each expert, the next-layer experts with which it has the highest affinity, so that closely affiliated experts can be strategically placed on the same GPU to minimise inter-GPU routing. The approach extends to larger GPU clusters that use NVLINK for intra-node communication, keeping affiliated experts on GPUs within the same node, where communication is faster.
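For intuition, here is a rough sketch of how such an affinity table could be estimated from routing traces. This is our own illustration (not ExFlow's released code), and for simplicity it only conditions on the previous layer's expert choice, ignoring the context-token conditioning described above.

```python
# Illustrative estimate of layer-to-layer expert affinity from routing traces.
# traces[l][t] = expert chosen for token t at layer l (toy random data below).
import numpy as np

num_layers, num_tokens, num_experts = 4, 1000, 8
rng = np.random.default_rng(0)
traces = rng.integers(0, num_experts, size=(num_layers, num_tokens))

def affinity(prev_choices, next_choices, num_experts):
    """P(expert j at layer l+1 | expert i at layer l), estimated by counting."""
    counts = np.zeros((num_experts, num_experts))
    np.add.at(counts, (prev_choices, next_choices), 1)
    return counts / counts.sum(axis=1, keepdims=True).clip(min=1)

# One affinity heatmap per pair of consecutive layers.
heatmaps = [affinity(traces[l], traces[l + 1], num_experts)
            for l in range(num_layers - 1)]

# A simple (greedy) placement hint: co-locate each expert with the next-layer
# expert it most often hands tokens to.
best_partner = heatmaps[0].argmax(axis=1)
print(best_partner)
```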
While not available yet, the code will be made public in this repository.
The Lab
All at once - Supervised finetuning (SFT) has shown great potential in improving the zero-shot performance of LLMs, but scaling up instruction datasets can be expensive and time-consuming, especially with manual annotation. To improve annotation efficiency, active learning techniques incrementally add batches of samples to the labelled pool by training a model on the currently labelled data and selecting new batches based on the model's measure of informativeness. While effective, the process remains costly when working with parameter-heavy LLMs. Researchers from several universities as well as Microsoft explored the use of "experimental design" for SFT: the method selects the optimal set of instructions to annotate in a single step, based solely on the initial model, and allows for significant gains in labelling efficiency with little overhead. With experimental design, the learner is given a set of initial prompts, and a selection strategy chooses a subset of them. Uncertainty-based selection picks the prompts the model is least sure about (as measured by, e.g., mean entropy or least confidence), while k-Center selection annotates prompts that are diverse in the representation space. Well-written responses to the selected prompts are then collected from annotators and used to fine-tune the model. The results show that this approach can achieve annotation cost savings of up to 50% compared to random sampling while maintaining the same level of performance.
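As a toy rendering of the two uncertainty measures mentioned above (mean token entropy and least confidence), the snippet below scores candidate prompts from per-token predictive distributions of an initial model; the shapes and scoring loop are our own illustration, not the paper's code.

```python
# Toy sketch of uncertainty-based prompt selection for SFT annotation.
# probs has shape (num_prompts, seq_len, vocab): per-token predictive
# distributions produced by the initial model on each candidate prompt.
import torch

num_prompts, seq_len, vocab = 100, 32, 500
probs = torch.softmax(torch.randn(num_prompts, seq_len, vocab), dim=-1)

# Mean entropy: average token-level entropy over the sequence.
mean_entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean(-1)

# Least confidence: 1 - probability of the most likely token, averaged.
least_confidence = (1 - probs.max(dim=-1).values).mean(-1)

# Send the prompts the model is least sure about to the annotators.
budget = 10
to_annotate = mean_entropy.topk(budget).indices
print(to_annotate)
```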
Give it a spin - By introducing uncertainty estimates, Bayesian Neural Networks (BayNNs) can mitigate the prediction overconfidence issue of typical neural networks. However, current hardware architectures have memory bottlenecks that limit the use of BayNNs in practical high-demand applications. Computing-in-Memory (CIM) architectures have emerged as a promising solution to this problem, as they perform the multiply-and-accumulate operations where the data already resides, thus eliminating memory bottlenecks. In particular, spintronics-based technologies (which manipulate the intrinsic spin of electrons in materials) can be integrated into CIM architectures to achieve low power consumption and high computational efficiency. However, manufacturing variations and defects, as well as the stochastic behaviour of spintronic memories, pose significant challenges to implementing BayNNs on CIM architectures. The NeuSPIN initiative aims to overcome these challenges by developing a reliable neuromorphic framework for energy-efficient AI processing. As part of the initiative, several alternative networks are explored that aim to improve the efficiency of BayNNs through the lens of spintronics. One set of projects aims to integrate probability-based dropout modules (leveraging the stochastic properties of spintronic devices) to reduce energy consumption while maintaining predictive performance and uncertainty estimates, while another set explores the application of variational inference (typically used in machine learning to approximate intractable inference in complex models) to improve inference precision and the accuracy of uncertainty representation. Besides consistent improvements across various metrics (inference accuracy, out-of-distribution data detection, etc.), the proposed implementations achieve significant reductions in energy consumption (up to 100x) and memory overhead (158.7x). This opens an avenue for the practical use of Bayesian neural networks in real-world and safety-critical applications.
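To see what dropout-based Bayesian inference buys you, here is a plain-software analogue (Monte Carlo dropout) of the recipe the spintronic dropout modules implement in hardware; the tiny model and sampling loop are our own sketch, not NeuSPIN's framework.

```python
# Software analogue of dropout-based Bayesian inference: NeuSPIN realises the
# random dropout masks with stochastic spintronic devices, but the
# uncertainty-estimation recipe itself looks like MC dropout.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 3),
)

def predict_with_uncertainty(model, x, n_samples=20):
    """Keep dropout active at inference and average over stochastic passes."""
    model.train()  # leaves Dropout enabled
    with torch.no_grad():
        samples = torch.stack([torch.softmax(model(x), dim=-1)
                               for _ in range(n_samples)])
    return samples.mean(0), samples.std(0)   # predictive mean and spread

x = torch.randn(4, 16)
mean, std = predict_with_uncertainty(model, x)
print(mean.argmax(-1), std.max(-1).values)   # predictions plus a rough uncertainty
```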
More cutting-edge research:
UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer - Conventional channel-wise pruning methods face challenges when applied to depth-wise convolutional layers, as they result in sparse computation and fewer parameters. To address these issues, reparameterization methods have been proposed to reduce model depth, but these methods may compromise the integrity of baseline model weights, cannot prune models containing certain layers such as LayerNorm or GroupNorm, and cannot be applied to vision transformer models. Researchers from AMD introduced UPDP, a novel depth-pruning approach that aims to overcome these limitations. UPDP begins by modifying the model’s blocks of layers, introducing new layers into the architecture to enforce the conditions necessary for reparameterization. It then uses a search algorithm to find the optimal mix of blocks to prune. Finally, the optimal subnet is trained using a progressive training strategy and, post-training, reparameterization is used to merge the layers introduced during the block transformation step, making the subnet shallower.
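If "reparameterization" is unfamiliar, the textbook example is folding a BatchNorm into the preceding convolution so the pair collapses into a single, numerically equivalent layer. The snippet below shows only that classic case for intuition; it is not UPDP's specific block transformation.

```python
# Classic reparameterization example: fold a BatchNorm into the preceding conv
# so two layers become one (illustrative only; not UPDP's exact procedure).
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)      # per output channel
    fused.weight.data = conv.weight.data * scale[:, None, None, None]
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
conv.eval(); bn.eval()
x = torch.randn(1, 3, 16, 16)
fused = fuse_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))   # same output, one layer fewer
```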
Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models - This paper proposes a fast and memory-efficient framework for large language model (LLM) finetuning called Quantized Side Tuning (QST). QST combines parameter-efficient finetuning (PEFT) and quantization to reduce the computational requirements of LLM finetuning. Unlike PEFT methods such as QLoRA that require caching intermediate activations (which can be large when finetuning with large batch sizes), QST introduces a side network separate from the quantized LLM, avoiding backward propagation through the quantized LLM and thereby saving the memory footprint of intermediate activations. QST also uses low-rank adapter methods to reduce the number of trainable parameters and the memory footprint of the optimizer states. During training, the input to each layer of the side network is formed by combining the downsampled output of the corresponding quantized LLM layer with the output of the previous side-network layer. A learnable parameter is then used to aggregate the hidden states of the quantized LLM and the side network, and the LLM head or classifier is reused for predictions. QST manages to reduce memory footprint by up to 2.3x and speed up fine-tuning by 3x while achieving competitive performance compared to SOTA methods.
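A rough sketch of that combination rule, with hypothetical module names (`downsample`, `side_layer`, `alpha` and `up_proj` are our own shorthand, not the authors' code): the quantized backbone is kept frozen, so gradients only ever flow through the small side network.

```python
# Sketch of a QST-style side-network step (hypothetical names): the quantized
# LLM is frozen, and each side layer mixes the downsampled LLM activation with
# the previous side activation; a learnable gate aggregates the two streams.
import torch
import torch.nn as nn

hidden, side_hidden = 1024, 128

downsample = nn.Linear(hidden, side_hidden)          # trainable
side_layer = nn.Linear(side_hidden, side_hidden)     # trainable (low-rank in QST)
up_proj = nn.Linear(side_hidden, hidden)             # back to the LLM width
alpha = nn.Parameter(torch.tensor(0.5))              # learnable mixing gate

def side_step(llm_hidden, prev_side):
    """One side-network layer: no backward pass through the frozen LLM."""
    llm_hidden = llm_hidden.detach()                 # frozen, quantized backbone
    return side_layer(downsample(llm_hidden) + prev_side)

def combine(llm_hidden, side_state):
    """Gate between side-network and LLM hidden states before the reused head."""
    return alpha * up_proj(side_state) + (1 - alpha) * llm_hidden

llm_h = torch.randn(2, hidden)                       # stand-in for a quantized LLM layer output
out = combine(llm_h, side_step(llm_h, torch.zeros(2, side_hidden)))
print(out.shape)                                     # (2, 1024), fed to the reused LM head
```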
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model - The paper proposes a new vision backbone called Vim, which uses bidirectional Mamba blocks to represent visual data efficiently. Unlike traditional vision transformers that rely on self-attention, Vim uses position embeddings and bidirectional state space models to compress the visual representation, resulting in improved computation and memory efficiency. On various image classification, object detection, and semantic segmentation tasks, Vim outperforms established vision transformers like DeiT while being significantly faster and more memory-efficient. The results suggest that Vim has the potential to become a critical backbone for vision foundation models, particularly for high-resolution images.
The Pulse
Fighting fire with fire - Intel is seeking to patent a system for managing renewable energy in data centres. The system would use AI and machine learning to predict the availability of renewable energy and allocate it to certain processes, workloads, or devices. This would allow data centres to optimise their use of renewable energy and reduce their reliance on non-renewable sources. This patent is significant because data centres are major consumers of energy, and this technology could help reduce their environmental impact. Additionally, the use of AI to manage energy consumption represents a new approach to power regulation and could be a step towards more sustainable energy management in the future.
Less for more - Researchers at Pacific Northwest National Laboratory (PNNL) and Microsoft have used artificial intelligence (AI) to identify a new material that could reduce the amount of lithium used in batteries by up to 70%. The material, a blend of sodium, lithium, yttrium, and chloride ions, was identified from 32 million possible candidates using AI techniques. The team used a combination of AI and physics-based models to filter the materials based on their properties, then synthesised and tested the final candidates. The top-performing candidate was found to be an order of magnitude less conductive than current liquid electrolytes, but the researchers were still able to build a working prototype that powered a lightbulb. This discovery could prove critical as lithium batteries continue to power an ever-growing number of (AI-embedded, power-hungry) devices.
And that’s all for this edition, we hope you enjoyed reading through!
The Unify Dev Team