Tough times for us devs as we watch from the front row how the next generation of human software engineers is learning to hold a spoon to come for our meals. It's also kind of funny how, mere hours later, the go-getters among us who joyfully entertained a career change into manual labor had to recalibrate yet again after getting an early preview of tomorrow’s kitchen bot; it’s as though the big purses are racing to make us obsolete on all fronts.
But you know, we’re not really into doomsaying, and when you take a step back, it’s all still fairly limited tech. It goes without saying that it’ll get better with time, but a place will remain for actual human engineers to solve new kinds of problems using new sets of skills. Until then, if you’re getting worried, remember that the Devin isn’t in the details (yes, all this build-up just for that pun).
A moment's attention
Serving SOTA LLMs is a compute-intensive task. While existing acceleration tools tackle this issue by optimizing for GEMM and GEMV operators, few focus on the self-attention mechanism that lies at the core of transformer architectures. FlashInfer is an open-source library that addresses this by explicitly focusing on improving the performance of self-attention.
Why would you care? - FlashInfer distinguishes itself from existing libraries by providing comprehensive attention kernels that cover all common use cases of LLM serving, making it an efficient Swiss-army knife for kernel optimization.
How does it work? - Self-attention operations are broken down into three stages: the prefill stage, where attention is computed between the KV-Cache and all queries; the decode stage, where the model generates one token at a time and attention is computed between the KV-Cache and a single query; and the append stage, where attention is computed between the KV-Cache and the queries of newly appended tokens, as in speculative decoding.
FlashInfer optimizes each of these stages by implementing single-request and batch versions of FlashAttention, which fuses multi-head attention into a single kernel. Each stage can be optimized as below:
import torch
import flashinfer
kv_len = 2048
num_kv_heads = 32
head_dim = 128
k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
# decode attention
num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)
o = flashinfer.single_decode_with_kv_cache(q, k, v) # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly
# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0) # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True) # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA") # append attention with LLaMA style RoPE on-the-fly, apply causal mask
# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0) # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False) # prefill attention without RoPE on-the-fly, do not apply causal mask
Minimal code example using FlashInfer to optimize attention kernels. Source: Repository
FlashInfer further improves the serving efficiency of KV-Cache compression techniques by optimizing kernels for:
Grouped Query Attention: Which uses a smaller number of heads for keys and values than for queries, saving memory traffic (see the sketch after this list).
Fused-RoPE Attention: Which fuses Rotary Positional Embeddings (RoPE) into attention kernels, applying RoPE on the fly with negligible overhead.
Quantized Attention: Which implements low-precision attention kernels to achieve nearly linear speedup to the compression ratio.
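To make the grouped-query case concrete, here is a minimal sketch that reuses the single-request decode kernel shown above; the head counts are hypothetical, and the query head count is a multiple of the KV head count so the kernel can share each KV head across several query heads.
import torch
import flashinfer
kv_len = 2048
num_kv_heads = 8 # fewer KV heads than query heads -> grouped-query attention
num_qo_heads = 32 # hypothetical counts; query heads are a multiple of KV heads
head_dim = 128
k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
q = torch.randn(num_qo_heads, head_dim).half().to(0)
o = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # GQA decode with fused LLaMA-style RoPE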
Finally, FlashInfer also optimizes PageAttention kernels by prefetching page indices in GPU shared memory, so that kernel performance is not affected by the page size. Combined, these optimizations allow FlashInfer to achieve significant speed-ups versus various baselines.
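For intuition only (this is not FlashInfer's API, just an illustration of what those page indices represent), a paged KV-Cache keeps keys and values in fixed-size pages scattered across a pool, and a per-request list of page indices stitches a request's cache back together:
import torch
num_pages, page_size, num_kv_heads, head_dim = 64, 16, 32, 128
kv_pool = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim).half() # 2 = (K, V)
page_indices = torch.tensor([7, 3, 42]) # this request's pages, in order
last_page_len = 5 # only 5 slots of the last page are filled
pages = kv_pool[page_indices] # gather the request's pages from the pool
kv = pages.permute(1, 0, 2, 3, 4).reshape(2, -1, num_kv_heads, head_dim)
seq_len = (len(page_indices) - 1) * page_size + last_page_len
k, v = kv[0][:seq_len], kv[1][:seq_len] # contiguous K and V for this request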
Check out the repository to get started.
The Lab
Right on track
Despite the benefits of data selection during the pretraining and instruction fine-tuning stages, maximizing data efficiency in supervised fine-tuning remains a challenge: pretrained models may struggle to provide good feature representations when the fine-tuning data is distribution-shifted, which is often the case in specialized domains.
SMALLTOLARGE (S2L) is a novel approach to data selection for supervised fine-tuning of LLMs in specialized domains that tackles this challenge. S2L gathers the training trajectory of each training example using smaller models, clusters these trajectories, and samples from the clusters in a balanced way. This ensures adequate representation of learning patterns that would be missed by uniform or weighted sampling.
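A minimal sketch of that selection step, assuming loss trajectories recorded at several checkpoints of a small proxy model (the function name, cluster count, and equal-share sampling below are our own assumptions, not the paper's code):
import numpy as np
from sklearn.cluster import KMeans

def s2l_style_select(trajectories: np.ndarray, budget: int, n_clusters: int = 100) -> np.ndarray:
    # trajectories: [num_examples, num_checkpoints] losses recorded while training a small model
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(trajectories)
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        pick = np.random.choice(members, size=min(per_cluster, len(members)), replace=False)
        selected.extend(pick.tolist())
    return np.array(selected) # indices of examples to keep for fine-tuning the large model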
Results show that S2L significantly reduces required training data size and improves performance compared to current state-of-the-art one-shot and online data selection algorithms.
Visual anchors
Traditional Vision-Language Pre-trained (VLP) models rely on pretrained object detectors to extract region-based image features. While effective, these models suffer from heavy annotation costs and expensive computation because they require training object detector models.
Text-Relevant Image Patch Selection (TRIPS) is a new method that aims to minimize the computational cost of visual encoding and cross-modal fusion by selecting text-consistent image tokens through a text-aware patch-selection layer. This layer dynamically computes text-dependent visual attention, allowing it to identify attentive image tokens with text guidance and fuse inattentive ones. The method is based on the hypothesis that not all image tokens in the visual encoder contribute positively to the final predictions of VLP models, and that many redundant image tokens exist that can be merged. The idea is to obtain a sparse image representation that retains only the informative parts.
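As a rough illustration (a toy module of our own, not the authors' code; the keep ratio, projections, and fusion rule are assumptions), such a layer can score image tokens against the pooled text feature, keep the top-scoring ones, and merge the rest into a single fused token:
import torch
import torch.nn as nn

class TextAwarePatchSelection(nn.Module):
    # Toy TRIPS-style sketch: keep the image tokens most attended by the text,
    # fuse the inattentive ones into one extra token.
    def __init__(self, dim, keep_ratio=0.6):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, image_tokens, text_cls):
        # image_tokens: [B, N, D]; text_cls: [B, D] pooled text representation
        scores = torch.einsum("bd,bnd->bn", self.q_proj(text_cls), self.k_proj(image_tokens))
        attn = (scores / image_tokens.shape[-1] ** 0.5).softmax(dim=-1) # text-dependent visual attention
        k = max(1, int(self.keep_ratio * image_tokens.shape[1]))
        keep_idx = attn.topk(k, dim=-1).indices # attentive tokens, kept as-is
        kept = torch.gather(image_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, image_tokens.shape[-1]))
        mask = torch.ones_like(attn)
        mask.scatter_(1, keep_idx, 0.0) # zero out the kept positions
        w = attn * mask # attention weights of the inattentive tokens
        fused = (w.unsqueeze(-1) * image_tokens).sum(1, keepdim=True) / (w.sum(1, keepdim=True).unsqueeze(-1) + 1e-6)
        return torch.cat([kept, fused], dim=1) # [B, k + 1, D]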
By integrating TRIPS into existing representative VLP models, approximately 40% efficiency gains can be achieved while maintaining competitive or superior downstream task performance.
More cutting-edge research
Number crunching - Chronos is a time-series forecasting model that uses language models to process numerical entries. Chronos tokenizes time series data into discrete bins through simple scaling and quantization of real values, allowing off-the-shelf language models to be trained on this "tokenized" corpus with no changes to the model architecture. Chronos can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning language models as viable forecasting tools.
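A hedged sketch of this kind of tokenization (the bin count and value range below are illustrative assumptions, not Chronos' exact configuration): mean-scale the series, then map each value to a bin id that an off-the-shelf language model can treat as a token.
import numpy as np

def tokenize_series(values: np.ndarray, num_bins: int = 4096, low: float = -15.0, high: float = 15.0) -> np.ndarray:
    scale = np.abs(values).mean() + 1e-8 # mean scaling of the context window
    scaled = values / scale
    edges = np.linspace(low, high, num_bins - 1) # uniform bin edges over the scaled range
    return np.digitize(scaled, edges) # token ids in [0, num_bins - 1]

tokens = tokenize_series(np.sin(np.arange(200) / 10.0) * 7 + 3)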
Hands-free tuning - Training with LoRA requires careful manual tuning of the rank hyperparameter. AutoLoRA introduces a meta-learning-based framework that automatically identifies the optimal rank of each LoRA layer. Each rank-1 matrix in a low-rank update matrix is associated with a selection variable that determines whether the rank-1 matrix should be discarded. A meta-learning method is then employed to learn these selection variables, and the optimal rank is determined by thresholding their values.
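As a rough sketch of the idea (a toy module of our own; the names, initialization, and maximum rank are assumptions), the low-rank update can be written as a sum of rank-1 matrices, each gated by a learnable selection variable that can later be thresholded:
import torch
import torch.nn as nn

class SelectableLoRA(nn.Module):
    # Toy AutoLoRA-style sketch: a low-rank update as a gated sum of rank-1 factors.
    def __init__(self, in_dim: int, out_dim: int, max_rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(max_rank, in_dim) * 0.01) # rank-1 input factors
        self.B = nn.Parameter(torch.zeros(max_rank, out_dim)) # rank-1 output factors
        self.alpha = nn.Parameter(torch.full((max_rank,), 1.0 / max_rank)) # selection variables

    def delta_weight(self) -> torch.Tensor:
        # sum_i alpha_i * (B_i outer A_i); after meta-learning, ranks whose alpha_i
        # falls below a threshold would be pruned to set each layer's final rank.
        return torch.einsum("r,ro,ri->oi", self.alpha, self.B, self.A)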
De-composable - PrimeComposer is a novel approach for image composition that focuses on subject-based local editing. It uses a latent diffusion model to edit local foreground areas to match a provided object and text, and spatially combine the areas with a noised background. A correlation-diffuser then combines noisy versions of the object and background, segments the synthesized subject, and captures details and relationships between the subject, object, and background to enable fine-grained, surrounding-aware composition focused on the subject.
The Pulse
Cmd + Rag - Cohere AI has released a new language model called Command R, which focuses on Retrieval Augmented Generation (RAG) to automate complex workflows. The model aims to enhance internal knowledge for customer service purposes and improve productivity by integrating with existing Cohere models and third-party tools. Users can immediately access the model via Cohere's API, and it will soon be available on major cloud platforms.
Would you like more transistors on your transistors? - Nvidia unveiled its latest "Blackwell" GPUs at the 2024 GPU Technology Conference (GTC). With 208 billion transistors, Blackwell GPUs are expected to provide significant improvements in raw floating-point performance and memory capacity compared to previous generations, and are designed to handle the increased computational requirements of AI, including the rise of multimodal, larger, and more sophisticated generative models.
Out in the wild - On March 18, 2024, Elon Musk's startup xAI open-sourced its LLM, Grok-1, making it publicly available under the Apache 2.0 license. With 314B parameters, Grok-1 is a Mixture-of-Experts (MoE) model trained from scratch by xAI within three months of the company's founding, and is intended to be a more open and humorous alternative to ChatGPT. The release includes both weights and code but omits the original training data.
And that’s all for this edition, we hope you enjoyed reading through!
The Unify Dev Team.