With Gemini and Claude going after GPT’s meal, it’s now Inflection’s turn to deliver GPT-4 performance efficiently with their new Inflection 2.5 model. We’re seeing a flurry of competitive solutions rapidly entering the space, and the days when GPT was the undisputed leader seem long gone.
Looks like this will be the year of open-then-cancel subscriptions. If only we could sign up once to always get the best deal and let the AI behemoths do their thing behind the scenes, right?
**cough**maybe-something-along-those-lines-is-coming-soon-but-don’t-go-saying-I-said-something**cough**
Node prompting
New interfaces like LangChain, LMQL, and DSPy have emerged to handle the front-end side of things, managing the prompt templates used to interact with language models, while back-end inference engines like NVIDIA TensorRT-LLM, Hugging Face TGI, Orca, and vLLM reduce latency and improve throughput. SGLang is a new framework that combines both front-end and back-end components to facilitate the programmatic usage of LLMs.
Why would you care? - Front-end and back-end tools are evolving separately, leading to missed opportunities for performance optimization. A single framework combining both layers helps simplify your stack by enabling both user-friendly prompting and optimized inference. You can have your tokenized cake and eat it too.
```python
dimensions = ["Clarity", "Originality", "Evidence"]

@function
def essay_judge(s, essay):
    s += "Please evaluate the following essay. " + essay

    # Evaluate an essay from multiple dimensions in parallel
    forks = s.fork(len(dimensions))
    for f, dim in zip(forks, dimensions):
        f += ("Evaluate based on the following metric: " +
              dim + ". End your judgement with the word 'END'")
        f += "Judgment: " + f.gen("judgment", stop="END")

    # Merge judgments
    for f, dim in zip(forks, dimensions):
        s += dim + ": " + f["judgment"]

    # Generate a summary and give a score
    s += "In summary," + s.gen("summary")
    s += ("I give the essay a letter grade of " +
          s.gen("grade", choices=["A", "B", "C", "D"]))

ret = essay_judge.run(essay="A long essay ...")
print(ret["grade"])
```
Implementation of a multi-dimensional essay judge using SGLang primitives. The function uses LLMs to evaluate the quality of an essay from multiple dimensions, merges the judgments, generates a summary, and assigns a final grade. Source: Original Paper
How does it work? - SGLang is an embedded domain-specific language in Python that provides primitives for managing the state of a prompt and generating text. An SGLang program can be executed through an interpreter, where primitives are submitted asynchronously into a stream for execution, or compiled as a computational graph for more advanced optimizations.
The SGLang interpreter treats a prompt as an asynchronous stream and manages each prompt with a stream executor in a background thread, allowing for parallelism. When fetching generation results from a prompt, the fetch action is blocked until the desired generation results are ready. SGLang can also be compiled as a computational graph, where each call to a decorated function or fork creates a new prompt state, or a stream. The graph includes nodes for primitive operators and edges for dependencies, and is executed through a graph executor, which launches stream executors for each data stream and dispatches IR nodes to the streams in topological order.
SGLang also uses RadixAttention, a novel technique for automatic KV cache reuse during runtime. RadixAttention retains the KV cache for both prompts and generation results in a radix tree, enabling efficient prefix search, reuse, insertion, and eviction. It is compatible with existing techniques like continuous batching and paged attention.
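To make the prefix reuse concrete, here is a minimal sketch of the kind of bookkeeping involved. It is only an illustration, using a plain token-level trie with placeholder KV handles, not SGLang's actual implementation, which uses a compressed radix tree over real KV-cache blocks with reference counting and eviction.

```python
class KVTrieNode:
    def __init__(self):
        self.children = {}   # token id -> KVTrieNode
        self.kv = None       # placeholder handle to the cached KV entry for this token

class KVTrie:
    def __init__(self):
        self.root = KVTrieNode()

    def match_prefix(self, tokens):
        """Length of the longest already-cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

    def insert(self, tokens, kv_handles):
        """Record the KV handle of every token so later requests can reuse them."""
        node = self.root
        for t, kv in zip(tokens, kv_handles):
            node = node.children.setdefault(t, KVTrieNode())
            node.kv = kv

# Two requests sharing a prompt prefix: the second one only needs to recompute
# KV entries for the tokens after the matched prefix.
cache = KVTrie()
cache.insert([1, 2, 3, 4, 5], kv_handles=[f"kv{t}" for t in [1, 2, 3, 4, 5]])
print(cache.match_prefix([1, 2, 3, 9, 9]))  # -> 3 tokens' KV entries can be reused
```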
Finally, SGLang introduces additional optimization techniques like cache-aware scheduling, used to increase the cache hit rate by sorting requests by matched prefix length instead of following a first-come-first-serve schedule, and new CUDA kernels to enhance the efficiency of KV cache reuse.
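And a toy illustration of the cache-aware scheduling idea, again our own sketch rather than SGLang's scheduler: pending requests are ordered by how much of their prompt already sits in the cache, instead of first-come-first-serve.

```python
# Tokens whose KV entries are currently cached (e.g. a shared system prompt),
# plus some hypothetical pending requests, each given as a list of token IDs.
cached_prefix = [101, 102, 103, 104]
pending = [
    [101, 102, 900, 901],          # reuses 2 cached tokens
    [500, 501, 502],               # no reuse
    [101, 102, 103, 104, 7],       # reuses 4 cached tokens
]

def matched_prefix_len(tokens):
    n = 0
    for a, b in zip(tokens, cached_prefix):
        if a != b:
            break
        n += 1
    return n

# Serve the requests with the longest matched prefix first to raise the hit rate.
for req in sorted(pending, key=matched_prefix_len, reverse=True):
    print(matched_prefix_len(req), req)
```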
Check out the repository to get started.
The Lab
Look Attentively
Deploying LLMs on CPU is computation-bound, with the main bottleneck being multiply-add (MAD) operations used in calculating attention scores.
NoMADAttention is a new paradigm for fast attention computations on CPUs. Instead of computing attention dot products with MAD operations, it converts them into memory lookups using Product Quantization (PQ).
PQ splits a high-dimensional vector into sub-vectors and quantizes each one to its nearest cluster centroid, so the vector can be stored as a short sequence of centroid indices (codes). To estimate a dot product with another vector, the partial dot products between that vector's sub-vectors and every centroid are precomputed into a lookup table (LUT); scoring the quantized vector then amounts to fetching one LUT entry per code and summing them.
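As a rough, NumPy-only sketch of how PQ turns dot products into table lookups (our illustration of the general idea, not the paper's optimized in-register implementation, with k-means training skipped for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_sub, n_centroids = 64, 8, 16             # dims, sub-vectors per vector, centroids per sub-space
sub_d = d // n_sub

keys = rng.normal(size=(512, d))              # e.g. cached attention key vectors
key_subs = keys.reshape(len(keys), n_sub, sub_d)

# Codebooks: proper k-means training is omitted, we simply sample centroids from the data.
idx = rng.choice(len(keys), n_centroids, replace=False)
codebooks = key_subs[idx].transpose(1, 0, 2)  # (n_sub, n_centroids, sub_d)

# Encode each key as n_sub small integer codes (the only thing stored per key).
codes = np.array([
    [np.argmin(((codebooks[j] - key_subs[i, j]) ** 2).sum(-1)) for j in range(n_sub)]
    for i in range(len(keys))
])

def estimated_scores(query):
    # Per-query lookup table of <query sub-vector, centroid> dot products:
    # this is the only place multiply-adds happen.
    lut = np.einsum("jcd,jd->jc", codebooks, query.reshape(n_sub, sub_d))
    # Scoring every key is now just n_sub table lookups plus additions each.
    return lut[np.arange(n_sub), codes].sum(axis=1)

q = rng.normal(size=d)
print(np.corrcoef(estimated_scores(q), keys @ q)[0, 1])  # rough agreement with exact dot products
```

In NoMADAttention, it is the cached attention keys that get quantized this way, so scoring a query against them becomes lookups rather than multiply-adds.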
By replacing dot products with lookups, NoMADAttention reduces the number of MAD operations required for attention computations, leading to improved performance on CPUs.
Core Training
Pre-training and fine-tuning LLMs is memory-intensive, as it requires storing billions of parameters along with their gradients and optimizer states. While Low-Rank Adaptation (LoRA) techniques help mitigate this issue, they come at the cost of some performance.
GaLore is a new, memory-efficient training method that uses gradient low-rank projection to train language models from scratch on lower resource devices.
Instead of storing the full gradient statistics when training, GaLore keeps the statistics of a small "core" of the gradient in the optimizer states: the gradient is projected onto a lower-dimensional subspace using projection matrices, and the resulting low-rank update is projected back to update the full weight matrix. Unlike LoRA, GaLore applies the low-rank structure to the gradients rather than to the weights, so training dynamics are left essentially unchanged, and since it doesn't need to store separate low-rank adapters, it requires less memory than LoRA during training.
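Here is a simplified, self-contained sketch of the projection idea on a toy least-squares problem; it is only an illustration, not the official GaLore implementation, which works per layer inside an Adam-style optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank, lr, beta = 256, 128, 4, 0.1, 0.9

# Toy objective: fit W so that W @ X approximates Y.
X = rng.normal(size=(n, 512))
Y = rng.normal(size=(m, 512))
W = np.zeros((m, n))

def compute_grad(W):
    # gradient of 0.5 * mean ||W @ X - Y||^2 with respect to W
    return (W @ X - Y) @ X.T / X.shape[1]

moment = np.zeros((rank, n))                 # optimizer state lives in the rank-r subspace

for step in range(200):
    grad = compute_grad(W)                   # full-rank gradient, shape (m, n)
    if step % 100 == 0:                      # periodically refresh the subspace
        U, _, _ = np.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]                      # (m, rank) projection matrix
    low_rank_grad = P.T @ grad               # (rank, n): the gradient's "core"
    moment = beta * moment + (1 - beta) * low_rank_grad
    W -= lr * (P @ moment)                   # project the update back to full rank

print("optimizer state:", moment.shape, "vs full gradient:", compute_grad(W).shape)
```

The point to notice is that the optimizer state has shape (rank, n) instead of (m, n), which is where the memory savings come from.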
GaLore reduces memory consumption in optimizer states by up to 65.5%, while maintaining both efficiency and performance during large-scale pre-training and fine-tuning of LLMs.
More Cutting-Edge Research
If you can see it, you can learn it - Simulated Trial and Error (STE) is a new biology-inspired method for efficient tool learning by LLMs. STE consists of two phases. In the exploration phase, the LLM interacts with a new API to (1) imagine a plausible user query, (2) try to fulfill the query by interacting with the API, and (3) reflect on the outcome to improve subsequent exploration. In the exploitation phase, the collected trials are used to enhance the tool-use ability of the LLM via either fine-tuning or in-context learning (ICL). Simulated learning is shown to greatly improve LLMs' tool-learning capabilities, further increasing their potency as all-purpose models.
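A very rough, runnable sketch of the two phases as described above; the llm() stub, the toy weather tool, and the trial format are all hypothetical stand-ins rather than the paper's actual pipeline.

```python
def llm(prompt):
    # placeholder for a real model call
    return f"[model output for: {prompt[:40]}...]"

def weather_api(city="Paris"):
    # toy stand-in for the new tool being learned
    return {"city": city, "temp_c": 21}

def explore(num_trials=3):
    trials, reflections = [], []
    for _ in range(num_trials):
        # (1) imagine a plausible user query the tool could serve
        query = llm(f"Invent a user request answerable with a weather API. Avoid: {reflections}")
        # (2) try to fulfil the query by actually calling the tool
        result = weather_api()
        # (3) reflect on the outcome to steer the next trials
        reflections.append(llm(f"Query: {query}\nResult: {result}\nWhat should we try next?"))
        trials.append((query, result))
    return trials

# Exploitation: the collected trials become fine-tuning examples or ICL demonstrations.
demos = "\n".join(f"User: {q}\nTool result: {r}" for q, r in explore())
print(demos)
```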
Learn when to stop - A new paper shows that not all layers of an LLM are necessary during inference, and that tasks of different difficulty activate different layers. The authors introduce AdaInfer, an adaptive early-exit method that improves inference efficiency through instance-aware inference. At each layer, AdaInfer computes a stopping signal: a Feature Selection module builds a feature vector for the current input instance, and a Classifier module assesses the strength of the signal, triggering early termination of inference if it is strong enough. AdaInfer preserves the model's innate abilities without changing any parameters, avoiding the risk of compromising generalization capabilities.
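A toy sketch of the instance-aware early-exit control flow; the random layers, the hand-rolled feature, and the stopping rule below are our own stand-ins, not AdaInfer's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, hidden = 12, 64
layers = [rng.normal(scale=0.2, size=(hidden, hidden)) for _ in range(num_layers)]

def stopping_signal(h, layer_idx):
    # Stand-in for the Feature Selection + Classifier modules: a toy "confidence"
    # derived from the hidden state and the current depth.
    feature = np.abs(h).max()
    return 1.0 / (1.0 + np.exp(-(layer_idx * feature - 3.0)))

def forward_with_early_exit(x, threshold=0.8):
    h = x
    for i, W in enumerate(layers):
        h = np.tanh(h @ W)                         # one toy decoding layer
        if stopping_signal(h, i + 1) > threshold:  # strong signal: skip the remaining layers
            return h, i + 1
    return h, num_layers

_, used = forward_with_early_exit(rng.normal(size=hidden))
print(f"ran {used} of {num_layers} layers for this input")
```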
Crisp clear - NaturalSpeech 3 is a new text-to-speech synthesis model that produces natural-sounding speech with better quality, similarity, and controllability than the current SOTA. It consists of a neural speech codec (FACodec) and a factorized diffusion model. The FACodec decomposes the speech waveform into subspaces representing content, prosody, timbre, and acoustic details, and can reconstruct the waveform from these attributes. The factorized diffusion model generates each attribute representation, which the FACodec decoder then uses to reconstruct the waveform. Performance aside, the model is also shown to enable speech attribute manipulation by customizing speech attribute prompts.
The Pulse
Radio Feed - Google is exploring the use of radio broadcasts to feed its AI models. The approach involves training audio-based models on streams from numerous global radio stations, with deduplication techniques employed to prevent overfitting and to ensure a wider range of phrases and languages is learned.
H2F2 - Hugging Face is launching an open robotics project that aims to design, build, and maintain open-source, low-cost robotic systems that integrate AI. Key objectives range from building affordable robots out of off-the-shelf components and 3D-printed parts to integrating deep learning and embodied AI technologies into those systems.
PlusUltra - AMD has released the latest addition to its family of FPGAs, the Spartan UltraScale+, designed specifically for IoT devices. The Spartan UltraScale+ family boasts several advantages, including high I/O-to-logic ratios, low power consumption, improved fabric performance, and advanced security features such as NIST-approved algorithms for post-quantum cryptography. These attributes make the Spartan UltraScale+ suitable for a variety of applications, including industrial robotics, healthcare, and video processing.
And that’s all for this edition; we hope you enjoyed reading through!
The Unify Dev Team.