With Meta announcing its work on Llama 3, speculation is swirling over possible release timelines for GPT-5, which is commonly expected to possess reality-bending abilities (except by Anthropic's CEO, who seems to think otherwise). While this remains uncertain, 2024 will likely witness a massive upgrade in the accessibility of AI through even more mainstream applications and AI-embedded consumer devices, which may end up having an even greater societal impact than a brand new set of weights.
Do not look at her decoder heads
Several techniques have been developed to improve the inference speed of LLMs, but memory constraints and adaptability to different systems are often an issue. MEDUSA proposes to accelerate LLM inference while integrating easily into current LLM systems, including distributed environments.
Why would you care? MEDUSA can speed up inference by 2.3x-3.6x without compromising output quality. Most importantly, it can be easily integrated into existing systems while removing some of the complexities involved with other acceleration tools. Plus, you should feel at home if you're a coffee drinker.
How does it work? Auto-regressive decoding requires transferring the complete model parameters to the accelerator's cache at every decoding step, which creates a latency bottleneck. Speculative decoding can mitigate this issue, but obtaining an appropriate draft model can be challenging. MEDUSA follows the same generate-process-accept candidates framework as speculative decoding, but integrates additional decoding heads capable of concurrently predicting multiple tokens.
Speculative Decoding? Speculative decoding generates multiple candidate completions of a prompt in parallel. The idea is to "speculate" about possible continuations of the prompt and generate several alternatives at once, rather than waiting for each token to be generated sequentially. The candidate completions are generated by a smaller draft model and then verified by the base model. They are ranked based on their likelihood or other criteria such as diversity or relevance, with the highest-ranking completion selected as the final output.
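To make the generate-then-verify loop concrete, here is a minimal sketch of one speculative decoding step with greedy acceptance. It assumes HuggingFace-style causal LMs that return `.logits`; the draft length `k` and the acceptance rule are simplifications for illustration, not MEDUSA's exact procedure.

```python
import torch

@torch.no_grad()
def speculative_decode_step(draft_model, base_model, input_ids, k=4):
    # 1. The small draft model proposes k tokens auto-regressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_tok = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2. The base model scores the whole extended sequence in a single pass.
    base_logits = base_model(draft_ids).logits

    # 3. Accept draft tokens as long as they match the base model's own choice;
    #    on the first mismatch, keep the base model's token and stop.
    accepted = input_ids
    for i in range(k):
        pos = input_ids.shape[1] + i - 1  # logits at `pos` predict the token at pos + 1
        base_choice = base_logits[:, pos, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, base_choice], dim=-1)
        if not torch.equal(base_choice, draft_ids[:, pos + 1 : pos + 2]):
            break
    return accepted
```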
The main building blocks of MEDUSA include:
MEDUSA Heads: These are decoding heads appended to the last hidden states of the original model. The heads are fine-tuned in a parameter-efficient manner, either alone to accommodate limited resources (MEDUSA-1), or alongside the backbone model to take advantage of its learned representations and ensure the heads' distribution aligns with that of the original model (MEDUSA-2). Since each new head consists of just a single layer akin to the original language model head, they don't add complexity to the serving system design (see the sketch of the heads and the acceptance rule after this list).
Typical Acceptance: Unlike speculative decoding, MEDUSA doesn't use rejection sampling (which discards low-quality candidates with a certain probability) but rather a typical acceptance scheme to select plausible candidates. Typical acceptance keeps candidates that aren't exceedingly improbable under the original model, using the original model's prediction probability to set the acceptance threshold.
Tree Attention: MEDUSA uses a tree structure to process multiple candidates concurrently. With tree attention, only tokens from the same continuation are regarded as historical data. The tree structure can be refined by (1) using a calibration dataset to calculate the accuracy of the top predictions for different heads, (2) adding nodes to the tree by choosing the node connected to the current tree that has the highest accuracy, and (3) repeating the process until the desired number of nodes is reached.
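As a rough illustration of the first two building blocks, the sketch below shows a single extra decoding head and a typical-acceptance check. The head layout and the threshold constants are assumptions based on the description above, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MedusaHead(nn.Module):
    # One extra decoding head: a single residual block followed by an
    # LM-head-style projection, applied to the backbone's last hidden states.
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, last_hidden_state):
        h = last_hidden_state + F.silu(self.proj(last_hidden_state))
        return self.lm_head(h)  # logits for a token several steps ahead

def typical_acceptance(original_probs, candidate_token, epsilon=0.09, delta=0.3):
    # Accept a candidate if it is not exceedingly improbable under the original
    # model: the threshold combines a hard floor (epsilon) with a term that
    # relaxes when the original distribution is high-entropy
    # (constants here are illustrative, not the paper's).
    entropy = -(original_probs * original_probs.clamp_min(1e-9).log()).sum(dim=-1)
    threshold = torch.minimum(torch.tensor(epsilon), delta * torch.exp(-entropy))
    return original_probs[..., candidate_token] >= threshold
```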
Through self-distillation, MEDUSA can also be used to fine-tune models when the training dataset is unavailable or the output distribution was distorted by RLHF. This involves taking a public seed dataset from a domain similar to the target model's and asking the model to reply to the prompts it contains. While this is typically enough for MEDUSA-1, it leads to quality deterioration for MEDUSA-2, which can be mitigated by using the original model's probability predictions instead of the ground-truth tokens as labels for the backbone model.
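A minimal sketch of that soft-label idea, assuming logits of shape (num_tokens, vocab_size); the paper's exact loss and weighting may differ:

```python
import torch.nn.functional as F

def self_distillation_loss(backbone_logits, original_logits):
    # Soft labels: match the frozen original model's predicted distribution
    # instead of the ground-truth token.
    teacher_probs = F.softmax(original_logits.detach(), dim=-1)
    log_student = F.log_softmax(backbone_logits, dim=-1)
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")
```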
Check out the repository to get started using it.
The Lab
The path to enlightenment alignment - Improving LLM alignment often faces limitations due to the expense and limited availability of high-quality human preference data. To address this challenge, researchers from Meta and New York University introduced a self-improving training method that uses a single agent model with both instruction-following and self-instruction-creation abilities to continuously update its own training inputs and reward model. In each iteration, the agent model generates candidate responses for new prompts, and the same model is then used to assign rewards to those responses via LLM-as-a-Judge prompting. The trained model thereby refines its own training set by creating new prompts (sampled from the original seed data), candidate responses, and rewards for those responses, so the seed data is augmented with additional data from AI feedback. Experimental results indicate enhanced instruction-following performance and improved reward-modelling capability when comparing self-rewarding alignment to the base model. As the model undergoes iterative training, it can generate increasingly better preference datasets, potentially leading to more effective reward models than those obtainable solely from human-generated data.
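Roughly, one iteration looks like the sketch below. The helper functions are hypothetical placeholders, not the authors' code, and the preference-optimization step is simplified:

```python
def self_rewarding_iteration(model, prompts, n_candidates=4):
    # The same model writes candidate responses, scores them via an
    # LLM-as-a-Judge prompt, and is then trained on the resulting preference pairs.
    # `generate`, `judge_reward`, and `dpo_train` are placeholders.
    preference_pairs = []
    for prompt in prompts:
        candidates = [generate(model, prompt) for _ in range(n_candidates)]
        rewards = [judge_reward(model, prompt, c) for c in candidates]
        chosen = candidates[max(range(n_candidates), key=rewards.__getitem__)]
        rejected = candidates[min(range(n_candidates), key=rewards.__getitem__)]
        preference_pairs.append((prompt, chosen, rejected))
    # Preference optimization on self-generated data yields the next iteration's model.
    return dpo_train(model, preference_pairs)
```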
Not sure what you said, but makes sense - Input tokens can embed data across multiple modalities, making transformers useful for multimodal applications. Notably, a model's performance on a given modality can be improved using data from another modality, but this usually requires the data samples from the two modalities to be relevant to each other. Multimodal Pathway Transformer (M2PT) aims to remove this limit and allows improving a model using irrelevant data from another modality. Because transformers of different modalities may share the same transformer block structure despite using different tokenizers, M2PT augments every layer of a target model's transformer blocks with its counterpart from an auxiliary model of a different modality. Each augmented layer can then process its own inputs while also drawing on the equivalent layer of the other-modality model. The system can be extended to include multiple models of distinct modalities, creating a network of input-transformer pathways that extract the modality-complementary knowledge encoded in auxiliary models to improve the target model. To mitigate the compute and latency overhead during training, re-parameterization is used to convert the connections between model structures into connections between the equivalent weights, so that the re-parameterized model has the same number of linear layers as the original model.
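A minimal sketch of that re-parameterization idea for a single linear layer, assuming the target and auxiliary layers share the same shape; the scalar mixing coefficient and its initialization are assumptions of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamLinear(nn.Module):
    # Instead of running the target and auxiliary linear layers side by side,
    # fold the (frozen) auxiliary-modality weights into the target weights so
    # the model still contains a single linear layer per position.
    def __init__(self, target_linear: nn.Linear, aux_linear: nn.Linear):
        super().__init__()
        assert target_linear.weight.shape == aux_linear.weight.shape
        self.target = target_linear
        self.aux = aux_linear
        self.aux.weight.requires_grad_(False)
        self.scale = nn.Parameter(torch.zeros(()))  # learnable, starts at no mixing

    def forward(self, x):
        merged_weight = self.target.weight + self.scale * self.aux.weight
        return F.linear(x, merged_weight, self.target.bias)
```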
More cutting-edge research:
Bytes is the new English - Byte sequences have been explored as an alternative to subword tokenization, but this approach typically involves longer sequences, which scale poorly due to the quadratic cost of the attention mechanism built into transformers. MambaByte is proposed as an efficient byte-level language model that uses the Mamba architecture to remove this computational bottleneck and allow for the efficient use of bytes as a direct mapping from raw data to predictions without the need for tokenization. MambaByte outperforms prior byte-based models as well as traditional tokenizer-based transformers of larger sizes in terms of speed, efficiency, and overall effectiveness, making tokenizer-free training a viable alternative.
Go green, go quantum - Compression techniques can mitigate the computational footprint of LLMs, but they introduce compression errors that are difficult to control. CompactifAI is a novel method that uses tensorization to truncate the correlations built into a model while controlling the degree of truncation. The novelty lies in the introduction of Matrix Product Operators (MPOs), which are typically used in quantum physics to represent quantum many-body systems (i.e., physical systems made of a large number of interacting parts, each exhibiting quantum mechanical behavior). The decomposition is determined by performing singular value decomposition (SVD) on the weight matrices of the model's layers most suitable for compression, keeping only the neurons necessary to describe the system. Compressed models are typically between 15% and 30% of the original model's size (depending on float precision) with 90% of accuracy preserved. Importantly, CompactifAI offers a more controllable and explainable approach to LLM compression compared to traditional techniques.
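To illustrate just the truncation step (CompactifAI additionally reshapes weights into MPOs, which is not shown here), a rank-truncated SVD of a single weight matrix looks like this:

```python
import torch

def svd_truncate(weight: torch.Tensor, rank: int):
    # Keep only the `rank` largest singular values of a weight matrix, replacing
    # one dense (m x n) matrix with two thin factors (m x r) and (r x n).
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (m, r)
    B = Vh[:rank, :]             # (r, n)
    return A, B                  # approximate weight ≈ A @ B

# e.g. compressing a 512 x 512 layer to rank 64 keeps ~25% of the parameters
A, B = svd_truncate(torch.randn(512, 512), rank=64)
```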
Buckle up for that long read - LLMs struggle with maintaining consistent context for text inputs larger than their context window, and while methods exist to mitigate this issue by storing text information within the model's KV cache, this typically requires significant resources during training and/or inference. Instead, Temp-Lora stores context information in the model's parameters through a temporary LoRA module. During inference, tokens are generated chunk by chunk and, once the number of generated tokens reaches a certain size, the most recent chunk(s) are iteratively used to train and update the temporary module before generating the next chunk. The module is then discarded at the end of the process to avoid impacting the model's parameters. To accelerate inference, KV states are only recomputed with the latest temporary module when the model reaches its context window size. The approach manages to improve the tested models' perplexity on various benchmarks and, most importantly, provides a flexible method for efficiently handling text inputs of varying sizes.
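At a high level, the generation loop could be sketched as below; the helper functions are hypothetical placeholders rather than the paper's actual API:

```python
def generate_with_temp_lora(model, prompt_ids, chunk_size=1024, max_new_tokens=8192):
    # A temporary LoRA adapter absorbs each newly generated chunk and is thrown
    # away at the end so the base weights stay intact. `attach_lora`,
    # `generate_chunk`, `train_lora_on`, and `discard` are placeholders.
    lora = attach_lora(model)            # temporary adapter, freshly initialised
    generated = list(prompt_ids)
    while len(generated) - len(prompt_ids) < max_new_tokens:
        chunk = generate_chunk(model, lora, generated, chunk_size)
        generated.extend(chunk)
        train_lora_on(model, lora, generated[-chunk_size:])  # absorb the newest chunk
    discard(lora)                        # base model parameters remain unchanged
    return generated
```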
The Pulse
Or live long enough to become a fab - OpenAI CEO Sam Altman is actively exploring the creation of semiconductor manufacturing facilities to address growing demand for high-end chips, and is engaging in discussions with investors and company executives from around the world, with potential partners including Samsung, SK Hynix, SoftBank, and G42. While the specific location of these factories is yet to be determined, Altman has reportedly held talks with congressional representatives about constructing the new fabs in the United States, aligning with the country's intentions to boost local chip production and limit AI chip exports to China.
So small it must be inoffensive - Last year, Microsoft released Phi-2, a small language model (SLM) that demonstrated performance comparable to larger models from competitors despite being only 2.7 billion parameters in size, and the company is now dedicating an entire separate team to lead the effort of developing more SLMs. Aside from capturing the market potential of this untapped segment, developing independent SLMs is also a way for the company to reduce its dependence on OpenAI, which it has mostly been relying on to provide its AI services. This could help Microsoft mitigate some of the high costs involved in running OpenAI's powerful models, and potentially hedge against eventual disruptions caused by the legal challenges facing OpenAI.
A picture worth a thousand annotators - NVIDIA is working on a patent for a system that allows AI models to create their own ground-truth data for training and retraining purposes, particularly for object detection in computer vision. The process involves generating predictions of objects from various angles and using an object-tracking algorithm to identify where the model failed to detect objects, allowing the model to retrain and label itself without human intervention, or at least with minimal oversight to mitigate self-amplifying biases. With this patent, NVIDIA is trying to cut down on the expense and time needed to obtain high-quality visual data, giving researchers on a budget another reason to stick to its ecosystem.
And that’s all for this edition. We hope you enjoyed reading through!
The Unify Dev Team.