We’re living in interesting times where corporate events can move faster than developments in AI research. In this edition, we’ll be covering some of the latest deployment news you may have missed if you’ve been buried under the extensive coverage of the recent OpenAI drama.
But before you dive in, we’ve just released a new blog post on Ivy, our open-source framework that unifies popular ML frameworks for accelerated development and deployment. If you’ve ever felt overwhelmed by the variety of frameworks out there, you know where to go!
Wanna move fast? Look ahead but mind your past.
Transformers have become a ubiquitous architecture for many AI applications, notably text generation. Yet we are still used to LLMs producing tokens one after the next, despite running on powerful GPUs that excel at parallel computation. A novel approach dubbed Lookahead Decoding (LAD) aims to disrupt this pattern by generating text at multiple future positions simultaneously, significantly reducing latency as a result.
Why would you care? Results from this approach show a consistent ~1.5x to ~2x speed-up on LLMs of various sizes, and because both the window and n-gram sizes can be tweaked, you can easily customize LAD to your own latency needs. In fact, you can get started right away to accelerate your text generation, as this is now compatible with Hugging Face’s transformers library! It’s as simple as:
pip install lade

import torch
import lade
from transformers import AutoTokenizer, AutoModelForCausalLM

# Enable Lookahead Decoding (patches transformers' generation under the hood)
lade.augment_all()
lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

# Typical transformers workflow; the values below are examples, swap in your own
model_name = "meta-llama/Llama-2-7b-chat-hf"   # example model
torch_device = "cuda"
input_text = "Lookahead decoding in one sentence:"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=torch_device)
model_inputs = tokenizer(input_text, return_tensors='pt').to(torch_device)
greedy_output = model.generate(**model_inputs, max_new_tokens=1024)
How does it work? Isn’t decoding based on assessing the conditional probability of the next token given the previous ones? How can you generate tokens that come much later in the sequence when you don’t have the preceding tokens to condition on? I hear you ask, with a hint of poorly hidden skepticism. Well, let’s think step by step as usual and break down the main building blocks.
Concurrent generation: Instead of following an autoregressive decoding approach (a fancy way of saying generating the next word based on the previous ones), LAD relies on the Jacobi method, which solves nonlinear systems by guessing a solution and iteratively refining it until it converges. Applied to text generation, this means producing a whole sequence of guessed tokens at once at each step and then verifying them in order, checking whether each one matches the token the model would have produced after the previous one. If it’s a match, we move on to the next guess in the sequence; if not, the freshly predicted tokens become the new guess sequence, rinse and repeat.
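To make the guess-and-verify loop concrete, here is a minimal sketch of a single Jacobi-style iteration on top of a Hugging Face causal LM. It is our own toy illustration (greedy argmax, no caching, and a hypothetical jacobi_step helper), not the actual LAD implementation:

import torch

def jacobi_step(model, prefix_ids, guess_ids):
    # One forward pass over the confirmed prefix plus the current guess sequence.
    seq = torch.cat([prefix_ids, guess_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids=seq).logits[0]
    # Logits at position i predict token i+1, so the predictions for the guessed
    # positions start at the last prefix token.
    preds = logits[prefix_ids.size(0) - 1 : -1].argmax(dim=-1)
    # Accept leading guesses that match what the model would have generated anyway.
    accepted = 0
    for predicted, guessed in zip(preds.tolist(), guess_ids.tolist()):
        if predicted != guessed:
            break
        accepted += 1
    # The fresh predictions become the refined guess for the next iteration.
    return preds, accepted

In a full loop you would append the accepted tokens (plus the first corrected prediction) to the prefix and reuse preds as the next guess; even with zero matches, preds[0] is a correct next token, so each forward pass advances at least as far as ordinary greedy decoding.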
Enter n-grams: Cool stuff, right? Except Jacobi decoding alone isn’t enough and often converges slowly. LAD mitigates this by exploiting the fact that the verification process keeps producing word associations, or n-grams, as a by-product. At each update step, LAD searches this pool of previously seen n-grams for continuations matching the tokens it has already confirmed and uses them to refine its guesses. Letting generated sequences follow patterns that have already appeared, rather than relying on a single step’s predictions, increases the likelihood of making multiple correct guesses at once and further reduces convergence time.
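As a rough picture of how such a pool might look, here is a tiny dict-based cache. It is a simplification of LAD’s actual n-gram machinery, and record_ngrams and candidate_guesses are hypothetical helpers of our own:

from collections import defaultdict

# Previously seen n-grams, keyed by their first token id.
ngram_pool = defaultdict(set)

def record_ngrams(trajectory, n=4):
    # Harvest n-grams from a past decoding trajectory so they can seed future guesses.
    for i in range(len(trajectory) - n + 1):
        gram = tuple(trajectory[i : i + n])
        ngram_pool[gram[0]].add(gram[1:])

def candidate_guesses(last_token, limit=7):
    # Propose continuations whose n-gram started with the last confirmed token,
    # so new guesses follow patterns that have already been observed.
    return list(ngram_pool[last_token])[:limit]

Here limit plays a role loosely analogous to the GUESS_SET_SIZE knob in the snippet earlier.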
Hold on, this feels familiar! Yeah, but not quite. If you’re thinking of speculative decoding, it does generate full sequences of tokens as well. But because it involves two models, one making guesses and the other verifying them, the interaction between the two adds friction to the process, for instance how accurately the draft model can predict the main model’s outputs.
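For contrast, here is an equally bare-bones sketch of one speculative decoding round, using greedy acceptance instead of the usual rejection sampling; draft_model and target_model stand for any pair of compatible causal LMs, and speculative_step is a hypothetical helper:

import torch

def speculative_step(draft_model, target_model, prefix_ids, k=5):
    # The small draft model proposes k tokens autoregressively (the cheap part).
    draft = prefix_ids.clone()
    for _ in range(k):
        with torch.no_grad():
            next_id = draft_model(input_ids=draft.unsqueeze(0)).logits[0, -1].argmax()
        draft = torch.cat([draft, next_id.view(1)])
    proposed = draft[prefix_ids.size(0):]
    # The large target model checks all k proposals in a single forward pass.
    with torch.no_grad():
        logits = target_model(input_ids=draft.unsqueeze(0)).logits[0]
    preds = logits[prefix_ids.size(0) - 1 : -1].argmax(dim=-1)
    accepted = 0
    for predicted, guessed in zip(preds.tolist(), proposed.tolist()):
        if predicted != guessed:
            break
        accepted += 1
    # Throughput now depends on how well the draft model mimics the target model,
    # which is exactly the extra coupling LAD avoids by needing no second model.
    return proposed[:accepted], accepted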
If you want to read more, check out the original blog post which goes into more detail, and don’t forget to give the repo a look!
The Lab
Teamwork for the win - Researchers from the Donders Institute, Radboud University, New York University and DeepMind recently explored Game Theory Assisted Pruning (GTAP), which assesses the joint impact of different neuron combinations on prediction quality to determine an appropriate pruning scheme. GTAP measures prediction uncertainty over a sample of dropout-adjusted sub-networks to (a) estimate the size of a critical sub-network (one for which removing any neuron sharply increases uncertainty), (b) use this estimate to construct an appropriate power index (used in game theory to fairly assess an agent’s contribution to a team), and (c) determine the best combination of neurons to keep based on each neuron’s power index value. Beyond increased pruning efficiency, their approach provides a systematic process for trimming down networks, in contrast with the layer-wise or network-wise random pruning typically used in practice.
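As a very rough illustration of the ranking idea (not the paper’s actual procedure), one could estimate a Banzhaf-style power index by sampling random dropout masks and measuring each neuron’s average effect on uncertainty; evaluate_uncertainty and sampled_power_index below are hypothetical placeholders:

import numpy as np

def sampled_power_index(evaluate_uncertainty, num_neurons, num_samples=2000, keep_prob=0.5, seed=0):
    # Monte-Carlo estimate of each neuron's marginal contribution: on average, how much
    # does adding it to a random dropout-style sub-network reduce prediction uncertainty?
    # evaluate_uncertainty(mask) is a user-supplied function scoring a boolean keep-mask.
    rng = np.random.default_rng(seed)
    totals = np.zeros(num_neurons)
    counts = np.zeros(num_neurons)
    for _ in range(num_samples):
        mask = rng.random(num_neurons) < keep_prob    # random sub-network
        i = rng.integers(num_neurons)                 # neuron whose contribution we probe
        without_i, with_i = mask.copy(), mask.copy()
        without_i[i], with_i[i] = False, True
        totals[i] += evaluate_uncertainty(without_i) - evaluate_uncertainty(with_i)
        counts[i] += 1
    return totals / np.maximum(counts, 1)             # higher value = more important neuron

Pruning then amounts to keeping the neurons with the highest index values instead of dropping weights at random.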
Edge and Cloud can be friends too. Cloud models generally provide high performance but fail to adjust to evolving conditions given their pre-deployment training, while edge systems can adapt to changing local data but remain constrained in compute. Based on this premise, researchers from Shanghai Jiao Tong University and Huawei Noah’s Ark Lab recently explored the joint use of edge and cloud systems for better model adaptation to dynamic environments. Their framework, ECLM, offers an intuitive way to design collaborative cloud-edge systems: the cloud continuously delivers bespoke sub-models to edge devices which, in turn, adapt them to recent local data and compute constraints before uploading them back to the cloud for integration. This design provides a framework for efficient deployment of large models over heterogeneous devices that improves both model accuracy and latency at scale.
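To make that loop a bit more tangible, here is a purely schematic sketch of one collaboration round; every name in it (extract_submodel, adapt_on_device, integrate, the device attributes) is a hypothetical placeholder rather than ECLM’s actual interface:

def collaboration_round(cloud_model, edge_devices, extract_submodel, adapt_on_device, integrate):
    # One round of the cloud-edge loop; the helpers are passed in because their
    # internals (sub-model extraction, on-device training, aggregation) are the
    # system-specific parts.
    adapted_parts = []
    for device in edge_devices:
        # Cloud carves out a bespoke sub-model sized for the device's compute budget.
        sub_model = extract_submodel(cloud_model, budget=device.compute_budget)
        # Edge adapts it to recent local data.
        adapted_parts.append(adapt_on_device(sub_model, device.local_data))
    # Cloud folds the adapted sub-models back into the large model for the next round.
    return integrate(cloud_model, adapted_parts)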
The Pulse
AI in your phone, no not Siri. Qualcomm’s latest Snapdragon 7 Gen 3 was just announced with a clear focus on AI applications. Natively powering models of up to 10-13B parameters, the chip should allow users to run models such as Whisper or Llama 2 locally on their phones, further increasing the accessibility of the state of the art to the general consumer. This fits within the broader recent trend of emerging small(er) models being run directly on users’ devices, which could help alleviate the demand burden on cloud compute and improve the prospects of edge deployment.
Efficient AI for IoT, within an Arm’s reach? Arm recently announced the Arm Cortex-M52 processor to power AI on IoT devices. The new addition to their Cortex-M portfolio delivers up to a 5.6x improvement in ML performance and up to a 2.7x boost in digital signal processing compared to previous generations. This allows more compute-intensive models to be deployed on low-power embedded devices without a dedicated NPU, unlocking more AI applications on connected devices as models grow in size and capability!
No seriously, who said there’s not enough GPU? Following a series of custom AI chip announcements, it’s now Amazon’s turn to unveil Trainium2, their brand-new AI chip set to compete with the flurry of recent rollouts from rival companies. The new chip is notably expected to be used by Databricks and Anthropic, with the latter planning to train its next models with it. The announcement adds to the other major chip releases anticipated throughout next year, bringing more supply to a market that until recently was largely dominated by demand, and making computing power more accessible to the masses.
And that’s all for the second edition, we hope you enjoyed the read. Have a great week!
The Unify Team