Hi there 👋
Welcome to the inaugural edition of "The Deep Dive", the newsletter where we explore the messy world of AI deployment, with practical insights and (hopefully) some useful tips and tricks for navigating this ever-changing landscape 🧑‍💻. If you're a developer with a passion for machine learning 🤖 but a disdain for its complex and fragmented ecosystem 🫠, then you've come to the right place (or maybe the wrong place… depending on how much you really hate it).
As a team of fellow engineers frustrated with this ongoing complexity, we felt the need for clarity. Each week, we will therefore dive deep 🤿 and bring you the most interesting practical developments in AI deployment. At the end of each issue, we'll also include dedicated sections for the latest news:
The Lab - Byte-sized summaries of the latest in bleeding-edge deployment research
The Pulse - Your weekly update on the market events shaping AI deployment
Before you dive into the first issue, we’d like to announce our new series of blog posts on the deployment stack, with a new post released every week. Check out the link below if you’d like to get an overview of the space!
Deploy Your Models With PyTorch, No, Really.
Recently, PyTorch announced new tools in its ecosystem designed to assist in deploying PyTorch code at scale (about time!). ExecuTorch is one such tool, aiming to address the issues that previously made edge deployment with PyTorch inefficient, or outright impossible.
At a high level, ExecuTorch is an end-to-end solution that enables on-device inference across mobile and edge devices, making it essentially a more robust version of PyTorch Mobile.
Let’s take a moment to break down the various components involved. Given a PyTorch model,
ExecuTorch converts the model’s modules into an edge-compatible intermediate representation using torch.export, which takes an arbitrary Python callable as input and generates a traced graph capturing the function’s Tensor computations ahead of execution. There are actually multiple intermediate-representation conversions happening under the hood, with optimizations applied along the way to make the model runnable on edge devices, but let’s not get too technical.
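To make this more concrete, here’s a minimal sketch of the ahead-of-time export path, loosely following the ExecuTorch docs (the TinyModel module is just a placeholder, and exact API details may vary between versions):

```python
import torch
from torch.export import export
from executorch.exir import to_edge  # ExecuTorch's ahead-of-time lowering API

# Placeholder model standing in for your actual PyTorch module
class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) * 2

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# 1. torch.export traces the callable into a graph of its Tensor computations
exported_program = export(model, example_inputs)

# 2. Lower to the edge-compatible intermediate representation, then to an
#    ExecuTorch program that the on-device runtime can load
executorch_program = to_edge(exported_program).to_executorch()

# 3. Save the serialized program for deployment on the target device
with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```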
torch.compile is used to compile the graph into an ExecuTorch executable format that can be used for running inference. torch.compile relies on TorchDynamo to efficiently capture PyTorch graphs, TorchInductor to accelerate PyTorch code on various hardware backends, and AOTAutograd to capture backpropagation ahead of time.
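For a bit of context, here’s roughly what torch.compile looks like on its own (a minimal sketch; the toy scaled_gelu function is purely illustrative):

```python
import torch

# torch.compile wraps a function or module: TorchDynamo captures the graph,
# AOTAutograd traces the forward and backward passes ahead of time, and
# TorchInductor generates optimized kernels for the target hardware.
@torch.compile
def scaled_gelu(x, scale):
    return torch.nn.functional.gelu(x) * scale

out = scaled_gelu(torch.randn(1024), 0.5)  # the first call triggers compilation
```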
Besides the streamlined process it offers, one of the major benefits of ExecuTorch is that it defines a list of standardised operators, which makes it simpler for third-party operator libraries and accelerator backends to integrate with.
ExecuTorch has also received backing from industry leaders like Meta, Arm, Apple, and Qualcomm. So if big tech is on board, it might be worth updating your workflows!
The Lab
Just a few nodes to predict many - Researchers from Stanford, Illinois Urbana-Champaign and Google recently proposed Graph Segment Training (GST) to predict the properties of large graphs using limited computational resources. GST lets you train GNNs on large graphs while keeping a constant memory footprint. At a high level, it works by dividing the graph into smaller segments, a random sample of which is used to update the model at each training step, so that intermediate activations only need to be kept for a few segments during backpropagation (see the sketch at the end of this section). For AI deployment, GST enables estimating how compiled code will perform without needing to actually compile it, which opens the door for search methods such as reinforcement learning (RL) to explore the huge space of compiler configurations and find optimal ones. If effectively integrated into larger compiler systems, it could mean no more throwing darts in the dark, hoping to hit the runtime bullseye.
Spice up your compute diet with stats - Typical approaches to dealing with the high power consumption of DNNs running on edge devices include compressing the models and using compute-in-memory architectures, but issues remain with both: the former leads to accuracy loss, and the latter is still far from commercial mass usage. This new paper shows that applying low-power coding to the weights, based on the bit-level distributions of a DNN's weights and activations, can reduce power consumption on edge devices by over 80%, with no accuracy loss and no noticeable hardware cost. Practically, this allows for deploying models on lower-end devices without compromising accuracy!
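Circling back to GST, here’s a heavily simplified sketch of what a segment-sampling training step could look like (illustrative only; encode_segment, readout and the surrounding training setup are hypothetical placeholders, not the paper’s actual implementation):

```python
import random
import torch

def gst_train_step(model, optimizer, loss_fn, segments, label, k=2):
    """One GST-style step: backpropagate through only k randomly sampled
    segments, so activations are kept for a constant number of segments
    regardless of how large the full graph is."""
    sampled = set(random.sample(range(len(segments)), k))
    seg_embeddings = []
    for i, seg in enumerate(segments):
        if i in sampled:
            seg_embeddings.append(model.encode_segment(seg))  # activations kept
        else:
            with torch.no_grad():  # no activations stored for these segments
                seg_embeddings.append(model.encode_segment(seg))
    # Combine segment embeddings into a whole-graph prediction
    prediction = model.readout(torch.stack(seg_embeddings))
    loss = loss_fn(prediction, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```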
The Pulse
Green ML is also a thing - IBM Research has developed a new NorthPole AI chip designed to run AI applications quickly and energy-efficiently. The chip is inspired by the human brain and features a two-dimensional array of memory blocks and interconnected CPUs. A key difference with NorthPole is that all of the device’s memory sits on the chip itself, rather than being attached separately, “blurring the boundary between memory and compute”, to paraphrase IBM Research’s Dharmendra Modha. With NorthPole, energy efficiency doesn’t need to rhyme with performance deficiency anymore, and we’re excited to see what new perspectives this opens up, both in research and in AI applications.
NVIDIA is king of GPUs, and dodgeball - The US recently rolled out a new set of restrictions on GPU exports to China, intended to severely narrow down their applicability for AI use cases. Yet NVIDIA still managed to come up with not two but three altered versions of its flagship hardware, namely the H20, L20, and L2 GPUs, which are set to start production next month. To put the icing on the cake, one of these new GPUs is 20% faster than the previous regulation-adjusted H100! While NVIDIA expects sales to “decline significantly” in China this quarter, they remain determined to serve their Chinese customers with continuous innovation. What do you think, will US restrictions eventually bar all strategic compute exports to China?
Azure right beneath your keyboard? Microsoft just announced it will be releasing not one but two AI chips in 2024, adding to the slew of anticipated new chips from competing companies set to roll out next year. The Cobalt CPU is designed to power general cloud services on Azure, while the Maia 100 AI accelerator is intended to run cloud AI workloads efficiently and is currently being tested on GPT-3.5 Turbo. With these releases, Microsoft is reaffirming its intent to be a key supplier of compute for the AI industry, and this looks like just the beginning of their roadmap.
And that’s all for the first edition. We hope you enjoyed reading through, and have a great week!
The Unify Team