The AI community was shaken by the release of Lumiere, Google's latest SOTA text-to-video model with mind-blo...come again?...oh...right, let's try again. Rewinds.
The AI community was shaken by the release of Sora, OpenAI's latest SOTA text-to-video model with mind-blowing video quality, extreme realism, and temporal consistency, yet again pushing the limits of generative AI applications! (Jokes aside, gotta feel sorry for Google's awful release timing, tough luck).
The technical details remain somewhat of a mystery, as the technical report only provides a high-level explanation of the methodology, but we can draw parallels with Lumiere, from which Sora seems to borrow some components. If this sounds interesting, feel free to check out our extensive deep dive where we compare both models!
Color Palette
Getting diffusion models to produce the intended result often requires fine-tuning. SegMoE lets you dynamically combine Stable Diffusion models into a Mixture of Experts, within minutes and without any training, to create larger models with broader knowledge and better prompt adherence.
Why would you care? - I mean, merging models on the fly without training is about as useful as it gets. The framework is also integrated with Hugging Face's Diffusers, so it can be used within their AutoPipeline for generating images or inpainting. Plus, the actual merging is as straightforward as running this code with your custom configuration:
from segmoe import SegMoEPipeline

# Build the Mixture of Experts pipeline from a custom YAML configuration
pipeline = SegMoEPipeline("config.yaml", device="cuda")

# Save the merged MoE model locally for later reuse
pipeline.save_pretrained("segmoe_v0")
How does it work? - SegMoE takes a configuration file that specifies (a) the base model used to generate the initial image, (b) the number of experts to use in the MoE model, (c) the type of layers to fuse, and (d) the experts used to generate the final image including their source model, and positive / negative prompts used to compute the router gate weights.
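For reference, here is a minimal sketch of such a config, with field names as we read them from the repository and placeholder expert model IDs (swap in your own):

base_model: stabilityai/stable-diffusion-xl-base-1.0
num_experts: 2
moe_layers: all  # "ff", "attn", or "all"
num_experts_per_tok: 2
experts:
  - source_model: your-org/photorealism-expert  # placeholder model ID
    positive_prompt: "photorealistic, detailed, sharp focus"
    negative_prompt: "cartoon, blurry, low quality"
  - source_model: your-org/anime-expert  # placeholder model ID
    positive_prompt: "anime style, vibrant colors, clean lines"
    negative_prompt: "photorealistic, grainy, washed out"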
With this config as input, calling the SegMoE pipeline:
Loads the base model, which can be either a base Stable Diffusion or Stable Diffusion XL model.
Sets up the experts by loading the source model for each expert and associating it with its corresponding positive / negative prompt pair.
Fuses each expert and their prompt pair with a LoRA model if specified.
Replaces the specified layer type (either feedforward layers, attention layers, or all) with SparseMoE layers. SparseMoE layers are inspired by Mixtral and contain a linear module (the router gate) that allocates each input to the experts that perform the computation (see the sketch after this list).
Initializes the router weights based on the hidden states computed from each expert's respective positive / negative prompts.
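To make the routing concrete, here is a minimal Mixtral-style sparse MoE layer in PyTorch. This is an illustrative sketch with names of our own choosing, not SegMoE's actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    # Illustrative Mixtral-style sparse MoE layer, not SegMoE's actual code
    def __init__(self, hidden_dim, experts, num_experts_per_tok=2):
        super().__init__()
        self.experts = nn.ModuleList(experts)  # e.g. feedforward blocks taken from each source model
        self.gate = nn.Linear(hidden_dim, len(experts), bias=False)  # router gate
        self.num_experts_per_tok = num_experts_per_tok

    def forward(self, x):  # x: (num_tokens, hidden_dim)
        scores = self.gate(x)  # (num_tokens, num_experts)
        # Keep only the top-k experts per token and renormalize their weights
        weights, selected = torch.topk(scores, self.num_experts_per_tok, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, k_idx = torch.where(selected == i)
            if token_idx.numel() == 0:
                continue  # no tokens routed to this expert
            # Add expert i's weighted contribution for the tokens routed to it
            out[token_idx] += weights[token_idx, k_idx, None] * expert(x[token_idx])
        return out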
SegMoE greatly improves the fidelity of output images to the prompt, but it relies entirely on the knowledge of its experts: the merged model is only as good as the best expert included in the mix. The framework is also still a work in progress regarding speed and memory optimizations, but it can already be used to create custom MoE diffusion models.
Check out the repository to get started.
The Lab
One ring to learn them all
Processing long language and video sequences is essential for models to learn complex relationships that shape our world. Long videos show how events evolve over time, while lengthy text sequences from books can offer additional insights absent in smaller passages.
Training on such vast amounts of data is difficult due to memory cost, computational complexity, and a lack of suitable datasets. Researchers from UC Berkeley propose a series of World Models able to process and answer questions about long video and text sequences involving millions of tokens. The models are trained by:
Extending the text-processing context of a base LLM using RingAttention, which leverages block-wise computation with sequence parallelism, combined with FlashAttention to optimize performance, and training on progressively longer contexts to save compute by learning shorter-range dependencies first (a simplified sketch of the ring idea follows this list).
Modifying the architecture of the resulting model to incorporate vision input, fine-tuning the new model on vision-language data of various lengths, and introducing specific labels to the data to mark the beginning / end of text / vision generation.
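To build intuition for RingAttention, here is a toy single-process simulation: each "host" owns one query block while key/value blocks rotate around the ring, and each host folds incoming blocks into an online-softmax accumulator. This is our own simplified sketch of the idea, not the paper's distributed implementation:

import torch

def ring_attention(q_blocks, k_blocks, v_blocks):
    # Toy simulation of RingAttention: host i owns query block i; K/V blocks
    # rotate around the ring one step at a time, and each host accumulates
    # exact attention incrementally via an online softmax.
    n = len(q_blocks)
    outputs = []
    for i in range(n):
        q = q_blocks[i]
        acc = torch.zeros_like(q)                      # running weighted sum of values
        row_max = torch.full((q.shape[0], 1), float("-inf"))
        denom = torch.zeros(q.shape[0], 1)
        for step in range(n):
            j = (i + step) % n                         # K/V block arriving at host i this step
            scores = q @ k_blocks[j].T / q.shape[-1] ** 0.5
            new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
            scale = torch.exp(row_max - new_max)       # rescale old stats to the new running max
            p = torch.exp(scores - new_max)
            acc = acc * scale + p @ v_blocks[j]
            denom = denom * scale + p.sum(dim=-1, keepdim=True)
            row_max = new_max
        outputs.append(acc / denom)                    # exact softmax attention for block i
    return torch.cat(outputs)

Since each host only ever holds one K/V block at a time, memory per device stays constant, which is what lets the total context length scale with the number of devices in the ring.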
Turns out auto-correcting can be useful
Training on synthetic data from a generative model is a popular approach to efficiently scale datasets, but studies have shown that training a model on enough of its own outputs can lead to model collapse. While this can be mitigated by introducing more human-generated data or manually fixing errors in synthetic data, neither approach is scalable given the size of modern datasets.
Researchers from Brown University and Google Research address the issue by introducing a self-correction mechanism to the training pipeline. Training with self-correction involves iteratively fine-tuning the model at each step on a combination of (a) ground truth data and (b) synthetic data generated by the checkpoint from the previous iteration. At each step, the synthetic data is adjusted by a correction function that shifts its probability distribution more or less in line with the ground truth data, depending on a parameterized strength.
This mechanism keeps weight updates stable, so the model doesn't deviate from the optimum obtained by training the initial generation from scratch on the ground truth dataset alone.
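As a toy illustration of the loop, with a fitted 1D Gaussian standing in for the generative model and a mean-shifting correction function of our own invention:

import numpy as np

rng = np.random.default_rng(0)

def correct(samples, target_mean, gamma):
    # Toy correction function: shift synthetic samples toward the ground-truth
    # mean; gamma=0 leaves them untouched, gamma=1 fully corrects the mean
    return samples + gamma * (target_mean - samples.mean())

real = rng.normal(loc=0.0, scale=1.0, size=1000)   # ground truth data
mu, sigma = real.mean(), real.std()                # "model" = fitted Gaussian

for generation in range(20):
    synthetic = rng.normal(mu, sigma, size=1000)   # sample from the current model
    synthetic = correct(synthetic, real.mean(), gamma=0.5)
    mixed = np.concatenate([real, synthetic])      # ground truth + corrected synthetic
    mu, sigma = mixed.mean(), mixed.std()          # "fine-tune" on the mixture

print(f"mean drift after 20 generations: {abs(mu - real.mean()):.4f}")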
More cutting-edge research:
FAST and straight - Transformer attention scales quadratically with input size, and sub-quadratic alternatives either compromise on accuracy or don't fully remove the quadratic scaling. FAST achieves O(N) complexity while keeping the accuracy of softmax-based dot-product attention. Given the Query (Q), Key (K), and Value (V) matrices of standard attention, FAST: (1) normalizes Q and K, (2) applies a polynomial kernel function on their product to get the Activation (A) matrix, and (3) breaks down the product of A and V as the sum of two Taylor series to reduce the number of accumulations and speed up computation (a generic linear-attention sketch in this spirit follows these briefs).
(Quality) size matters - Using better data for pre-training LLMs has been shown to improve performance, but extensive assessment of different samplers remains scarce. A new paper studies the impact of quality-based vs coverage-based sampling on pre-training. Quality sampling, which uses a proxy LLM to judge the usefulness of datapoints for training, is shown to consistently outperform full-data training with as little as 10% of the original dataset, while converging up to 70% faster.
Professional ranker - LinkedIn released a paper discussing deep learning-based exploration strategies for increasing training and deployment efficiency of large ranking models. They introduce LiRank, a large-scale ranking framework utilizing advanced modeling architectures like residual deep cross network, and optimization methods such as dense gating, transformers, and attention mechanisms. Results show significant improvements in key performance indicators for feed sessions, job applications, and ad clicks.
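For intuition on how attention can be made linear in sequence length, here is a generic kernelized-attention sketch in that spirit; note that the feature map below is a common positive stand-in, not FAST's actual polynomial / Taylor-series construction:

import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # Generic O(N) kernelized attention: compute phi(Q) @ (phi(K)^T V) instead
    # of softmax(Q K^T) V, so the N x N score matrix never materializes.
    q = F.normalize(q, dim=-1)                 # normalize Q and K, as in FAST's step (1)
    k = F.normalize(k, dim=-1)
    phi = lambda x: F.elu(x) + 1               # positive feature map (stand-in kernel)
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v               # (d, d) summary, cost linear in N
    denom = q @ k.sum(dim=-2).unsqueeze(-1)    # per-token normalizer
    return (q @ kv) / denom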
The Pulse
Customized powerhouse - Nvidia is creating a new business unit to design custom chips for cloud computing companies, including specialized AI processors. Nvidia hopes to assist these companies in developing custom AI chips while also expanding its reach in data centers, telecommunications, automotive, and gaming industries. By offering its technology and IP to partners, Nvidia could pose a competitive threat to existing players like Broadcom and Marvell.
Bigger and denser - ASML, the Dutch manufacturer of chip-making equipment and sole producer of extreme ultraviolet (EUV) photolithography tools, revealed its latest machine, priced at $350 million. The new technology offers improved resolution, enabling smaller feature sizes and increased transistor density on chips. Despite economic headwinds, ASML reports strong demand for its products, except in China, where US export restrictions apply.
Omniglot model - Cohere's research lab, Cohere For AI (C4AI), has developed Aya, a new open-source, massively multilingual, generative large language research model covering 101 languages, doubling the number covered by current open-source models. The model aims to close the gap in language coverage and cultural relevance in AI research, which has so far focused mainly on English and a few other languages, leaving many communities unsupported.
And that’s all for this edition, we hope you enjoyed reading through!
The Unify Dev Team.