In the race for market dominance, every AI player has its own adoption strategy: some set the standard, others open-source it or weave it into cross-application interfaces. At its latest event, Apple finally revealed its take on Artificial (sorry, Apple) Intelligence, doubling down on customized emojis and on-device integration.
If you’re plugged into the latest tech, these features might look underwhelming compared to the current SOTA. Whether this is R&D lag or deliberate positioning, it makes for an interesting contrast with the frenzied hustle to stay ahead of the loss curve: Apple is focusing instead on making the most of matured tech. That said, only time will tell if the tortoise once again outruns the hare (except there’s a whole pack now, and they’re doped on training data, but small details).
Next week, in collaboration with Qdrant, we’ll dive into the shortcomings of traditional naive RAG systems and explore how advanced agentic RAG effectively addresses and overcomes these challenges. Join us for an exciting session through the link below!
Testing grounds
Building generalist AI agents that perform reliably across varied scenarios is hard, even with specialized agent frameworks. AgentGym is a novel framework focused on testing agent systems in diverse, evolving environments to ensure robustness.
Why would you care? - Most libraries help you build agent systems but offer little support for testing them. AgentGym provides the environments and utilities needed to do exactly that.
How does it work? - AgentGym offers a platform with diverse environments and tasks (web browsing and programming, for example), allowing agents to explore and learn broadly. Each environment is exposed as a service that agents can interact with.
The framework can be broken down into three main components:
AGENTEVOL is the core algorithm that enables LLM agents to learn and evolve on their own. It works by having the agent repeatedly explore different environments, try to solve tasks, receive feedback on how well it did, and then use that feedback to update its knowledge and skills.
AGENTTRAJ is a dataset of examples showing how agents can interact with the various environments in AGENTGYM. This dataset is used to train the initial "base" agent by having it imitate these examples, giving the agent a starting point for learning and exploring further on its own.
AGENTEVAL is a benchmark suite used to evaluate how well agents can perform in the diverse environments of AGENTGYM. It consists of a set of challenging tasks and instructions for the agents to complete. The agents are scored based on their success rate and efficiency in solving these tasks.
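The explore-feedback-update cycle at the heart of AGENTEVOL can be sketched in a few lines. Everything below (`ToyEnv`, `Agent`, the reward scheme) is an illustrative stand-in, not the AgentGym API, but it shows the shape of the loop: pick an environment, attempt the task, receive a reward, and fold that feedback back into the agent.

```python
import random

# Minimal sketch of the AGENTEVOL explore-and-evolve loop.
# ToyEnv and Agent are toy stand-ins, not the real AgentGym interfaces.

class ToyEnv:
    """A toy environment service: guess the hidden target action."""
    def __init__(self, target):
        self.target = target

    def step(self, action):
        # Reward 1.0 on success, 0.0 otherwise.
        return 1.0 if action == self.target else 0.0

class Agent:
    """Keeps per-action preference weights and samples from them."""
    def __init__(self, actions):
        self.weights = {a: 1.0 for a in actions}

    def act(self):
        acts, w = zip(*self.weights.items())
        return random.choices(acts, weights=w)[0]

    def learn(self, action, reward):
        # Reinforce actions that received positive feedback.
        self.weights[action] += reward

def agentevol(agent, envs, iterations=200):
    for _ in range(iterations):
        env = random.choice(envs)      # explore an environment
        action = agent.act()           # try to solve the task
        reward = env.step(action)      # receive feedback
        agent.learn(action, reward)    # update knowledge from feedback
    return agent

random.seed(0)
agent = agentevol(Agent(["a", "b", "c"]), [ToyEnv("b")])
best = max(agent.weights, key=agent.weights.get)
```

In the real framework the "update" step fine-tunes an LLM on its successful trajectories rather than bumping a weight table, but the control flow is the same.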
Check out the repository to get started.
The Lab
Mental Model
Buffer of Thoughts (BoT) is a novel thought-augmented reasoning framework that leverages a library of reusable, high-level reasoning strategies called "thought-templates".
BoT begins by using a "problem distiller" to analyze the input task and extract key information like variables, constraints, and objectives. Based on this distilled information, BoT then searches its "meta-buffer" for a relevant thought-template. This meta-buffer stores a collection of these templates, each offering a generalized approach to a specific problem type, categorized into areas like mathematical reasoning or code programming.
Once a suitable template is retrieved, BoT uses an "instantiated reasoning" process to adapt the template to the specifics of the current problem, generating a tailored solution. Finally, a "buffer manager" component refines and expands the meta-buffer by distilling new thought-templates from successful solutions, ensuring continuous learning and improvement of the system's reasoning capabilities.
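The four stages above can be sketched as plain functions. The templates, categories, and keyword-based "distiller" below are illustrative placeholders for what the paper implements with an LLM; the point is the pipeline: distil the problem, retrieve a matching template from the meta-buffer, instantiate it, and let the buffer manager fold successes back in.

```python
# Minimal sketch of the Buffer of Thoughts pipeline.
# Templates, categories, and the keyword "distiller" are toy stand-ins
# for the LLM-driven components in the paper.

meta_buffer = {
    "math": "Identify the unknowns {vars}, set up equations from {constraints}, solve.",
    "code": "Decompose {vars} into functions, handle {constraints}, test each part.",
}

def problem_distiller(task):
    """Extract key information: a problem category, variables, constraints."""
    category = "math" if "solve" in task else "code"
    return {"category": category, "vars": "x", "constraints": "the given facts"}

def retrieve_template(info):
    """Search the meta-buffer for a template matching the problem type."""
    return meta_buffer[info["category"]]

def instantiate(template, info):
    """Adapt the generic template to the specifics of this problem."""
    return template.format(vars=info["vars"], constraints=info["constraints"])

def buffer_manager(category, new_template):
    """Distil a successful solution back into a reusable template."""
    meta_buffer.setdefault(category, new_template)

info = problem_distiller("solve for x given 2x + 3 = 7")
plan = instantiate(retrieve_template(info), info)

# A successful solution can be distilled back into the buffer:
buffer_manager("geometry", "Sketch {vars}, then apply {constraints} step by step.")
```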
BoT consistently outperforms existing prompting methods across diverse reasoning tasks, achieving significant accuracy improvements while maintaining high efficiency and robustness.
Devin may cry
CodeR is a new coding agent system pushing the limits of code generation performance.
CodeR uses a team of five specialized language models working together. These "agents" - Manager, Reproducer, Fault Localizer, Editor, and Verifier - each have specific roles and abilities inspired by how human developers collaborate. The Manager analyzes the issue and selects a pre-defined plan, which is essentially a flowchart guiding the workflow. The Reproducer writes a test to expose the bug, the Fault Localizer pinpoints likely problem areas, the Editor modifies the code, and the Verifier checks if the issue is resolved through testing.
This structured approach, using flowcharts to pre-define possible solutions, allows CodeR to tackle complex problems more effectively than single-agent models. Additionally, CodeR leverages software engineering techniques like code coverage analysis and fault localization to further enhance its accuracy in finding and fixing bugs.
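The five roles and the "plan as flowchart" structure can be sketched as follows. The role names come from the paper, but the stub logic (a toy "repository" whose `add` subtracts instead of adding) is purely illustrative; in CodeR each role is an LLM agent operating on a real codebase.

```python
# Illustrative sketch of CodeR's role-based workflow on a toy "repository".
# The roles mirror the paper; the bug and fix below are stand-ins.

buggy_source = {"add": lambda a, b: a - b}   # repository with a planted bug

def reproducer():
    """Write a failing test that exposes the issue."""
    return lambda repo: repo["add"](2, 3) == 5

def fault_localizer(repo, test):
    """Pinpoint the likely faulty component (trivially 'add' here)."""
    return "add" if not test(repo) else None

def editor(repo, location):
    """Modify the code at the localized fault."""
    fixed = dict(repo)
    fixed[location] = lambda a, b: a + b
    return fixed

def verifier(repo, test):
    """Re-run the reproduction test to confirm the fix."""
    return test(repo)

def manager(repo):
    """Execute a pre-defined plan: reproduce -> localize -> edit -> verify."""
    test = reproducer()
    location = fault_localizer(repo, test)
    if location is None:
        return repo, True
    patched = editor(repo, location)
    return patched, verifier(patched, test)

patched, resolved = manager(buggy_source)
```

The pre-defined plan is what makes the workflow predictable: each agent hands off a well-typed artifact (a test, a location, a patch) to the next, much like a human team's issue-triage process.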
CodeR achieves new state-of-the-art performance on the SWE-bench Lite benchmark, successfully resolving 28% of real-world GitHub issues on the first try.
Back-and-forths
Auto Evol-Instruct proposes a fully automated framework to enhance instruction datasets for LLMs.
Auto Evol-Instruct begins with a basic set of instructions and uses an “Evol LLM” to generate more complex versions of these instructions. A second “Optimizer LLM” then analyzes the evolved instructions and identifies any issues, such as unclear wording or illogical progressions in complexity.
Based on these issues, the optimizer LLM refines the method used by the evol LLM to create even better instructions. This analysis and optimization cycle repeats iteratively, with the optimizer LLM learning from past mistakes to guide the evol LLM toward generating a more complex and diverse instruction dataset.
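The feedback loop between the two models can be sketched as below. The two "LLMs" are replaced by simple deterministic functions so the control flow is runnable end to end; the transformations, the length-based "issue" check, and the repair rule are all illustrative assumptions, not the paper's actual criteria.

```python
# Toy sketch of the Auto Evol-Instruct analysis-and-optimization cycle.
# Both "LLMs" are stubbed with deterministic functions.

def evol_llm(instruction, method):
    """Apply the current evolving method to produce a harder instruction."""
    for transform in method:
        instruction = transform(instruction)
    return instruction

def optimizer_llm(evolved, method):
    """Analyze evolved instructions for issues and refine the method."""
    issues = [i for i in evolved if len(i) > 120]  # e.g. overly long / unclear
    if issues:
        # Refine the method: drop the transform that caused the issue.
        method = method[:-1]
    return method

seed_instructions = ["Sort a list of numbers.", "Reverse a string."]
method = [
    lambda i: i + " Add input validation.",
    lambda i: i + " Provide time and space complexity analysis.",
    lambda i: i + " Also restate the problem formally, discuss edge cases, and compare at least three alternative algorithms in detail.",
]

for _ in range(3):  # iterative analysis-and-optimization cycle
    evolved = [evol_llm(i, method) for i in seed_instructions]
    method = optimizer_llm(evolved, method)
```

After the first cycle the overlong transform is pruned, and later cycles converge on a method that evolves instructions without tripping the issue check.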
Auto Evol-Instruct successfully automates the creation of complex and diverse instruction datasets, leading to significant performance improvements in instruction following, mathematical reasoning, and code generation.
The Pulse
GPT with a big A - Among the latest features announced for iOS 18, Apple plans to bring OpenAI's ChatGPT to the OS. Apple reportedly chose ChatGPT over Google's Gemini for better deal terms and the general public perception of the LLMs' performance. ChatGPT will be used alongside other models (for image generation, for example) to automate tasks and create content with natural language commands.
Everyone’s a public figure - Meta is facing backlash for its decision to start training its AI models using user data from all over the world, including publicly shared posts on Facebook and Instagram. While European Union users will have the option to opt out of this data collection, users outside of the EU will not have this choice.
Opening Pandora’s box - OpenAI researchers have developed new techniques for interpreting neural networks, specifically LLMs, by training sparse autoencoders to identify millions of "features" that represent human-interpretable concepts, including ones related to human imperfection, price increases, training logs, and rhetorical questions. They released their findings, code, and visualizations in this repository.
And that’s all for this edition; we hope you enjoyed reading through!
The Unify Dev team.