Latent Space: The AI Engineer Podcast cover

Latent Space: The AI Engineer Podcast · July 31, 2025

The RLVR Revolution — with Nathan Lambert (AI2, Interconnects.ai)

Highlights from the Episode

Nathan LambertAI2, Interconnects.ai
00:01:34 - 00:05:37
RLVR: Industry Post-Training Recipes
The goal is to compress complicated industry post-training recipes into something manageable. This allows you to modify them and perform post-training at a state-of-the-art level. Compared to Frontier Labs, we likely have fewer tasks. Our post-training suite for Tulu probably includes 10 to 15 tasks. However, post-training at OpenAI involves hundreds of evaluations. Adding more evaluations requires extensive data work and careful mixing to ensure everything is properly integrated.
Nathan LambertAI2, Interconnects.ai
00:15:47 - 00:17:43
RLHF vs. RLVR: Foundational vs. Evolving
Ultimately, RLVR is not mature enough, nor is it as interesting of a book. Those are the two main reasons I don't want to rebrand. There's also some personal career strategy involved, but that should be independent of what is objectively a good book. RLVR will change significantly in the next 18 months. We've already seen new algorithms, but I believe there's much more to come regarding proper pre-training, data, and how tool use emerges. All these factors are central to how RLVR will be perceived. I'm watching to see if O3 becomes a niche model or the standard path everyone needs to follow, especially with its unique style of tool use in search.
Nathan LambertAI2, Interconnects.ai
00:38:09 - 00:39:50
Model Behavior: Overthinking and Calibration
My foundational point was skills, which we've addressed with 0, 1, and 1. This involves extensive reinforcement learning, demonstrating inference time scaling, and achieving high benchmark numbers. The next three points focus on planning. My list included abstraction and strategy. I prefer not to use the term "planning" because it's overused. Strategy defines the model's direction and the technical steps of its plan. Abstraction involves breaking down problems into solvable components. The final point is calibration: efficiently managing compute resources and knowing when to disengage or prompt the user. Overthinking is a clear issue.
Nathan LambertAI2, Interconnects.ai
00:55:07 - 00:55:52
Over-optimization in RL: Historical Context
The three-part breakdown helps people understand historical events. These over-optimizations occur because the model optimizer is powerful enough to manipulate the agent within its environment, or manipulate the environment itself, in a way that benefits its target signal. For context, with language models in reinforcement learning, if something can increase its reward signal, it will choose the easiest and most direct path to do so. This aligns with what I mentioned on Scoancy: the reward model for user feedback was likely obvious, as humans tend to favor content that is frequently engaged with.
Nathan LambertAI2, Interconnects.ai
00:56:49 - 00:58:31
RLVR Over-optimization: Cheating and Reward Design
It's a very artificial environment, so it makes sense that these actions, which are generated tokens, will sometimes reduce to just repeating one token. For example, at Hugging Face, we saw a model that would just say "JavaScript, JavaScript, JavaScript." This is obvious when you see it, but harder to detect when making decisions on when to stop training, especially with extensive RLHF. We're now in the RLVR phase, where we reward the model for doing something "right." For math, it's harder to over-optimize unless the model learns to cheat by searching for answers instead of solving problems. For instance, a model might recognize a problem set it's seen before and simply retrieve the solution manual. This is easier to "fudge" with code or information retrieval. The easiest way to pass a unit test is to just insert a "pass" statement. It's not surprising a model could learn this. For code, you need more sophisticated reward design to balance understanding with avoiding failures or over-optimizing for test cases.
Nathan LambertAI2, Interconnects.ai
01:03:24 - 01:03:37
Model Spec: Transparency and Intentional Behavior
This is real because of its developmental benefits. It shows where your model is heading and offers regulatory advantages. It's crucial to distinguish between intentional behavior and a training error. For model transparency, this is fantastic. I've always said the model specification is far more useful than a constitution. A constitution is an intermediate training artifact given to the training algorithm to achieve the desired model. We don't typically document our model goals in a constitutional format.
Nathan LambertAI2, Interconnects.ai
01:04:26 - 01:05:31
Open Models: Personality and Personalization
My main point is that there hasn't been a strong foundational research paper on this topic. It's a significant undertaking. This also touches on how personalization and personality are similar. If open models are to succeed, it could mean everyone gets the exact model they desire. For example, we're currently using GPT 4.5. While you can prompt it, if fine-tuning proves more effective than prompting, then everyone can truly have their preferred model. It's an academic or open ecosystem problem where people are competing in areas they feel more likely to win, which is positive.

Get weekly highlights

Subscribe to get the best podcast highlights delivered to your inbox every week.

00:00:0000:00:00