## Schedule
| Time | Type | Title & Speakers |
|---|---|---|
| 8:50 - 9:00 | Opening Remarks | Organizers |
| 9:00 - 9:45 | Invited Talk | **Test-Time Scaling: A Foundation for AI Self-Improvement**<br>Azalia Mirhoseini (Stanford University)<br>Pre-training scaling laws have driven much of the progress in AI over the past few years. In this talk, we present test-time compute as a new frontier for AI scaling and self-improvement. We have entered an era where models themselves are powerful sources of intelligence that can indefinitely synthesize new experiences and strategies through thinking, reasoning, and interacting with external environments and tools. Two important enablers of model self-improvement are test-time compute scaling and training on experiences generated at test time. Building on these, we will discuss our recent work on AI self-improvement, including KernelBench, Weaver, SWiRL, and Cartridges, demonstrating why test-time scaling for reasoning represents a significant and largely untapped frontier for general artificial intelligence. |
| 9:45 - 10:00 | Contributed Talk 1 | **Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning**<br>Lucas Dionisopoulos, Prithviraj Ammanabrolu, Nicklas Majamaki |
| 10:00 - 10:15 | Contributed Talk 2 | **GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning**<br>Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, et al. |
| 10:15 - 11:30 | Poster Session 1 | |
| 11:30 - 12:15 | Invited Talk | **Exploration, Extrapolation and Chains of Thought**<br>Aviral Kumar (Carnegie Mellon University)<br>In this talk, I will address the central question of exploration in LLM reasoning, covering two notions of exploration: at train time and at test time. Train-time exploration enables us to solve challenging questions where sampling from the base model alone is not enough, by steering chains of thought toward useful parts of the search space. Test-time exploration, on the other hand, enables us to implement useful algorithmic strategies in chains of thought that improve extrapolation and the scaling of compute at test time. Taken together, these two axes help us make better use of computation for scaling LLM reasoning. |
| 12:15 - 13:45 | Lunch break | |
| 13:45 - 14:30 | Invited Talk | **Olmo 3 Think: Training a Fully Open Reasoning Model**<br>Nathan Lambert (Allen Institute for AI)<br>This talk covers the crucial details of training 7B to 32B parameter, fully open reasoning models, Olmo 3 Think, to rival Qwen 3, highlighting fresh results, trade-offs, and methods across midtraining, distillation with high-quality thinking SFT data, and reinforcement learning with verifiable rewards. The talk focuses on aspects of the training process, such as model architecture decisions, data sourcing, and training code design, that are often not shared by leading models and can enable a resurgence of research on reinforcement learning, tool use, and inference-time scaling. The goal of these models and releases, including various case studies in RL-Zero checkpoints, is to seed trusted and innovative reasoning and tool-use research. |
| 14:30 - 14:45 | Contributed Talk 3 | **OpenThoughts: Data Recipes for Reasoning Models**<br>Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, et al. |
| 14:45 - 15:00 | Contributed Talk 4 | **Learning Composable Chains-of-Thought**<br>Fangcong Yin, Zeyu Leo Liu, Liu Leqi, Xi Ye, Greg Durrett |
| 15:00 - 16:15 | Poster Session 2 | |
| 16:15 - 17:00 | Invited Talk | **Understanding Architectural Constraints on LLM Reasoning Abilities**<br>Michael Hahn (Saarland University)<br>The reasoning capabilities of LLMs have seen enormous progress, but it remains hard to predict when they fail and how many reasoning tokens they need to solve different problems. I will present two lines of research aiming to make reasoning abilities more predictable via theoretical bounds on the abilities of the underlying architecture, the Transformer. First, I will present our recent work aiming to predict on which algorithmic tasks transformers can generalize to longer inputs, and compare to LLM performance. Second, I will describe our recent work bounding the reasoning cost needed to solve various algorithmic problems with transformers. I will close by discussing problems for further research. |
| 17:00 - 17:15 | Closing Remarks | |