It was my first time attending NeurIPS, and it was quite something—to see so many people whose papers you’ve read and whose work you’ve admired, all together under the same roof. It genuinely felt like being in the driver’s seat of the AI buzz we’re going through. So, I have put together a few takeaways from the conference that I would like to share.
Benchmarks for agents and beyond
The very first day was quite interesting, with a lot of contrasting ideas. The opening talk was from Apple (check out their paper The Illusion of Thinking for more background), and the crux of it was how reasoning models perform beyond benchmarks. In their study, they showed how reasoning models behave on puzzles like Tower of Hanoi and Checker Jumping, and discovered something counter-intuitive: on low-complexity puzzles, non-reasoning models perform about as well as reasoning models; at medium complexity, reasoning models are better; but at high complexity, both fail, even when the thinking budget is increased. And, quite interestingly, the number of thinking tokens does not increase indefinitely; it plateaus once complexity is high enough. The point the study made is that the performance of reasoning models, which are considered state-of-the-art for LLMs, is severely constrained.
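To make the setup concrete, here is a minimal sketch (not the paper’s actual harness, just my reading of it) of that kind of complexity sweep: ask a model to solve Tower of Hanoi at increasing disk counts, check the answer against a reference solver, and record how many thinking tokens were spent. The `call_model` function is a hypothetical stand-in for whatever LLM client you use.

```python
# Hypothetical complexity sweep in the spirit of the Apple study.
# `call_model` is a placeholder, not a real API.

def optimal_moves(n_disks: int) -> int:
    # Tower of Hanoi needs 2^n - 1 moves, so difficulty grows exponentially.
    return 2 ** n_disks - 1

def solve_hanoi(n, src="A", aux="B", dst="C", moves=None):
    # Reference solver used to check the model's proposed move list.
    if moves is None:
        moves = []
    if n == 0:
        return moves
    solve_hanoi(n - 1, src, dst, aux, moves)   # move n-1 disks onto the spare peg
    moves.append((src, dst))                   # move the largest disk
    solve_hanoi(n - 1, aux, src, dst, moves)   # move n-1 disks on top of it
    return moves

def call_model(prompt: str) -> tuple[list[tuple[str, str]], int]:
    # Hypothetical: returns (proposed move list, number of thinking tokens).
    raise NotImplementedError("plug in your model client here")

for n in range(3, 13):
    prompt = f"Solve Tower of Hanoi with {n} disks. List the moves."
    predicted, thinking_tokens = call_model(prompt)
    correct = predicted == solve_hanoi(n)
    print(n, optimal_moves(n), correct, thinking_tokens)
```

The interesting pattern the study reports is in the last two columns: accuracy collapses past a certain disk count, and the thinking-token count stops growing with it.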
But the same day there was an extensive tutorial on benchmarks, including a very interesting panel discussion with panelists from public policy, METR, the creators of SWE-bench (the coding benchmark), and a few more. It was a wide-ranging discussion, but two points have stuck with me:
- Creating a benchmark that lasts is tricky. Some have stood the test of time, or, as someone put it, are being used far beyond where they should be.
- The panelist from METR also expressed some displeasure about ‘exotic’ benchmarks like ARC-AGI, which he claimed (I’m paraphrasing) do not correspond to real-world applications and are instead someone’s idea of what intelligence could be (spatial reasoning, in this case). I have often been annoyed preparing for exams that give you riddles testing what they call spatio-temporal reasoning. Is that real intelligence? Is that even a good proxy?
That inevitably made me think of the Apple talk in the morning, which now also seemed like an ‘exotic’ game chosen as a proxy for intelligence. The authors were aware of this and have responded to critics, but the justification felt less convincing when viewed from this angle. The strategy seems to be: choose a game the LLMs have likely never seen during training, then extrapolate performance there to conclusions about general reasoning capabilities. But yes, I do understand that intelligence itself is an ill-defined concept and this evaluation problem will always be slippery. Nevertheless, these studies and ‘exotic datasets’ are interesting because they shed light on new domains for LLMs, and who knows where the next breakthrough will come from.
Apart from these highly philosophical discussions, there were some very practical trends around benchmarking agents. We are all aware of the limitations of traditional metrics: overly homogeneous test sets, data drift, and so on. But a new, dynamic testing framework is emerging with agents: generating new test instances during evaluation and keeping the test set alive instead of static. The key metric becomes whether the agent accomplishes the task, however you want to quantify that, rather than component-level scores like word-error rate or F1. A clear shift from what I’d call academic metrics to more product-centric ones, and nobody should be unhappy about that. One of the best examples came at the end of the day from Tesla’s driving agent. They built a world model from the millions of miles of driving data they have, and they can edit the inputs to this world model using natural language to create new on-road scenarios for testing their self-driving agent. A closed-loop simulator for testing and training agents. Honestly, the best production-grade example I saw.
BeeAI framework from IBM also fits neatly into this new paradigm of dynamic evaluation, though I’ll write more on that separately.
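To illustrate the shape of this dynamic-evaluation idea, here is a rough sketch, not tied to BeeAI or any other specific framework: every trial gets a freshly generated task instance, and the only score that matters is whether the agent completed it. `generate_task`, `run_agent`, and `task_completed` are hypothetical hooks you would wire to your own scenario generator, agent, and success check.

```python
# Rough sketch of dynamic agent evaluation: fresh task instances per trial,
# scored on end-to-end task completion rather than component metrics.

import random
from statistics import mean

def generate_task(seed: int) -> dict:
    # Stand-in generator: in practice this could be a simulator, a world model,
    # or an LLM that perturbs real scenarios into new ones.
    rng = random.Random(seed)
    constraints = rng.sample(
        ["before noon", "for 6 people", "with a projector", "next Tuesday"], k=2
    )
    return {"goal": "book a meeting room", "constraints": constraints}

def run_agent(task: dict) -> dict:
    # Hypothetical hook: call your agent and return whatever it produced.
    raise NotImplementedError("call your agent here")

def task_completed(task: dict, outcome: dict) -> bool:
    # Product-centric check: did the agent achieve the goal under the constraints?
    raise NotImplementedError("define success for your domain")

def evaluate(n_trials: int = 100) -> float:
    results = []
    for seed in range(n_trials):
        task = generate_task(seed)      # a new instance every trial
        outcome = run_agent(task)
        results.append(task_completed(task, outcome))
    return mean(results)                # task-completion rate, not F1
```

All the interesting engineering hides in `generate_task`: Tesla’s natural-language-editable world model is, in effect, a very sophisticated version of that hook.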
Coding agents
Research and industry both recognize the limitations of current coding agents when it comes to production-quality code. Everyone acknowledges the issues with SWE-bench, and admits that getting real production data for coding benchmarks is still a major bottleneck. If that’s the bottleneck, are we just one step away from Microsoft (which owns GitHub) or Google, with their armies of engineers, locking down the advantage? Maybe. But Claude Code doing extremely well keeps me hopeful that we’ll see meaningful competition rather than consolidation.
I see a new trend emerging as well: a huge proportion of papers on coding agents revolve around clever prompting strategies and decomposing problems into flows that might help the agent. But isn’t that just what all of us do anyway? You sketch a high-level plan before solving anything. It reminds me of a few years ago, when ML use exploded and thousands of papers emerged in which people applied standard ML techniques to a new domain, and master’s theses consisted of doing a giant grid search. Mine was the same, by the way ;)
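For what it’s worth, many of these pipelines reduce to something like the toy sketch below: one prompt to draft a step-by-step plan, then one prompt per step with the accumulated context. This is my own generic rendering of the pattern, not any particular paper’s method, and `complete` is a hypothetical stand-in for an LLM call.

```python
# Toy "decompose first, then solve" flow. `complete` is a placeholder LLM call.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def plan_then_solve(task: str) -> list[str]:
    # Step 1: ask for a high-level plan.
    plan = complete(
        f"Break this coding task into short, numbered implementation steps:\n{task}"
    )
    steps = [line for line in plan.splitlines() if line.strip()]

    # Step 2: implement each step, feeding back what has been done so far.
    outputs, context = [], ""
    for step in steps:
        out = complete(
            f"Task: {task}\nWork so far:\n{context}\nNow implement: {step}"
        )
        outputs.append(out)
        context += f"\n{step}\n{out}"
    return outputs
```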
A cute piece of work was Paper2Code, which tackles the perennial task of generating code from research papers to replicate or build on their studies. Something we’ve all desperately needed, so kudos to them.
IBM’s ALICE (Agentic Logic for Incident and Code bug Elimination) is another one I’m proud to see progressing. It brings together multi-agent collaboration, incident analysis, and code-level reasoning to help teams diagnose complex system issues faster and more intelligently. This sits right at the intersection of AI agents, observability, and software automation, an area that’s evolving ridiculously fast.
Efficient AI
There was a lot happening on Efficient AI. The most intriguing work was from companies like Eigen AI, Pruna AI, Furiosa AI, and Qualcomm (check out their demo on disaggregated LLM serving on AI accelerators). A lot of practical solutions for energy saving and compute optimization are finally making their way into the market, driven by the ever-growing crunch for chips.
As usual, the Han Lab stood out, with Jet-Nemotron and Radial Attention, a sparse attention mechanism for long video generation. Sparse models are clearly becoming the theme again, but this time they are expanding into other modalities as well.
Wrapping up
As always, a lot of the best parts of NeurIPS happen outside the sessions. The side conversations were rich: people comparing notes on agent evaluations, debugging LLM behaviors in the wild, and trying to make sense of the rapid shifts in hardware, plus a shared feeling that we’re building tools faster than we can define what “good” looks like. It’s chaotic, but in the fun way.
And yes, I left with more questions than answers, but that’s the whole point of going.