Google tested LLMs on superconductivity questions. The results are telling.

Google Research just published a paper in PNAS that I find genuinely interesting—not because it’s flashy, but because it’s honest about what LLMs can and can’t do in science.

The team, led by Subhashini Venugopalan and Eun-ah Kim, took high-temperature superconductivity as a case study. The field has been wide open since the 1986 discovery of cuprate superconductors, which won the Nobel Prize in 1987: thousands of papers, competing theories, and no consensus on the underlying mechanism. It’s a perfect stress test for an LLM claiming to be a research partner.

They asked six different LLMs challenging questions about cuprates, the copper-based compounds that conduct electricity with zero resistance at temperatures as high as about -140°C. A panel of domain experts graded the responses on accuracy, comprehensiveness, and how well they handled the nuances of open scientific debates.

The results? Two systems stood out: NotebookLM and a custom-built tool. What did they have in common? Both drew from a closed ecosystem of certified, quality-controlled sources. Not the open web, not random arXiv preprints—curated references.

NotebookLM ranked higher than I expected, honestly. I’ve used it for personal research and found it decent, but seeing it outperform general-purpose models on expert-level physics questions is a data point worth remembering.

The other four models? They struggled. Not because they couldn’t generate coherent text, but because they couldn’t navigate the ambiguity that defines real science. When a field has multiple competing theories, a good research partner needs to present them fairly, not collapse into one confident-sounding but wrong answer.

I’ve seen this pattern before. Google’s own earlier CURIE benchmark showed LLMs could handle basic analytic tasks across six scientific disciplines, but that was about regurgitating facts and simple analysis. This new work raises the bar: can an LLM act as a thought partner when the ground truth is unknown?

The answer right now is: barely, and only with tight guardrails.

What I appreciate about this study is that it doesn’t pretend otherwise. The paper identifies clear areas for improvement—handling contradictory evidence, citing sources properly, and knowing when to say “we don’t know.” These aren’t easy problems, but they’re the right ones to focus on.

There’s also a practical takeaway here for anyone using LLMs for research: the quality of the source material matters more than the model size. A smaller model fed with curated, peer-reviewed content can outperform a larger model scraping the entire internet. That’s not surprising to anyone who’s tried to fact-check an LLM’s citations, but it’s good to see it quantified.

Other groups at Google are pushing in similar directions—using AI to generate hypotheses, write expert-level software, or analyze single-cell data. But this superconductivity study feels more grounded. It’s not promising a revolution; it’s showing where the current tools fall short.

If you’re a grad student trying to get up to speed on cuprates, or a physicist exploring new directions, this paper tells you what you can trust and what you can’t. And that’s more useful than another breathless press release about AI transforming science.
