The conversation about artificial intelligence typically orbits around model architecture, parameter counts, and computational power. But according to Sushant Mehta, a Senior Research Engineer at Google DeepMind and Senior IEEE panel reviewer who leads quality efforts for Gemini’s coding capabilities, the industry has been looking in the wrong direction.
“Everyone fixates on making models bigger,” Mehta says. “But the real competitive moat isn’t size. It’s knowing exactly what data goes into your model, and having the discipline to make every data point count.”
It’s a contrarian stance in an industry obsessed with scale, but Mehta’s track record lends weight to his argument. As the architect behind Gemini’s Data Analysis Agent, launched at Google I/O 2024, and the quality lead for the Gemini GitHub integration, he’s turned data curation from an afterthought into a strategic tool. The results speak volumes: the Data Analysis Agent now handles 1.49 million requests, has processed over 20 million files, and maintains a 75% user favorability rating, metrics that reflect not just capability but consistent reliability.
Walk into any AI lab and you’ll find researchers debating model architectures and training techniques. What you won’t typically find is someone who has personally curated and “goldified,” as Mehta calls it, over 20,000 supervised fine-tuning and reinforcement learning data points across 42 distinct tasks. This meticulous, almost obsessive attention to data quality has become Mehta’s signature.
“Goldification isn’t just about labeling data correctly,” he explains. “It’s about understanding the loss patterns, identifying where models systematically fail, and then working backward to create training examples that address those specific failure modes. It’s detective work as much as it is engineering.”
This approach crystallized during a critical sprint before the Gemini I/O launch. Mehta noticed the model was struggling with code visualization tasks, specifically scenarios where users wanted to generate charts from data or recover from code errors. Rather than throwing more generic training data at the problem, he did something unusual: he wrote what he describes as “the first comprehensive SFT guidance document covering all major code error scenarios.”
The document, which runs to dozens of pages and has become required reading for DeepMind’s post-training teams, doesn’t just catalog error types. It provides a systematic methodology for collecting targeted data, working with expert annotators, and validating improvements. More importantly, it established a repeatable process that other teams could adopt.
The impact was immediate. Code error rates dropped by over 43% in certain benchmarks. More tellingly, the improvements held up in production, with users returning to Gemini 54% more frequently—a metric that reflects genuine utility rather than novelty.
Partnering with the Frontier
If data quality is Mehta’s obsession, vendor partnerships are his force multiplier. Unlike many AI researchers who treat external data providers as mere contractors, Mehta has built deep collaborations with frontier annotation companies like Turing, Scale AI, and Surge, treating them as strategic partners in the quest for data excellence.
“These companies employ some of the world’s best software engineers and domain experts,” Mehta notes. “If you engage them properly, they’re not just labeling your data, they’re helping you understand what good looks like at the edge cases.”
His work with Turing, detailed in a public case study the company published, exemplifies this approach. Rather than simply sending coding tasks for annotation, Mehta embedded himself in their workflow—helping design annotation interfaces, creating detailed rubrics for evaluating code quality, and establishing feedback loops that improved both the data and the evaluation criteria.
The collaboration yielded something unexpected: an outcome-based reinforcement learning methodology that achieved better performance than traditional approaches despite using zero human annotations in the pilot phase. By carefully structuring the reward signal around code execution outcomes rather than subjective quality assessments, Mehta’s team demonstrated that well-designed systems could sometimes surpass expensive human evaluation.
“The key insight was that code is one domain where ground truth is knowable,” Mehta reflects. “Either the code runs correctly or it doesn’t. We could use that binary signal to generate preference pairs automatically, then use those to train models that generalized far beyond the specific examples.”
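The production pipeline behind this work is not public, but the core idea Mehta describes, turning code execution into an automatic preference signal, can be sketched in a few lines. The snippet below is a minimal, illustrative version: it assumes a simple sandboxed Python run as the "ground truth" check, and the helper names (`runs_correctly`, `build_preference_pairs`) are hypothetical rather than anything from the Gemini codebase.

```python
import itertools
import subprocess
import tempfile

def runs_correctly(code: str, test_snippet: str, timeout: int = 10) -> bool:
    """Execute a candidate solution plus its test; the exit code is the ground truth."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_snippet)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def build_preference_pairs(prompt: str, candidates: list[str], test_snippet: str) -> list[dict]:
    """Label each sampled completion by execution outcome, then pair passes against failures.

    Each (chosen, rejected) pair can feed a preference-based trainer
    with no human annotation in the loop.
    """
    passing = [c for c in candidates if runs_correctly(c, test_snippet)]
    failing = [c for c in candidates if c not in passing]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good, bad in itertools.product(passing, failing)
    ]
```

The design choice worth noting is that the binary pass/fail signal never claims to measure code quality directly; it only orders completions relative to each other, which is exactly what preference-based training needs.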
This outcome-based approach is now being expanded across other verticals within DeepMind, with Mehta consulting on implementations for everything from PDF analysis to spreadsheet manipulation. His influence extends beyond DeepMind’s walls: as a featured speaker at the 6th Annual MLOps World GenAI Summit 2025, Mehta shared these insights with a global audience of researchers and practitioners, emphasizing how outcome-based reinforcement learning and deep vendor collaboration can redefine data quality in the age of generative AI.
From Theory to Production: The Evaluation Gap
Academic AI research and production AI systems exist in parallel universes. In academia, a model that achieves state-of-the-art performance on a benchmark is celebrated. In production, a model that fails once in every thousand interactions is a crisis.
Mehta has lived in both worlds. He maintains deep connections to academic research. His co-authored paper, “Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models,” explores fundamental architectural innovations that improve the performance-efficiency trade-off. In 2026, he was invited to serve as a judge at the Fourteenth International Conference on Learning Representations (ICLR)—a role that underscored his standing at the intersection of cutting-edge research and real-world deployment.
But his day job demands a different kind of rigor. When you’re shipping features to hundreds of millions of users, you can’t wait weeks for human evaluators to assess whether your latest model iteration is actually better. You need automated systems that can deliver human-quality judgments at machine speed.
Enter AutoRater, the evaluation infrastructure Mehta helped design and scale across DeepMind’s coding efforts. The system uses AI to evaluate AI outputs, comparing model responses against carefully curated gold standards and providing nuanced assessments that go beyond simple pass/fail judgments.
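AutoRater’s internals are proprietary, but the general pattern it embodies, an evaluator model scoring a candidate response against a curated gold reference and a rubric, is straightforward to sketch. The code below is an assumption-laden outline, not the real system: the `judge` callable stands in for whatever evaluation model the infrastructure uses, and the rubric wording is illustrative.

```python
import json
from typing import Callable

RUBRIC = (
    "Compare the CANDIDATE answer to the GOLD reference. "
    'Reply with JSON: {"verdict": "pass" or "fail", "reason": "..."}. '
    "Judge correctness and completeness, not surface wording."
)

def auto_rate(task: str, candidate: str, gold: str,
              judge: Callable[[str], str]) -> dict:
    """Ask a judge model (any text-in, text-out callable) for a structured comparison."""
    prompt = f"{RUBRIC}\n\nTASK:\n{task}\n\nGOLD:\n{gold}\n\nCANDIDATE:\n{candidate}"
    return json.loads(judge(prompt))

def agreement_with_humans(auto_verdicts: list[str], human_verdicts: list[str]) -> float:
    """Share of items where the automated verdict matches the human label --
    the kind of statistic behind an 'agreement with human evaluators' figure."""
    hits = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return hits / len(human_verdicts)
```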
“We’ve reached the point where AutoRater agreement with human evaluators exceeds 98% on most tasks,” Mehta shares. “That means we can run hundreds of experimental training runs and get reliable quality signals within hours instead of weeks. It’s transformed our iteration velocity.”
The numbers back this up. In Q4 of last year alone, Mehta’s team ran over 500 AutoRater evaluations to hill-climb model quality, identifying and fixing issues that would have been impossible to catch through manual review. Each evaluation cycle that previously took 1-2 days using human annotators now completes in 3-4 hours, with estimated cost savings of $30,000 to $40,000 per cycle.
But the real value isn’t just speed or cost; it’s the cultural shift it enables. Teams that once treated evaluation as a final gate before launch now treat it as a continuous feedback loop throughout development. Experiments that would have been too expensive to run are now routine. Edge cases that would have been discovered by users are now caught in testing.
Privacy Lessons from Maps
Mehta’s methodical approach to quality wasn’t born at DeepMind. It was forged during his previous role at Google Maps, where he led the development of privacy-preserving personalization systems that had to balance relevance with regulatory compliance.
The challenge was stark: European regulators had implemented the Digital Markets Act, requiring that personalization systems respect user privacy by default. For Maps, which relied heavily on location history to provide useful recommendations, this created an existential tension.
Mehta’s solution, On-Device Location History, kept sensitive user data exclusively on personal devices rather than in the cloud. But that was just the beginning. The harder problem was training machine learning models that could still deliver personalized recommendations without ever seeing individual user data.
His answer combined differential privacy, a mathematical framework that adds calibrated noise to protect individual data points, with federated learning techniques that allowed models to learn from aggregated patterns across millions of devices. The system he built now powers personalization for billions of Google Maps sessions monthly, achieving engagement improvements of 0.24% to 0.44% across key metrics while simultaneously reducing infrastructure costs by the equivalent of four full-time engineers annually.
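The production Maps system is far more involved, but the core combination described here, bounding each device’s contribution, averaging across devices, and adding calibrated Gaussian noise to the aggregate, can be sketched briefly. This is a minimal illustration of the standard DP federated-averaging recipe, with illustrative parameter values rather than anything from Google’s deployment.

```python
import numpy as np

def clip_update(update: np.ndarray, clip_norm: float) -> np.ndarray:
    """Bound one device's model update so no single user can dominate the average."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_federated_average(device_updates: list[np.ndarray],
                         clip_norm: float = 1.0,
                         noise_multiplier: float = 1.0,
                         seed: int = 0) -> np.ndarray:
    """Average clipped per-device updates, then add calibrated Gaussian noise.

    The server only ever sees this noisy aggregate; individual raw updates
    (and the location data behind them) stay on the device.
    """
    rng = np.random.default_rng(seed)
    n = len(device_updates)
    clipped = [clip_update(u, clip_norm) for u in device_updates]
    average = np.mean(clipped, axis=0)
    noise_std = noise_multiplier * clip_norm / n  # sensitivity of the mean is clip_norm / n
    return average + rng.normal(0.0, noise_std, size=average.shape)
```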
“That experience taught me something crucial,” Mehta reflects. “Constraints don’t just force trade-offs, they force creativity. Having to build systems that respect privacy made us better engineers. The same principle applies to data quality. When you commit to only using high-quality data, you can’t paper over problems with volume. You have to actually solve them.”
The Compounding Returns of Systematic Rigor
Six months into his tenure at DeepMind, Mehta received feedback from his manager that captures his approach: “Sushant is a dependable teammate who participated in several critical Code Yellows.” In Google’s terminology, a Code Yellow is an all-hands response to a production crisis. Most engineers try to avoid them. Mehta collects them like badges of honor.
“Those are the moments where theory meets reality,” he says. “A Code Yellow means users are experiencing real problems right now. You can’t philosophize about the ideal solution. You have to fix it, understand why it broke, and make sure it never breaks that way again.”
This mindset, part firefighter and part systematic optimizer, runs through all his work. When PDF analysis quality was languishing at 30%, Mehta didn’t just push for more training data. He conducted a systematic loss-pattern analysis, identified the top error modes, worked with Scale AI to collect targeted examples addressing those specific failures, and drove the quality up to 57.9%. When non-standard dataset performance was stuck at 28.6%, he applied the same methodology and pushed it to 77.6%.
These aren’t marginal improvements. They’re the difference between a feature users tolerate and one they rely on.
The systematic approach extends to his team collaborations. Mehta maintains detailed playbooks, comprehensive documentation, and clear escalation paths. When other teams at Google want to implement AutoRater, they don’t start from scratch; they use the self-serve playbook Mehta created. When new engineers join coding quality efforts, they don’t flounder through tribal knowledge; they read his SFT guidance documents and loss-pattern analyses.
“Infrastructure isn’t just code,” Mehta argues. “It’s knowledge. It’s process. It’s making sure the next person doesn’t have to rediscover everything from scratch.”
What Dependable AI Actually Looks Like
As AI systems move from impressive demos to mission-critical tools, the gap between capability and reliability becomes the defining challenge. A model that correctly analyzes spreadsheets 95% of the time isn’t good enough when users depend on it for financial decisions. A code generation system that works beautifully on simple examples but fails unpredictably on complex tasks creates more frustration than value.
Mehta’s contribution to addressing this challenge isn’t a single breakthrough or clever algorithm. It’s something more fundamental: a systematic methodology for building reliable systems on top of powerful but unpredictable models.
“The next phase of AI isn’t about bigger models,” he argues. “It’s about dependable ones. Models that know their limits, fail gracefully, and improve predictably. That requires different engineering disciplines than what got us here.”
Those disciplines, in Mehta’s formulation, include obsessive data curation, automated evaluation systems that enable rapid iteration, strategic partnerships with domain experts, and a culture where quality isn’t inspected in at the end but built in from the beginning.
The proof, as always, is in production. The Gemini features Mehta has shepherded to launch, from data analysis to GitHub integration, aren’t just technically sophisticated. They’re reliably useful, as evidenced by metrics showing users returning repeatedly and integrating these tools into their daily workflows.
Press coverage from outlets including The Verge, TechCrunch, and PC World has highlighted these launches as significant advances in practical AI utility. But perhaps the more telling recognition comes from within Google, where Mehta has received multiple peer awards and a $200,000 stock bonus for his contributions—recognition that, in a company of Google’s scale and talent density, signals exceptional impact.
The Road Ahead
As AI capabilities continue to advance, the gap between research breakthroughs and production reliability will only grow more critical. Mehta is already working on the next generation of challenges: extending outcome-based learning beyond code to domains like document analysis, building evaluation systems that can assess creativity and helpfulness rather than just correctness, and developing frameworks for AI systems that can collaborate with humans rather than just assist them.
“We’re still in the early innings,” he says, with the understated confidence of someone who has shipped features to hundreds of millions of users. “The models will keep getting more capable. The question is whether we can keep making them more trustworthy at the same pace. That’s not a model architecture problem. It’s an engineering culture problem.”
If Sushant Mehta’s track record is any indication, it’s a problem that yields to systematic rigor, strategic partnerships, and an almost stubborn belief that quality, not just capability, is the foundation of AI’s future. In an industry racing toward ever-larger models, he’s proving that sometimes the most important innovations happen in how you prepare the ground, not how tall you build the tower.
