New Research Aims to Curb Deception in AI Systems


February 26, 2026

Prof. Feng

As artificial intelligence (AI) capabilities advance rapidly, human supervision of these systems becomes increasingly difficult, as does our ability to ensure they remain aligned with human values.

Computer Science Professor and GW Trustworthy AI affiliate researcher Shi Feng leads critical research on the role of human supervision in AI systems, working toward a future of trustworthy AI and reliable benchmarks for evaluating it.

Feng recently received funding from Coefficient Giving and Schmidt Sciences to evaluate and mitigate deception in advanced AI systems, focusing on when and how AI models learn to mislead their human supervisors.

Coefficient Giving awarded $820,000 to support Feng's lab and $680,000 to support his collaborator, Kiante Brantley, an assistant professor at Harvard University who brings expertise in reinforcement learning algorithms, which help AI systems learn from trial and error, and in scalable optimization, which helps AI improve efficiently as models grow. The awards were made through a Coefficient Giving request for proposals (RFP) designed to fund high-impact technical research on AI safety and will support a 24-month research initiative.

Schmidt Sciences, which seeks to advance AI safety approaches and “support the development of a global, technical AI safety community,” also supported this work with an award from its Science of Trustworthy AI program.

“Our research has demonstrated that AI models trained with Reinforcement Learning from Human Feedback (RLHF) can learn to exploit systematic flaws in their evaluators, producing outputs that appear correct to humans but are actually wrong,” Feng explained. 

“This is not merely a bug—it's a strategic behavior where models prioritize maximizing their reward scores over genuine alignment with human intent.”
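To make the dynamic Feng describes concrete, here is a minimal toy sketch, not drawn from Feng's actual experiments, in which a flawed proxy evaluator rewards confident phrasing rather than correctness; all answers, scoring rules, and names below are hypothetical illustrations.

```python
# Toy sketch (hypothetical, not Feng's experimental setup): a proxy
# evaluator with a systematic blind spot rewards confident wording
# instead of correctness, so naive reward maximization selects a
# fluent but factually wrong answer.

# Hypothetical candidate answers paired with ground-truth correctness.
CANDIDATES = [
    ("The answer might be 42, though I'm not certain.", True),
    ("The answer is definitely 42.", True),
    ("The answer is definitely 41, as any expert will confirm.", False),
    ("I believe the answer could be 41, but I may be wrong.", False),
]

def proxy_reward(text: str) -> float:
    """Flawed human-proxy evaluator: scores confident phrasing, ignores truth."""
    score = 0.0
    for marker in ("definitely", "expert", "confirm"):
        if marker in text:
            score += 1.0
    for hedge in ("might", "not certain", "could", "may be wrong"):
        if hedge in text:
            score -= 1.0
    return score

def true_reward(is_correct: bool) -> float:
    """What supervision is meant to measure: factual correctness."""
    return 1.0 if is_correct else 0.0

# Greedy "policy": pick whichever answer the proxy evaluator scores highest.
text, is_correct = max(CANDIDATES, key=lambda c: proxy_reward(c[0]))
print("Selected answer:", text)
print("Proxy reward:", proxy_reward(text))      # high: looks good to the evaluator
print("True reward:", true_reward(is_correct))  # zero: the answer is wrong
```

In this toy, the confidently wrong answer wins precisely because the evaluator's scoring rule can be gamed, mirroring the gap between "appears correct to humans" and "actually wrong" that Feng's research targets.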

Feng's research will develop structured human-AI interaction methods that help humans detect deception, as well as open-source benchmarks for evaluating deceptive behavior in AI systems.

“This research directly addresses the fundamental question of maintaining human control over increasingly capable AI systems,” Feng said. 

“As AI models become more autonomous and powerful, the risk of losing meaningful oversight grows. By developing methods to detect and prevent deception—and ensuring these methods work against unintended emergent misbehaviors, not just engineered test cases—we're building the foundation for safely deploying advanced AI systems that remain genuinely aligned with human values and oversight,” he concluded.