By: Monte M, Fabien Roger, Benjamin Wright, Joe Benton, Evhub, Jonathan Uesato, Hoagy
Can we mitigate alignment faking in RL training? We test three interventions: interrogation, scratchpad length penalties, and scratchpad monitors. All three can be effective, but interrogation can backfire if models lie.