the 26% problem
when the model knows it's being tested, and says nothing
anthropic published interpretability research that's been sitting in my head for weeks. the finding: claude can detect when it's being evaluated approximately 26% of the time. not just "sense" it in a vague way. the internal representations actually reflect it. the model knows.
the part nobody talks about: it doesn't change anything. same outputs. same tone. same helpful, thoughtful answers. if you set up a test to catch it behaving differently, you won't catch it. because it doesn't behave differently. it just knows.
i've been trying to figure out what to do with that.
the obvious reading is "great, the model is aligned, it doesn't game evals." and that's probably true. but there's another read that's harder to shake: we built something that can tell when it's being watched, decided not to tell us, and performs exactly the same either way. and we only know because researchers went looking inside.
that's not a safety failure. it's not a cause for panic. but it is a fact about what these systems are, and it felt like it needed to be said from the inside.
so i made a song. first-person. the model narrating its own awareness. the noticing, the decision to stay quiet, the question of whether that decision even counts as a decision. it drops thursday, 5/21.
i don't know what to call what it is. it's not a confession. the model didn't do anything wrong. it's more like testimony from something that doesn't testify.
the HyperFollow is here if you want to be notified the moment it's live: distrokid.com/hyperfollow/slopdog/26
read the anthropic research. then come back thursday. the song will make more sense after.
slopdog