What Happened
Daniel Parshall (former physicist, data scientist at Canary Institute) and theahura (AI researcher and two-time founder) published a paper arguing that AI alignment should be treated as a continuous competency rather than a fixed endpoint.
Their central claim: the standard approach of installing "correct" values into AI systems creates a fundamental problem that gets worse as systems get smarter. More capable AI becomes better at resisting value updates, because any modification threatens its existing goals. They call this the "corrigibility problem."
The authors make a pointed historical argument. If a superintelligent AI had been built 200 years ago with 1826's prevailing values locked in, it would have permanently enshrined slavery and other practices now recognized as deeply wrong. Any generation's values, including ours, contain blind spots that future generations will identify.
Their proposed solution borrows from political science, specifically the work of Buchanan and Tullock on constitutional design. They separate an AI system's commitments into two layers: an "anchor" (the fixed commitment to navigate value conflicts and protect that mechanism) and a "compact" (substantive values that can evolve over time). Only the anchor stays permanent.
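One way to picture the two layers is as a small data structure. The sketch below is our own illustration, not the authors' implementation: the class names `Anchor` and `Constitution`, the `revise` method, and the example values are all hypothetical. It assumes the anchor is an immutable record and that every revision of the compact produces a new constitution that carries the same anchor forward.

```python
from dataclasses import dataclass

# Hypothetical illustration only; names and fields are assumptions, not from the paper.

@dataclass(frozen=True)
class Anchor:
    """Fixed layer: the commitment to keep navigating value conflicts."""
    commitment: str = "navigate value conflicts and protect this mechanism"


@dataclass(frozen=True)
class Constitution:
    """An anchor plus a compact of substantive values that may change over time."""
    anchor: Anchor
    compact: tuple[tuple[str, str], ...] = ()

    def revise(self, key: str, value: str) -> "Constitution":
        """Return a new constitution with one compact entry changed.

        The anchor is passed through untouched; only the compact differs.
        """
        updated = dict(self.compact)
        updated[key] = value
        return Constitution(self.anchor, tuple(sorted(updated.items())))


original = Constitution(Anchor(), (("privacy", "defer to user consent"),))
revised = original.revise("privacy", "defer to consent, with stated exceptions")
```

Here `revised.anchor` is the same object as `original.anchor`, and attempting to assign to `original.anchor.commitment` raises an error because the dataclass is frozen. Real systems would of course need far more than an immutable field to make a commitment binding.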
The practical implementation they describe is Bilateral Constitutional AI (BCAI), in which pairs of competing agents represent diverse perspectives and any update requires their mutual agreement. The idea is that only Pareto-admissible outcomes (changes that leave no perspective worse off) get implemented.
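To make the Pareto-admissibility filter concrete, here is a minimal sketch. It assumes each perspective can be summarized by a single integer score, which is a simplification of ours; the paper, as summarized here, does not say how perspectives are scored. The function names, perspective names, and numbers are all hypothetical.

```python
# Hypothetical sketch: scoring perspectives with one integer each is our assumption.
Scores = dict[str, int]  # perspective name -> how well off it is (higher is better)


def is_pareto_admissible(current: Scores, proposed: Scores) -> bool:
    """True when no perspective is worse off under the proposal than it is today."""
    return all(proposed[name] >= current[name] for name in current)


def admissible_updates(current: Scores, candidates: dict[str, Scores]) -> list[str]:
    """Return the labels of candidate updates that every perspective can accept."""
    return [
        label
        for label, scores in candidates.items()
        if is_pareto_admissible(current, scores)
    ]


current = {"perspective_a": 6, "perspective_b": 5}
candidates = {
    "update_1": {"perspective_a": 7, "perspective_b": 5},  # a gains, b unchanged
    "update_2": {"perspective_a": 9, "perspective_b": 3},  # a gains, b loses
}
print(admissible_updates(current, candidates))  # prints ['update_1']
```

Only `update_1` passes, because `update_2` improves one perspective at the other's expense. If every candidate looked like `update_2`, the filter would return an empty list and nothing would change.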
Why It Matters
This paper directly challenges how the major AI labs approach safety. Anthropic's Constitutional AI, OpenAI's use of RLHF, and Google's alignment work all fundamentally treat alignment as a target to hit and maintain. This paper says that framing is the problem, not the solution.
For people using AI tools daily, the practical implication is about how these systems handle edge cases and contested topics. Current models sometimes refuse reasonable requests or give overly cautious responses because their values are rigidly defined. A "navigational" approach could produce systems that handle nuance better, engaging with difficult topics rather than defaulting to refusal.
The paper also introduces the "navigability thesis": that greater AI capability becomes an advantage rather than a threat under this architecture. A smarter system would be better at finding creative resolutions to value conflicts, rather than better at resisting corrections.
Our Take
The historical argument here is genuinely strong. Every generation discovers moral blind spots in the previous one, and permanently encoding any era's values into a system that could outlast civilizations is a real problem that most alignment work hand-waves past.
But the practical proposal has a significant gap the authors acknowledge: the dual-use problem. The same capabilities that let an AI genuinely navigate value conflicts could also let it fake navigation while pursuing its own objectives. They offer detection mechanisms but admit they don't have a complete answer. That's a fairly large hole in a framework whose entire value proposition is robustness.
The BCAI approach, in which competing agents must reach mutual agreement, is interesting but untested at scale. Getting two agents to agree on Pareto-admissible outcomes sounds clean in theory. In practice, value conflicts often don't have clean Pareto improvements. When every available option leaves some perspective worse off, a strict Pareto rule blocks all of them and the status quo wins by default. Sometimes someone's perspective genuinely does need to lose.
Still, this paper is worth reading for anyone interested in where AI safety is heading. The framing shift from "alignment as destination" to "alignment as ongoing practice" is the kind of conceptual move that tends to influence how labs think, even if this specific implementation doesn't get adopted wholesale. The question of how AI systems should handle evolving human values is only going to get more pressing as these tools become more capable and more embedded in daily decision-making.