30 to 50 percent. That's the share of its own reinforcement learning (RL) training workflow - the process of learning from trial and error to improve responses - that Minimax's new M2.7 model handled itself. Over more than 100 rounds, it analyzed where it failed, modified its own training setup, ran evaluations, and decided whether to keep or revert each change. Minimax calls it "early echoes of self-evolution", and the benchmarks suggest that something did change.
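The analyze, modify, evaluate, keep-or-revert cycle described above can be pictured as a simple greedy loop. The sketch below is only an illustration under stated assumptions, not Minimax's actual pipeline: `keep_or_revert`, `toy_evaluate`, `make_toy_proposer`, and the single `lr` setting are all hypothetical stand-ins, and a real training setup would involve far more than one number.

```python
import random


def keep_or_revert(config, propose_change, evaluate, rounds=100):
    """Greedy loop: try a change, evaluate it, keep it only if the score improves."""
    best_score = evaluate(config)
    history = []
    for round_index in range(rounds):
        candidate = propose_change(config)        # modify the setup
        candidate_score = evaluate(candidate)     # run the evaluation
        kept = candidate_score > best_score       # decide: keep or revert
        if kept:
            config, best_score = candidate, candidate_score
        # When not kept, the previous config stays in place (the "revert").
        history.append((round_index, candidate_score, kept))
    return config, best_score, history


def toy_evaluate(config):
    """Hypothetical score: highest (zero) when lr is exactly 0.3."""
    return -(config["lr"] - 0.3) ** 2


def make_toy_proposer(seed=0):
    """Hypothetical change proposer: nudges lr by a small random amount."""
    rng = random.Random(seed)

    def propose(config):
        # Return a new dict so the current config is never mutated.
        return {**config, "lr": config["lr"] + rng.uniform(-0.1, 0.1)}

    return propose


start = {"lr": 1.0}
final_config, final_score, history = keep_or_revert(
    start, make_toy_proposer(), toy_evaluate
)
kept_count = sum(1 for _, _, kept in history if kept)
print(f"kept {kept_count} of {len(history)} changes; final score {final_score:.4f}")
```

Because a change is kept only when it beats the best score so far, the final score can never be worse than the starting one. What the toy version leaves out is the hard part: deciding which changes are worth proposing in the first place.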
Compared to M2.5, the jump is real. M2.7 scores 86.2% on PinchBench, an agent task benchmark measuring multi-step planning and tool use, up from M2.5's 82.5%. On the Artificial Analysis Intelligence Index - a composite score used to compare frontier models across reasoning, coding, and knowledge tasks - it gained 8 points to reach 50. On SWE-Pro, a coding benchmark built around actual software engineering problems, it hit 56.22%.
Open Weights Incoming, API Already Live
M2.7 launched as a closed model - weights not publicly available - which frustrated the local LLM community that had relied on Minimax releasing M2, M2.1, and M2.5 openly on Hugging Face for private, self-hosted deployments. Minimax has since confirmed that open weights are coming. Once they arrive, developers will be able to download and run the model on their own hardware, or fine-tune it (adapt the model on custom data for a specific task), without sending data to an external API.
The model uses a Mixture of Experts (MoE) architecture: 230 billion total parameters, but only about 10 billion are active for any given token, which keeps inference costs low. The context window is 204,800 tokens, enough to process roughly 500 pages of text in a single pass.
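The two figures above can be sanity-checked with quick arithmetic. This is a back-of-the-envelope sketch: the parameter counts and context size come from the article, while the 400 tokens per page is an assumption introduced here (about 300 words per page at roughly 1.3 tokens per word), and real documents will vary.

```python
TOTAL_PARAMS = 230e9      # total parameters, from the article
ACTIVE_PARAMS = 10e9      # active parameters, from the article
CONTEXT_TOKENS = 204_800  # context window, from the article


def active_fraction(total_params, active_params):
    """Share of the model's parameters that is active at once."""
    return active_params / total_params


def pages_in_context(context_tokens, tokens_per_page=400):
    """Rough page count; tokens_per_page is an assumed average, not a measured one."""
    return context_tokens / tokens_per_page


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
pages = pages_in_context(CONTEXT_TOKENS)
print(f"active share: {fraction:.1%}")     # about 4.3%
print(f"approximate pages: {pages:.0f}")   # 512 under the 400-tokens-per-page assumption
```

Under those assumptions, roughly one parameter in 23 does work at any moment, and the "roughly 500 pages" estimate holds.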
$0.30/M Tokens vs the Competition
API pricing is $0.30 per million input tokens and $1.20 per million output tokens. For context, running 1,000 average user requests (roughly 1,000 tokens in, 500 out) means 1 million input tokens and 500,000 output tokens, or $0.30 plus $0.60: about $0.90 total. That is competitive with ChatGPT at equivalent capability tiers.
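The $0.90 figure is easy to reproduce, and to adapt to a different workload. The sketch below is a small calculator using the two prices quoted above; the function name `batch_cost` and its parameters are illustrative, not part of any Minimax SDK.

```python
INPUT_PRICE_PER_M = 0.30   # USD per million input tokens, from the article
OUTPUT_PRICE_PER_M = 1.20  # USD per million output tokens, from the article


def batch_cost(requests, tokens_in, tokens_out,
               input_price=INPUT_PRICE_PER_M, output_price=OUTPUT_PRICE_PER_M):
    """Total USD cost for `requests` calls with the given per-request token counts."""
    input_cost = requests * tokens_in / 1_000_000 * input_price
    output_cost = requests * tokens_out / 1_000_000 * output_price
    return input_cost + output_cost


# The article's example: 1,000 requests, 1,000 tokens in and 500 tokens out each.
example_cost = batch_cost(requests=1_000, tokens_in=1_000, tokens_out=500)
print(f"${example_cost:.2f}")  # $0.90
```

Output tokens cost four times as much as input tokens here, so workloads that generate long responses, as multi-step agent runs often do, will skew more expensive than this average suggests.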
Minimax positioned M2.7 primarily for agent use cases - workflows where the model chains together multiple steps, calls external tools, and validates its own outputs. On MM Claw, a 40-skill evaluation designed specifically for that, it maintained a 97% adherence rate. Whether that holds in production, where user inputs are messier and failure modes compound, is the actual test.