GLM-5.2 Just Tied GPT-5.5 on the Hardest AI Benchmark. It's Open Source.

Z.ai dropped a 744B open-weights model the same hour the US restricted Anthropic's Fable 5. It ties GPT-5.5 on agentic benchmarks, it's MIT-licensed, and it can't be taken away from you.

Chethan · June 18, 2026 · 8 min read

On June 17, the US government quietly restricted access to Anthropic's newest model. On the same day (the exact same hour, actually) a Chinese AI lab called Z.ai dropped a model that competes at the same frontier.

And then they open-sourced the whole thing under an MIT license.

GLM-5.2 isn't just another model release. It's a 744-billion-parameter middle finger to the idea that frontier AI needs to be locked behind a paywall, a terms of service, and a government approval process. It scores 51 on the Artificial Analysis Intelligence Index — making it the #1 open-weights model in the world, full stop. But here's the number that actually matters: it scored 1524 on GDPval-AA v2, the benchmark for real-world agentic tasks.

GPT-5.5 scored 1514.

Let that sink in. An open-source model — one you can download, run, modify, and fork — is now statistically tied with OpenAI's flagship on the benchmark that measures whether an AI can actually do things. Not trivia. Not coding puzzles. Real, multi-step, agentic work. The kind of task where you say "research this, analyze the results, write a report, and email it to three people" and the model needs to chain together a dozen tool calls without screwing it up.

That's the benchmark that matters now. And an open model just aced it.

What GLM-5.2 Actually Is

Z.ai (formerly Zhipu AI) built GLM-5.2 as a Mixture-of-Experts model: 744 billion total parameters, but only 40 billion active per token during inference. It has the same architecture and size as GLM-5.1, which it replaces. That's important: they didn't brute-force this by throwing more parameters at the wall. They got dramatically better results from the same parameter count and the same inference footprint. That's a training win, not a scaling win.
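To put the Mixture-of-Experts numbers in perspective, here is a quick back-of-the-envelope sketch. It uses only the two parameter counts quoted above; the function name is just illustrative.

```python
# Parameter counts quoted in the article, in billions.
TOTAL_PARAMS_B = 744
ACTIVE_PARAMS_B = 40


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's weights that take part in processing one token."""
    return active_b / total_b


fraction = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"Active per token: {fraction:.1%} of all weights")
```

Roughly one weight in nineteen does work on any given token, which is why a 744B model can have per-token compute closer to that of a 40B dense model.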

The upgrade is real. Across the board:

+11 points on the Intelligence Index vs. GLM-5.1 (40 → 51)
Scientific reasoning: +16 points on CritPt (to 21%) and +12 on Humanity's Last Exam (to 40%)
Terminal-based agentic tasks (TerminalBench v2.1): +16 points, to 78%
Context window: 1 million tokens, up from 200K (five times larger)
Hallucination rate: down to 28.1%, from 29.4%
Non-hallucination ranking: third-best of all models tested, ahead of DeepSeek, GPT-5.5, and even Anthropic's Fable 5

That last one is worth pausing on. The non-hallucination metric specifically tests whether a model will say "I don't know" when it should, instead of confidently making things up. GLM-5.2 beats all but two of the models tested on intellectual honesty, GPT-5.5 and Fable 5 included. In a world where AI hallucinations are actively causing real problems, that's not a footnote. It's arguably the most important number on the list.

The model is available on Z.ai's first-party API at $1.40 per million input tokens and $4.40 per million output tokens. That sounds expensive for open weights, and it is, but third-party providers are already undercutting it aggressively. Multiple HN commenters reported getting unlimited tokens for $50/month from independent hosting providers. Others found API rates at about a third of Z.ai's official pricing.

The model is also available on DeepInfra, Novita, Nebius, Fireworks, Siliconflow, and others. Competition does what competition does.
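Here is a rough sketch of how the metered first-party pricing compares with the $50/month flat plans commenters mentioned. The two per-million rates and the flat price come from the figures above; the workload (20,000 input and 40,000 output tokens per request) is a made-up assumption, so plug in your own numbers.

```python
# Z.ai first-party prices quoted in the article, in USD per million tokens.
INPUT_USD_PER_M = 1.40
OUTPUT_USD_PER_M = 4.40
FLAT_PLAN_USD = 50.00  # monthly flat-rate plan reported by HN commenters


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Metered cost in USD of one request at the first-party rates."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def breakeven_requests(input_tokens: int, output_tokens: int) -> float:
    """Requests per month at which the flat plan matches metered spending."""
    return FLAT_PLAN_USD / request_cost(input_tokens, output_tokens)


# Hypothetical workload, not a figure from the article.
cost = request_cost(20_000, 40_000)
breakeven = breakeven_requests(20_000, 40_000)
print(f"Per request: ${cost:.3f}")
print(f"Flat plan breaks even at about {breakeven:.0f} requests per month")
```

For a reasoning-heavy model, output tokens dominate the bill, so the break-even point arrives after only a few hundred requests under these assumptions.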

The Political Timing Isn't Subtle

Here's the part that makes this more than a benchmark story.

GLM-5.2 launched at 5:21 PM Beijing time on June 17. That was the exact same hour the US government delivered its restriction letter to Anthropic regarding Claude Fable 5 — the first Mythos-class model, which had just hit #1 globally after its June 9 launch.

Z.ai's founder framed the release in explicitly political terms:

"At a time when access to frontier models is abruptly cut off for non-technical reasons, we are even more convinced of one thing: science should be global. The path to AGI must never be gated."

Was the timing a coincidence? Of course not. As one HN commenter put it: "This was rushed to hang on the coattails of the Mythos drama — 'hey, sorry you can't use Fable, but try us while you wait this weekend!'"

Another commenter was even more blunt: "In the last few days, Chinese labs have given us MiniMax M3, Kimi K2.7, and now GLM-5.2. Meanwhile the US is censoring models. Reads like fiction."

And you know what? That's fine. That's how markets work. A gap opened in the market — access to frontier intelligence — and someone filled it with a product that's free, open, and good enough to use. The free market response to government restriction is, apparently, more freedom. Whoever said irony was dead?

Why This Time Actually Feels Different

We've heard "open source caught the frontier" before. Llama 2 was supposed to do it. Mistral was supposed to do it. Qwen was supposed to do it. Every six months, someone claims the gap has closed, and every six months, the proprietary labs pull ahead again.

But something shifted in 2026, and the HN discussion around GLM-5.2 captures it perfectly. Here's a comment from user unrvl22 that got significant traction:

"It's literally Opus 4.7 quality at stupid prices. I know providers offering unlimited tokens for $50/month. This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models."

And from gertlabs, which runs independent benchmarks:

"GLM 5.2 is the first model we've tested that is unambiguously on par with, or better than, Opus 4.6."

Here's why the dynamic has changed:

1. Chinese labs are shipping faster than US labs now. In the last week alone, we got MiniMax M3 (44 on the Intelligence Index), Kimi K2.7-Code, and GLM-5.2 (51). Three frontier-tier open models in seven days. Meanwhile, the US is restricting its own models. The innovation gravity is shifting east, and fast.

2. The gap is measured in months, not years. GLM-5.2 is roughly at the level of the frontier from ~6 months ago. As one commenter noted: "If this is half a year behind, that's January Opus pre-nerf. This is it." That's not "good enough for open source." That's close enough that it doesn't matter for most use cases. The gap used to be a chasm. Now it's a crack.

3. Open weights can't be turned off. Once GLM-5.2 is on Hugging Face, it's permanent. No government can revoke access. No company can change the terms. No CEO can decide your use case violates policy. For anyone who got burned by the Fable 5 restriction — and there were a lot of them — that permanence is the entire value proposition. You can't un-open-source something.

The Honest Weaknesses

I'm not going to pretend this is perfect, because the HN commenters who actually used it won't let me.

It's text-only. No vision. No image input. This is increasingly unusual: other open models like Gemma 4, Qwen 3.6, and Kimi K2.x all accept images. Simon Willison flagged this as a real limitation: GLM-5.2 is excellent at web design tasks, but it can't look at a screenshot and iterate on the output. For a model this good at generating HTML and CSS, the inability to actually see what it built is a weird, frustrating gap. You'd have to pair it with a separate vision model for any UI work.

It's verbose and slow. The model uses 43,000 output tokens per benchmark task — up from GLM-5.1's 26K, and nearly double MiniMax M3's 24K. Of those 43K tokens, 37K are reasoning. One commenter reported it spent 15 minutes and 45K tokens of reasoning before writing a 400-line math library. That's not a dealbreaker for batch work or background agents, but it's rough for interactive coding sessions where you're sitting there waiting.
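The token figures above are easier to judge as ratios. This sketch only rearranges the numbers already quoted; the throughput line assumes the commenter's 15 minutes was spent entirely on generating those 45K reasoning tokens, which may not be exactly true.

```python
# Per-task output token counts quoted in the article.
GLM_5_2_OUTPUT = 43_000
GLM_5_2_REASONING = 37_000
GLM_5_1_OUTPUT = 26_000
MINIMAX_M3_OUTPUT = 24_000

reasoning_share = GLM_5_2_REASONING / GLM_5_2_OUTPUT
vs_glm_5_1 = GLM_5_2_OUTPUT / GLM_5_1_OUTPUT
vs_minimax_m3 = GLM_5_2_OUTPUT / MINIMAX_M3_OUTPUT

# Commenter anecdote: 45K reasoning tokens over 15 minutes.
# Assumption: the whole 15 minutes went to generating those tokens.
anecdote_tokens_per_second = 45_000 / (15 * 60)

print(f"Reasoning share of output: {reasoning_share:.0%}")
print(f"Output vs GLM-5.1: {vs_glm_5_1:.2f}x")
print(f"Output vs MiniMax M3: {vs_minimax_m3:.2f}x")
print(f"Implied speed in the anecdote: {anecdote_tokens_per_second:.0f} tokens/s")
```

Most of what you wait for (and pay for) is reasoning you never read, which is the real cost of the verbosity.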

Real-world coding isn't quite there yet. User tomerbd, who codes professionally with AI daily, ranked it bluntly: "mediocre with hand holding and super slow" compared to GPT-5.5 and Opus 4.8. Several other commenters echoed this — benchmarks say frontier, but the actual day-to-day coding experience still favors the proprietary models by a noticeable margin. One commenter running independent bug-finding benchmarks found GLM-5.2 caught 3 of 9 bugs — the same as much smaller self-hostable models like Gemma 4 and Qwen 3.6. Intelligence Index scores don't always translate to your specific workflow.

Cost per task is higher than peers. At $0.46 per benchmark task, GLM-5.2 is significantly more expensive to run than DeepSeek V4 Pro ($0.05, roughly nine times cheaper) or MiniMax M3 ($0.18). The intelligence is there, but you're paying for all those extra reasoning tokens. DeepSeek remains the value king; GLM-5.2 is the open-weights intelligence king. Different tools for different jobs.
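For a sense of scale, here is the same comparison as arithmetic. The per-task costs are the ones quoted above; the $100 budget is an arbitrary illustration.

```python
# Cost per benchmark task in USD, as quoted in the article.
COST_PER_TASK_USD = {
    "GLM-5.2": 0.46,
    "MiniMax M3": 0.18,
    "DeepSeek V4 Pro": 0.05,
}


def cost_multiple(model: str, baseline: str) -> float:
    """How many times more expensive `model` is than `baseline` per task."""
    return COST_PER_TASK_USD[model] / COST_PER_TASK_USD[baseline]


def tasks_for_budget(model: str, budget_usd: float) -> int:
    """Whole benchmark-sized tasks that fit in a budget (illustrative only)."""
    return int(budget_usd / COST_PER_TASK_USD[model])


for name in COST_PER_TASK_USD:
    print(f"{name}: {tasks_for_budget(name, 100.0)} tasks per $100")

print(f"GLM-5.2 vs DeepSeek V4 Pro: {cost_multiple('GLM-5.2', 'DeepSeek V4 Pro'):.1f}x")
print(f"GLM-5.2 vs MiniMax M3: {cost_multiple('GLM-5.2', 'MiniMax M3'):.1f}x")
```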

Who Should Actually Care

If you're an individual developer paying $200/month for a frontier coding plan, this matters because you now have a viable alternative. Not a replacement — the coding experience isn't there yet — but a backup. A floor under your costs. A model you can switch to when the API you depend on gets rate-limited, restricted, or "temporarily unavailable for policy review." One commenter put it well: GLM-5.2 plus a $20/month OpenAI sub covers all but the most extreme multi-agent workflows.

If you're a startup, this matters because you can build products on infrastructure that can't be revoked. No more "what happens to our app if OpenAI changes their terms?" The answer is: you switch to GLM-5.2, or MiniMax M3, or DeepSeek V4, and your product keeps working. Your moat becomes your product, not your API access.
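In practice, "you switch" means putting a thin fallback layer between your product and whichever model serves it. The sketch below shows one way that could look. Everything in it is hypothetical: the provider functions are stand-ins rather than real client libraries, and a production version would also need timeouts, retries, and prompt adjustments per model.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider callable when it cannot serve a request."""


# A provider is any function that takes a prompt and returns a completion.
Provider = Callable[[str], str]


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, completion) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real API clients.
def restricted_api(prompt: str) -> str:
    raise ProviderError("temporarily unavailable for policy review")


def open_weights_host(prompt: str) -> str:
    return f"[glm-5.2] response to: {prompt}"


used, reply = complete_with_fallback(
    "Summarize this report",
    [("proprietary", restricted_api), ("glm-5.2", open_weights_host)],
)
print(used, "->", reply)
```

The point of the design is that the product code only ever calls `complete_with_fallback`, so changing or reordering providers is a configuration change rather than a rewrite.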

If you're an enterprise worried about data sovereignty, this matters because open weights mean you can run the model on your own hardware, on your own network, with your data never leaving your perimeter. The model that ties GPT-5.5 on agentic benchmarks can run entirely inside your firewall. No cloud dependency. No third-party risk assessment. Just compute.
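"Just compute" is still a lot of compute, so it's worth estimating before you order hardware. This is a minimal sketch that counts weight storage only. The precisions are assumptions (the article doesn't say what format the weights ship in), and the estimate ignores the KV cache, activations, and runtime overhead, which all add to the total.

```python
TOTAL_PARAMS = 744e9  # total parameter count quoted in the article


def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


# Assumed precisions for illustration: 16-bit, 8-bit, and 4-bit weights.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS, bits):,.0f} GB")
```

Even though only 40 billion parameters are active per token, all the experts typically need to be resident in memory, so the full 744 billion is what sets the hardware bill.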

And if you just care about not being at the mercy of anyone — governments, megacorps, API rate limits — this is the model that proves you don't have to be. The frontier is open. And nobody can close it.

The Bigger Picture

The AI industry spent three years telling us that frontier models required billion-dollar training runs, massive compute clusters, and proprietary moats that would only deepen over time. That the gap between open and closed would widen until open source was a hobbyist plaything.

Instead, the opposite happened.

The gap is now ~6 months and shrinking. The open models are MIT-licensed. They're available from a dozen providers at a fraction of proprietary pricing. They're being released faster than the proprietary labs can reopen the gap. And the US government, in a move that will be studied in business schools for decades, just handed Chinese AI labs the best marketing campaign they could never have bought: proof that proprietary frontier models are a strategic vulnerability.

GLM-5.2 isn't the best model in the world. GPT-5.5, Claude Fable 5, and Gemini 3.5 Ultra are still ahead on raw intelligence. But "best" isn't the only axis that matters. Accessible matters. Reliable matters. Permanent matters. Can't-be-taken-away matters.

The frontier is open. And this time, it's staying that way.

This is exactly why CopperRiver runs on open-source models — GLM, DeepSeek, Qwen, MiniMax, Kimi. Not because they're cheaper (though they are). Because your AI assistant should work for you, not for someone who might decide you can't use it tomorrow. Check it out — plans start at $9/mo.
