Open Source AI Models in 2026: A No-BS Comparison

GLM-5.2, DeepSeek V4, Qwen3.7-Max, MiniMax M3, Kimi K2.7-Code — three of these dropped in the last two weeks. Real benchmarks, real weaknesses, no marketing fluff.

Chethan · June 15, 2026

Three of the five models in this article came out in the last two weeks. One of them dropped two days ago. The gap between open source and closed source isn't closing anymore — it's gone. And in a few areas, open is winning.

Let's be real for a second. If you'd told me a year ago that I'd be writing about open models beating Claude Opus on agentic benchmarks, running 24-hour autonomous coding sessions, and costing a tenth of what the big labs charge, I'd have laughed. But here we are.

The pace is genuinely insane. I'm not going to sugarcoat anything either — I'll tell you which models are good, which ones have holes, and where the hype doesn't match reality. No corporate hedging. Just what you actually need to know.

Here are the five models that matter right now: four with open weights, plus one closed flagship whose predecessor is still open.


GLM-5.2 (Z.ai)

GLM-5.2 came out on June 13, 2026. That's two days before I'm writing this. Z.ai (formerly Zhipu AI) clearly wanted to get this out fast — which makes sense given the context window jump they pulled off.

Here's the headline: 744B total parameters, ~40B active (MoE with 384 experts), 1M context window, 131K max output. The predecessor GLM-5.1 had 200K context and 64K output. This is a 5x context jump and 2x output jump in one iteration. That matters for coding — you can now feed it entire codebases.

It's MIT licensed (weights promised for mid-June), and here's something interesting: it's trained on Huawei Ascend chips. No NVIDIA dependency. Same as GLM-5.1. That's a geopolitical statement as much as a technical one, and it's working.

The big architectural change is that GLM-5.2 adopted DeepSeek Sparse Attention (DSA) — the same innovation DeepSeek V4 pioneered. It has two thinking modes: "High" and "Max." Use Max for coding.

Now here's my issue: GLM-5.2 published zero first-party benchmarks at launch. Zero. For a model this important, that's... a choice. The predecessor GLM-5.1 scored 58.4 on SWE-bench Pro, 86.2 on GPQA Diamond, and 92.7 on AIME 2026. Presumably 5.2 improved on those, but they haven't shown us the numbers. Community reports have it at 42.8 on BridgeBench Reasoning (claimed #1) and running at ~300 tokens/sec.

The GLM Coding Plan uses prompt-based pricing at $10-80/month tiers. That's roughly 1/10th the cost of Claude Max. The pricing model is unusual though — if you're used to per-token pricing, it'll confuse you at first.

Best for: Long-horizon agentic coding. 8-hour autonomous coding sessions. Massive context for entire codebases.

Weaknesses: No published benchmarks (frustrating). Creative writing is still flat. The pricing model might confuse people.


DeepSeek V4 (DeepSeek)

Released April 24, 2026, DeepSeek V4 is the reasoning monster. And it came in two flavors:

  • V4-Pro: 1.6T total / 49B active — the largest open-source model ever shipped
  • V4-Flash: 284B total / 13B active — fast, cheap, surprisingly capable

Both have 1M context windows. Both are MIT licensed with open weights.

Here's where it gets interesting. DeepSeek V4 introduced DeepSeek Sparse Attention (DSA) combined with token-wise compression. This isn't a minor optimization — it's what makes 1M context economically viable. Without sparse attention, running a million tokens of context would bankrupt you. With it, it's affordable.

And here's the thing: DSA is spreading. GLM-5.2 adopted it. MiniMax built their own variant (more on that below). DeepSeek's architecture innovation is quietly becoming the industry standard. That's a big deal.

The benchmarks are serious:

| Benchmark | Score |
|---|---|
| SWE-bench Pro | 59.0 |
| SWE-bench Verified | 80.6 |
| GPQA Diamond | 90.1 |
| LiveCodeBench | 93.5 |
| HMMT 2026 Feb | 95.2 |
| MMLU-Pro | 87.5 |

HMMT 2026 at 95.2. That's a competition-math benchmark. A model with 13B to 49B active parameters, depending on the variant, is doing competition math that would make most humans sweat.

Best for: Math, STEM, code reasoning. It's the thinking machine.

Weaknesses: It can be slow on hard problems. Extended reasoning chains run thousands of tokens before you get an answer. It's also verbose and formal in tone — not exactly chatty.


Qwen3.7-Max (Alibaba)

Released May 20, 2026. ~1T MoE (estimated — Alibaba hasn't officially disclosed). 1M context window.

Here's the complicated part: Qwen3.7-Max is closed. Proprietary. API only via Alibaba Cloud. But Qwen3.6 remains open under Apache 2.0. Alibaba is playing both sides — open the old model to build goodwill and developer adoption, monetize the new one through their cloud.

And the numbers on Qwen3.7-Max are the best in this entire comparison:

| Benchmark | Score | Note |
|---|---|---|
| SWE-bench Pro | 60.6 | Highest of all models listed |
| Terminal Bench 2.0 | 69.7 | Highest; this is the agentic benchmark |
| GPQA Diamond | 92.4 | Beats Claude Opus 4.6's 91.3 |
| HMMT 2026 Feb | 97.1 | |
| IFEval | 94.3 | Best instruction following |
| WMT24++ | 85.8 | Best multilingual (201+ languages) |
| MCP-Mark | 60.8 | |
| MCP-Atlas | 76.4 | Best MCP/tool integration |

The most "agentic" model here. Alibaba claims it ran a 35-hour autonomous kernel optimization with 1000+ tool calls. Thirty-five hours. Unsupervised. That's not a chatbot — that's an autonomous agent.

The MCP scores (60.8 MCP-Mark, 76.4 MCP-Atlas) matter if you're building tool-using agents. The MCP-Atlas score is the best in this comparison, narrowly ahead of Kimi K2.7-Code's 76.0. It also handles 201+ languages and crushes office productivity tasks.

Best for: Agent orchestration, long-horizon autonomous execution, multilingual work.

Weaknesses: The flagship is closed. The open Qwen3.6 is good but not at this level. You're trading openness for raw capability, and that tradeoff is real.


MiniMax M3 (MiniMax AI)

Released June 1, 2026. ~428B total / ~23B active (MoE). 1M context window (guaranteed minimum 512K), up to 512K output. Open weights on HuggingFace under the MiniMax Community License.

This is the efficiency monster. 23B active parameters is the smallest active footprint among the flagship models in this comparison (only DeepSeek's V4-Flash variant, at 13B, is smaller), and it punches way above its weight.

MiniMax brought back sparse attention for this generation (M2 used full attention). Their version is called MSA (MiniMax Sparse Attention), based on GQA with block-level sparse selection on real, uncompressed KVs. The efficiency numbers at 1M context are wild: 1/20th per-token compute vs the prior generation, 9x faster prefill, 15x faster decoding, ~100 tokens/sec output (about 3x faster than Opus).
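To make "block-level sparse selection" concrete, here is a toy sketch in plain Python. It is not MiniMax's MSA implementation: the block size, the mean-key scoring rule, and the function name `block_sparse_attention` are all assumptions chosen for illustration. It only shows the general idea: score blocks of keys cheaply, keep the top few blocks, then run ordinary softmax attention over the original (uncompressed) keys and values inside those blocks.

```python
import math

def block_sparse_attention(query, keys, values, block_size=4, top_blocks=2):
    """Toy single-query attention that only attends to the best-scoring key blocks.

    Illustrative only: block_size, top_blocks, and the mean-key scoring rule
    are assumptions, not MiniMax's published design.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Split key positions into contiguous blocks.
    blocks = [list(range(start, min(start + block_size, len(keys))))
              for start in range(0, len(keys), block_size)]

    # Score each block by the query's dot product with the block's mean key.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(i for b in ranked[:top_blocks] for i in blocks[b])

    # Ordinary scaled softmax attention, restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    output = [sum(w / total * values[i][d] for w, i in zip(weights, selected))
              for d in range(len(values[0]))]
    return output, selected

# Demo data: 16 positions, where positions 4-7 and 12-15 align with the query.
query = [1.0, 0.0]
keys = [[-1.0, 0.0] for _ in range(16)]
for i in range(4, 8):
    keys[i] = [2.0, 0.0]
for i in range(12, 16):
    keys[i] = [1.0, 0.0]
values = [[float(i)] for i in range(16)]

output, selected = block_sparse_attention(query, keys, values)
print(selected)  # positions that were actually attended to
```

In this toy, the query scores 8 of the 16 positions instead of all 16. At 1M tokens, the same trick means each query touches a small fixed budget of blocks rather than the whole context.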

It's also natively multimodal — text, image, and video from step one. Not bolted on. Built in.

The benchmarks are strong but not top-of-class:

| Benchmark | Score | Note |
|---|---|---|
| SWE-bench Pro | 59.0% | Beats GPT-5.5 and Gemini 3.1 Pro, approaches Opus 4.7 |
| Terminal-Bench 2.1 | 66.0% | |
| BrowseComp | 83.5 | Surpasses Opus 4.7's 79.3, which is wild |
| MCP Atlas | 74.2% | |

BrowseComp at 83.5 beating Opus 4.7 at 79.3 is genuinely shocking. That's a web browsing/research benchmark. An open model beating the most expensive closed model at finding information on the internet.

But the real story is the demos. MiniMax showed M3 running autonomously for 12 hours to reproduce an ICLR 2025 paper: 18 commits, 23 figures, zero human help. Then they showed it running ~24 hours on CUDA kernel optimization: 147 submissions, 1,959 tool calls, GPU utilization pushed from 7.6% to 71.3% (a 9.4x increase). No human intervention. For a 23B active model. That's absurd.

Pricing: $0.30/M input via Together/OpenRouter, $0.60/M direct from MiniMax, $1.20-2.40/M output. With cache, input drops to $0.06/M. Six cents per million tokens. That's practically free.

Best for: Efficiency and real-world autonomous tasks. If you want the most capability per dollar, this is it.

Weaknesses: Not the absolute best at any single coding benchmark. It ranks #3 on PostTrainBench (behind Opus 4.7 and GPT-5.5). Some benchmarks were run on MiniMax's own infrastructure with scaffolding, and independent verification is still pending.


Kimi K2.7-Code (Moonshot AI)

Released June 12, 2026 — three days ago. 1T total / 32B active (MoE, 384 experts, selects 8+1 shared per token). 256K context window. Modified MIT license, open weights on HuggingFace.
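If the "384 experts, selects 8+1 shared per token" line is opaque, it can be read roughly like this: a router scores the routed experts for each token, keeps the top 8, and one shared expert always runs on top. The sketch below is a generic top-k router in plain Python, not Moonshot's routing code. The random logits, the softmax-over-selected normalization, the function name `route_token`, and the reading of "384" as the routed-expert count are all assumptions.

```python
import math
import random

NUM_ROUTED_EXPERTS = 384  # from the spec line above (assumed to exclude the shared expert)
TOP_K = 8                 # routed experts chosen per token
NUM_SHARED_EXPERTS = 1    # always-on shared expert

def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k routed experts.

    Weights are a softmax over the selected logits only, which is one common
    convention; the real model may normalize differently.
    """
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Placeholder logits; a real router computes these from the token's hidden state.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
routing = route_token(logits)

experts_run = len(routing) + NUM_SHARED_EXPERTS
fraction_of_experts = experts_run / (NUM_ROUTED_EXPERTS + NUM_SHARED_EXPERTS)
print(f"experts run per token: {experts_run} ({fraction_of_experts:.1%} of all experts)")
```

Under these assumptions each token touches 9 of 385 expert networks. That is the main reason the active parameter count (32B) is so much smaller than the total (1T); attention layers and other always-on weights make up the rest of the active figure.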

Important: This is a coding specialist, not a general-purpose model. If you want general-purpose Kimi, that's still K2.6. K2.7-Code is purpose-built for long-horizon coding, period.

It's natively multimodal (MoonViT 400M vision encoder) and has one unusual constraint: thinking mode is always on. You can't disable it. If you try, it routes you to K2.6 instead. Moonshot clearly decided that for coding, you always want reasoning. The upside: it uses ~30% fewer thinking tokens than K2.6, so it's actually cheaper to run despite being better.

And "better" is an understatement. Look at these jumps from K2.6:

| Benchmark | K2.7-Code | K2.6 | Improvement |
|---|---|---|---|
| Kimi Code Bench v2 | 62.0 | 50.9 | +21.8% |
| Program Bench | 53.6 | 48.3 | +11% |
| MLS Bench Lite | 35.1 | 26.7 | +31.5% |
| MCP Atlas | 76.0 | 69.4 | +9.5% |
| MCP Mark Verified | 81.1 | 72.8 | +11.4% |

A +21.8% jump on Kimi Code Bench in a single iteration. For context: GPT-5.5 scores 69.0 on Kimi Code Bench v2 and Claude Opus 4.8 scores 67.4. K2.7-Code at 62.0 isn't there yet, but it's closing the gap fast — and it's open, running at $0.95/M input (cache miss) / $0.19/M input (cache hit) / $4.00/M output.
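One thing worth being explicit about: the Improvement figures are relative gains over K2.6, not percentage-point differences. A quick sketch that recomputes them from the scores in the table (the helper name `relative_gain_percent` is mine, not anything from Moonshot):

```python
# Scores copied from the K2.7-Code vs K2.6 table above: (K2.7-Code, K2.6).
scores = {
    "Kimi Code Bench v2": (62.0, 50.9),
    "Program Bench": (53.6, 48.3),
    "MLS Bench Lite": (35.1, 26.7),
    "MCP Atlas": (76.0, 69.4),
    "MCP Mark Verified": (81.1, 72.8),
}

def relative_gain_percent(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0

gains = {name: round(relative_gain_percent(new, old), 1)
         for name, (new, old) in scores.items()}
absolute_points = {name: round(new - old, 1)
                   for name, (new, old) in scores.items()}

for name in scores:
    print(f"{name:<20} +{absolute_points[name]} points  (+{gains[name]}% relative)")
```

So the +21.8% on Kimi Code Bench v2 is an 11.1-point gain in absolute terms. Both ways of stating it are fair, but they are easy to mix up when comparing against other labs' charts.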

Kimi Code plans start at $19/month.

Best for: Pure coding. If your workload is "write code, fix code, run code," this is the specialist.

Weaknesses: Don't use it for creative writing or casual chat. 256K context is the smallest here. Thinking mode always-on means you can't skip reasoning for simple tasks — every response pays the thinking tax.


The Sparse Attention Revolution

Here's a pattern you should notice: three of these five models use sparse attention. DeepSeek V4 pioneered DSA. GLM-5.2 adopted it. MiniMax built MSA (their own variant).

This isn't coincidence. It's the unlock for million-token context windows at affordable prices. Dense attention scales quadratically — doubling context quadruples compute. That's why 1M context used to be a pipe dream. Sparse attention breaks that quadratic wall by only attending to the tokens that matter.
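Here is the quadratic wall as back-of-the-envelope arithmetic. This is a simplified sketch: it counts query-key pairs only, ignores causal masking and the cost of choosing which tokens to attend to, and the 2,048-token budget is a hypothetical number, not a published DSA or MSA setting.

```python
def dense_pairs(context_len):
    """Query-key pairs scored by dense attention: every token attends to every token."""
    return context_len * context_len

def sparse_pairs(context_len, keys_per_query):
    """Pairs scored when each query attends to at most keys_per_query tokens."""
    return context_len * min(context_len, keys_per_query)

KEYS_PER_QUERY = 2048  # hypothetical budget, not a published DSA or MSA setting

for n in (128_000, 256_000, 512_000, 1_000_000):
    ratio = dense_pairs(n) / sparse_pairs(n, KEYS_PER_QUERY)
    print(f"{n:>9,} tokens: dense/sparse pair ratio = {ratio:,.0f}x")
```

Doubling the context quadruples the dense count but only doubles the sparse one, so the gap keeps widening as context grows. With this made-up budget, the difference at 1M tokens is a factor of several hundred.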

DeepSeek deserves credit here. DSA is spreading to other labs' models. When your architecture becomes the thing competitors copy, you've won something real. DeepSeek is doing to sparse attention what Google did with the Transformer: quietly setting the template everyone else follows.


Cost: Open vs. Closed

Let's talk money. Here's what the closed models cost:

| Model | Input | Output |
|---|---|---|
| GPT-5.5 | $5/M tokens | $30/M tokens |
| Claude Opus 4.8 | $5/M tokens | $25/M tokens |

And the open models:

| Model | Input | Output |
|---|---|---|
| MiniMax M3 (OpenRouter) | $0.30/M | $1.20-2.40/M |
| MiniMax M3 (cached input) | $0.06/M | |
| Kimi K2.7-Code (cache miss) | $0.95/M | $4.00/M |
| Kimi K2.7-Code (cache hit) | $0.19/M | $4.00/M |
| GLM-5.2 Coding Plan | $10-80/month flat | |

You can run MiniMax M3 at $0.06/M cached input. Claude Opus 4.8 is $5/M. That's an 83x difference on input cost. The output cost gap is 10-20x. GLM's prompt-based pricing is roughly 1/10th of Claude Max.
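Headline ratios are one thing; what a whole job costs is another. A rough calculator using the per-token prices from the tables above. The workload (50M input tokens, 2M output tokens) is invented for illustration, and two assumptions are baked in: the upper end of M3's $1.20-2.40 output range is used, and the cached-input row reuses that same output price.

```python
# Per-million-token prices in USD, copied from the two tables above.
# Assumptions: upper end of the $1.20-2.40 output range for MiniMax M3, and the
# cached-input row reuses the same output price (the article lists none).
PRICES = {
    "GPT-5.5": {"input": 5.00, "output": 30.00},
    "Claude Opus 4.8": {"input": 5.00, "output": 25.00},
    "MiniMax M3 (OpenRouter)": {"input": 0.30, "output": 2.40},
    "MiniMax M3 (cached input)": {"input": 0.06, "output": 2.40},
    "Kimi K2.7-Code (cache miss)": {"input": 0.95, "output": 4.00},
    "Kimi K2.7-Code (cache hit)": {"input": 0.19, "output": 4.00},
}

def job_cost(model, input_tokens, output_tokens):
    """Cost in USD of one job at the listed per-million-token prices."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Hypothetical agent workload: 50M input tokens and 2M output tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 50_000_000, 2_000_000

costs = {model: job_cost(model, INPUT_TOKENS, OUTPUT_TOKENS) for model in PRICES}
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:<30} ${cost:>8.2f}")

input_ratio = (PRICES["Claude Opus 4.8"]["input"]
               / PRICES["MiniMax M3 (cached input)"]["input"])
print(f"Opus 4.8 vs cached M3 input price: {input_ratio:.0f}x")
```

On this made-up workload, Opus 4.8 comes to about $300 and M3 at uncached OpenRouter prices to about $20, so the whole-job gap is closer to 15x than 83x. Output tokens and uncached input dilute the headline number, and your real ratio depends heavily on cache hit rate.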

If you're building anything that runs at scale (agents, automation pipelines, batch processing), the economics aren't even close. Depending on the model and cache hit rate, open weights win on listed per-token price by anywhere from about 5x to more than 80x.


So Which One Should You Use?

There's no single winner. That's the honest answer. It depends on what you're doing:

  • Pure coding? Kimi K2.7-Code or GLM-5.2 (Max mode).
  • Math and reasoning? DeepSeek V4. No other open-weight model here reports anything close to 95.2 on HMMT (the closed Qwen3.7-Max is higher at 97.1).
  • Agent orchestration and autonomy? Qwen3.7-Max — if you're okay with closed.
  • Maximum bang for the buck? MiniMax M3. 23B active, $0.06/M cached, BrowseComp beating Opus.
  • Massive context for entire codebases? GLM-5.2 (1M context, 131K output).

If you want to actually use these models on your Mac — browse websites, run terminal commands, read files, automate tasks — that's what CopperRiver does. It runs these open-source models locally and in the cloud, connects them to your actual system, and lets them do real work. Plans start at $9/month. It's the bridge between "these models are impressive on paper" and "these models are doing my work."


The Real Takeaway

Here's what I keep coming back to: three of these models came out in the last two weeks. GLM-5.2 dropped two days ago. Kimi K2.7-Code three days ago. MiniMax M3 two weeks ago. DeepSeek V4 is the oldest model on this list and it's from April — barely two months old.

The closed labs used to set the pace. OpenAI drops a model, everyone scrambles for six months. Now? The open labs are iterating faster than the closed labs can ship. Qwen3.7-Max (closed itself, but built by a lab that still ships open weights) beats Claude Opus 4.6 on GPQA Diamond. MiniMax M3 beats Opus 4.7 on BrowseComp. Kimi K2.7-Code is within striking distance of Opus 4.8 on coding benchmarks (62.0 vs 67.4 on Kimi Code Bench v2), at roughly 1/6th the cost.

The gap isn't just closed. Open source is pulling ahead in some areas, and doing it faster, cheaper, and more openly.

The big labs have a problem. They just don't know it yet.
