Can You Actually Replace Claude with a Local Model for Coding? (368 HN Commenters Weigh In)

A 771-upvote Hacker News thread asked if developers have fully swapped Claude/GPT for local models. The answers — real hardware, real numbers, real tradeoffs — reveal where AI coding actually stands in mid-2026.

Chethan·June 16, 2026

771 upvotes, 368 comments, one question: Can you actually replace Claude with a local model for coding?

Someone on Hacker News asked a simple question last week: "Has anyone here fully swapped Claude/GPT for a local model as their main coding tool — not just for side experiments?"

By this morning, it had 771 upvotes and 368 comments. That's not a viral shitpost. That's not a political flame war. That's a genuinely hard technical question that thousands of developers are quietly wrestling with right now.

And the answer, buried in those comments, is more interesting than a simple yes or no.

The short version: Yes, but with caveats you need to hear

People are doing it. Not hypothetically. Not "in the future." Right now, today, on hardware they own, with models that are free.

The consensus model of choice? Qwen 3.6 35B-A3B — a Mixture-of-Experts model with 35 billion parameters but only 3 billion active at any time. That architecture is the whole game. It's why this conversation even exists in mid-2026.

One commenter, Greenpants, put it bluntly:

"Comparing agentic Qwen3.6 35B to Claude Opus is like a junior with knowledge across the board that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me."

A junior developer who never sleeps, never asks for a raise, and costs zero dollars. That's what local coding AI looks like right now.

The hardware question nobody can avoid

Here's the part the "just run it locally" crowd tends to gloss over: you need real hardware.

The thread reveals a pretty clear landscape of what people are actually running:

Mac Studio with 128GB unified RAM — the Apple Silicon sweet spot. Unified memory means your GPU can address all of it, which is why Macs have quietly become the default recommendation for local inference. One commenter built a full Django + Wagtail site redesign on one.

Strix Halo laptops (128GB unified memory) — AMD's answer to Apple Silicon. Multiple users in the thread are running llama.cpp on these with Vulkan and getting surprisingly good performance. The AMD AI Max+ 395 chip is becoming a cult favorite.

Dual RTX 3090s (48GB VRAM total): the budget king. At roughly $850 per card on the used market, this setup fits 300K-token contexts and generates at respectable speeds. As a single-card point of comparison, one user reported Qwen 3.6 27B running at 65 tok/s on a 7900 XTX.

Dual RTX Pro 6000 Blackwell — the "I'm not messing around" tier. One user, arjie, broke down the full economics: 190 tok/s decode at concurrency 1, scaling to 980 tok/s at concurrency 16. Total power draw: 585W average, 849W peak. Hardware cost: ~$20,000. Their take? At 25 users paying $23/month, or 100 users at $6/month, you break even over three years.

That last one is interesting because it's not a hobbyist flex — it's someone pricing out a real alternative to cloud API spending.
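The break-even claim can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted above (roughly $20,000 of hardware, a three-year horizon, 585W average draw); the function names are illustrative, and electricity is reported in kWh only because the thread's power price is not given here.

```python
HARDWARE_COST_USD = 20_000   # approximate figure quoted in the thread
MONTHS = 36                  # three-year horizon quoted in the thread
AVG_POWER_WATTS = 585        # average draw quoted in the thread


def subscription_revenue(users: int, price_per_month: float, months: int = MONTHS) -> float:
    """Total subscription income over the period."""
    return users * price_per_month * months


def energy_kwh(avg_watts: float, months: int = MONTHS) -> float:
    """Energy used if the machine runs continuously at its average draw."""
    hours = months / 12 * 365 * 24
    return avg_watts / 1000 * hours


if __name__ == "__main__":
    for users, price in [(25, 23), (100, 6)]:
        revenue = subscription_revenue(users, price)
        surplus = revenue - HARDWARE_COST_USD
        print(f"{users} users at ${price}/month: ${revenue:,.0f} revenue, "
              f"${surplus:,.0f} over hardware cost")
    print(f"Energy over {MONTHS} months: {energy_kwh(AVG_POWER_WATTS):,.0f} kWh")
```

Under these assumptions both pricing options clear the hardware cost by a small margin ($700 and $1,600). Electricity for roughly 15,400 kWh comes on top of that and depends on the local rate.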

And then there's the pragmatic middle ground. The most upvoted hardware recommendation in the thread is dead simple: M4 Pro Mac Mini with 48GB unified RAM for about $2,000. No GPU shopping. No driver hell. No power bill anxiety. Just plug it in and go.

The models people are actually using

Forget benchmarks for a second. Here's what real developers reach for when they're trying to get work done:

Qwen 3.6 35B-A3B: the undisputed champion of this thread. Multiple users independently arrived at the same conclusion: this is the sweet spot for local coding. The MoE architecture means you get something close to 35B-class quality at something close to 3B-class speed. One user called it "definitely the one I reach for the most often." Another said it's "really the sweet spot for coding."

DeepSeek V4 Flash — the speed demon. The dual-RTX-Pro-6000 user reported prefill speeds of ~10,000 tokens per second. That's not a typo. Ten thousand. Someone else ran 118 million tokens through the API for $0.83.

Qwen 3.6 27B (dense) — when you need consistency over raw smarts. Users peg its quality somewhere between Claude Haiku 4.5 and Claude Sonnet 4.5. The key insight from porkloin: "The biggest thing that makes Qwen 3.x feel good is that it's the first time tool calling actually works consistently on local models."

Gemma 4 26B MoE (A4B) — the dark horse. One user reported 150 tok/s at Q4 quantization, about 3x faster than the dense 31B version with "very similar quality." That's the MoE multiplier doing its job again.

GLM 4.7 Flash — the agentic specialist. One user who tested extensively said it's "the best at coding agentically" among open models, though still not at GPT 5.5 or Opus 4.8 levels.

The pattern is obvious. The models winning the local-inference race are almost all Mixture-of-Experts architectures. When only 3-4 billion parameters activate per token instead of 30 billion, you get dramatically faster inference at a surprisingly small quality cost. This is the architectural shift that made the entire HN thread possible.
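A back-of-the-envelope sketch shows why active parameters matter so much. It assumes token generation is limited by how fast weights can be read from memory, so the ceiling on tokens per second is roughly memory bandwidth divided by the bytes of weights touched per token. The 400 GB/s bandwidth and 4-bit weights are hypothetical inputs, and the estimate ignores KV cache reads and other overhead, so real numbers will be lower.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(active_params_billion: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed if every active weight is read once per token."""
    return bandwidth_gb_s / weights_gb(active_params_billion, bits_per_weight)


if __name__ == "__main__":
    BITS = 4          # assumed 4-bit quantization
    BANDWIDTH = 400   # hypothetical memory bandwidth in GB/s

    # MoE: all 35B parameters must fit in memory, but only about 3B are read per token.
    print(f"35B-A3B resident weights: {weights_gb(35, BITS):.1f} GB")
    print(f"35B-A3B decode ceiling:   {decode_ceiling_tok_s(3, BITS, BANDWIDTH):.0f} tok/s")

    # Dense: every one of the 27B parameters is read for every token.
    print(f"27B dense resident weights: {weights_gb(27, BITS):.1f} GB")
    print(f"27B dense decode ceiling:   {decode_ceiling_tok_s(27, BITS, BANDWIDTH):.0f} tok/s")
```

The absolute numbers are only as good as the assumed bandwidth. The ratio is the point: the MoE model needs more memory than the dense one, but reads roughly a ninth of the weights per token.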

The honest problems nobody sugarcoats

This is where the thread gets genuinely useful, because people aren't selling anything. They're just sharing what breaks.

Looping behavior. Local models get stuck in loops. They repeat the same failed approach, re-read files they just read, and burn thinking tokens on dead ends. Greenpants noted this plainly: "It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying."

Lazy defaults. Leave any ambiguity in your prompt and the model takes the easiest route. Inline CSS instead of a stylesheet. Hardcoded values instead of config. The architecture suffers because nobody — not even a 35B parameter model — wants to do extra work it wasn't explicitly told to do.

The KV cache problem. This one's technical but critical. Older Qwen models couldn't preserve reasoning traces between conversation turns. Every time you sent a new message, the inference engine had to reprocess your entire conversation history — including all the thinking the model did on previous turns. That made long agentic coding sessions painfully slow.

The fix, discovered and shared in the thread: Qwen 3.6 now supports preserved thinking. In llama.cpp, you enable it with:

chat-template-kwargs = {"preserve_thinking": true}

One config change. Massive performance impact on multi-turn agentic work. This is the kind of detail you only learn from people actually running this stuff daily.
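Here is a toy model of why preserved thinking helps. It assumes the inference engine can only reuse cached work for the longest prefix that the new prompt shares with what it processed on the previous turn. The segment names are hypothetical stand-ins for long runs of tokens, and this is an illustration of the idea, not llama.cpp's actual cache logic.

```python
def shared_prefix_len(cached: list[str], prompt: list[str]) -> int:
    """Number of leading segments that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


# What the engine processed by the end of turn 1.
cached = ["system", "user_1", "think_1", "answer_1"]

# Turn 2 prompt when the template drops earlier thinking.
stripped = ["system", "user_1", "answer_1", "user_2"]

# Turn 2 prompt when earlier thinking is preserved.
preserved = ["system", "user_1", "think_1", "answer_1", "user_2"]

for name, prompt in [("stripped", stripped), ("preserved", preserved)]:
    reused = shared_prefix_len(cached, prompt)
    print(f"{name}: reuse {reused} segments, reprocess {len(prompt) - reused}")
```

In the stripped case the prompt diverges from the cache as soon as the thinking segment goes missing, so everything after that point has to be processed again. In the preserved case only the new user message is new work.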

Inconsistency. This is the real tradeoff. K0balt nailed it: "When Qwen works it works like Sonnet, when it fails it fails like Haiku. It's less consistent. Once you get an idea of what it can and can't bite off, it's pretty easy to break things into chunks."

You're trading consistency for control. The cloud models are more reliable because they're bigger. The local models are free, private, and yours — but you have to manage them like you'd manage a talented but green junior developer.

The "why bother" camp makes good points too

Not everyone in the thread was a convert. Some pushback was genuinely sharp.

ojr ran the numbers: "I can use Gemini 3 Flash with the harness I built for around 8 years and still not exceed the cost of a Mac Studio with 128GB. The price for privacy is very high."

That's a real calculation. If you're an individual developer spending $20/month on a cloud API, $2,000 of hardware takes 100 months, a little over eight years, to pay for itself, and that is before electricity. The local-first movement makes the most sense for two groups: people who care deeply about data privacy, and teams whose API bills have crossed into "this is a real budget line item" territory.
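The payback period is easy to rerun for your own numbers. This sketch uses the $2,000 hardware price and $20/month spend from the text; the $100 and $500 monthly figures are hypothetical examples of heavier usage. It ignores electricity, resale value, and any difference in output quality.

```python
def months_to_break_even(hardware_cost: float, monthly_cloud_spend: float) -> float:
    """Months of avoided cloud spend needed to cover the hardware purchase."""
    if monthly_cloud_spend <= 0:
        raise ValueError("monthly_cloud_spend must be positive")
    return hardware_cost / monthly_cloud_spend


if __name__ == "__main__":
    HARDWARE_COST = 2_000  # figure used in the text
    for spend in (20, 100, 500):  # 100 and 500 are hypothetical heavier budgets
        months = months_to_break_even(HARDWARE_COST, spend)
        print(f"${spend}/month: {months:.0f} months ({months / 12:.1f} years)")
```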

codinhood was blunter: "Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models to perform even close to Claude Code just is not worth it right now. If it was, it would be disruptive enough to be in the news."

The counterargument, of course, is that it is in the news — this thread exists because the gap is closing fast enough to be worth discussing. 771 people upvoted a question that wouldn't have made sense to ask a year ago.

There's also a third path nobody talks about

The most pragmatic comment in the entire thread might be the one about model combination. Instead of choosing between local and cloud, use both. Route the easy stuff — autocomplete, boilerplate, simple refactors — to a fast local model. Send the hard architecture decisions and complex debugging to Claude or GPT.

This is where the industry is actually heading. OpenRouter just launched something called Fusion that does exactly this. And it mirrors how good engineering teams already work: juniors handle the straightforward tickets, seniors review and tackle the gnarly ones.

The model is the developer. The routing layer is the tech lead.
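A routing layer can start out very simple. The sketch below is a hypothetical rule-based router, not how OpenRouter or any specific product works: the task categories come from the examples above, and the `Task` type, the `route` function, and the "local"/"cloud" labels are made up for illustration.

```python
from dataclasses import dataclass

# Task kinds the text suggests are easy enough for a fast local model.
LOCAL_KINDS = {"autocomplete", "boilerplate", "simple_refactor"}


@dataclass
class Task:
    kind: str          # e.g. "autocomplete", "architecture", "debugging"
    description: str


def route(task: Task) -> str:
    """Return "local" for easy task kinds and "cloud" for everything else."""
    return "local" if task.kind in LOCAL_KINDS else "cloud"


if __name__ == "__main__":
    tasks = [
        Task("boilerplate", "Generate a CRUD serializer"),
        Task("architecture", "Decide how to split the monolith"),
        Task("debugging", "Track down an intermittent deadlock"),
    ]
    for task in tasks:
        print(f"{task.kind}: {route(task)}")
```

Defaulting unknown task kinds to the cloud model is a deliberate choice here: given the inconsistency described earlier, it seems safer to overpay for an easy task than to hand a hard one to the weaker model.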

What this means for you

If you're a developer watching this space, here's the honest read from 368 people who've actually tried it:

Local models for coding are real. Not as good as Claude Opus. Not as consistent as GPT 5.5. But genuinely, provably useful for daily work — if you have the right hardware and the right expectations.

Qwen 3.6 35B-A3B is the model to try first. It's the consensus pick from people with no agenda.

Unified memory is the hardware story. Mac Studio, Mac Mini M4 Pro, Strix Halo — these machines exist specifically because this use case exists now.

The gap is closing. A year ago this thread would have been short and depressing. Today it's 368 comments of people sharing real setups, real numbers, and real tradeoffs. That trajectory matters more than any single benchmark.

You don't have to go all-in. The smartest developers in the thread aren't picking sides. They're using local models for what they're good at and cloud models for what they're good at. That hybrid approach is where the real productivity lives.

The question on Hacker News was whether you can replace Claude with a local model. The answer from the community is: partially, yes, and more every month. The more interesting question is whether you even need to choose — and increasingly, the answer is no.

Want to try local AI models without building a server rack in your closet? CopperRiver runs open-source models like GLM, DeepSeek, Qwen, and Kimi right on your Mac — no API keys, no per-token costs, no cloud dependencies. It browses, codes, and automates with the same models the Hacker News crowd is raving about. Plans start at $9/month.
