The Knowledge Hub

Open vs Closed Models

The capability gap narrowed sharply, but on agentic work it has not closed. What remains is an operational profile question — and the answer depends on volume, data sensitivity, and how long your workflow has to run.

11 min read · Knowledge Hub module · by Kenny

Last reviewed June 2026

What this edition covers

First June 2026 edition — capability parity as of mid-2026 mapped across the leading open-weight and closed models
Llama 4 Community License corrected: the EU restriction is a product-release decision by Meta, not a license term; 700M MAU threshold and Meta-approval clause are the real license constraints
Self-hosting cost breakeven modelled: the crossover from closed API to self-hosting requires approximately 500M tokens per day — well above typical SMB operating volumes
Multi-turn safety degradation in open-weight models documented: 25.86%–92.78% attack success rates across eight tested models (arXiv:2511.03247)
Hybrid routing architecture described: closed frontier model for reasoning, cheaper model for volume, with model abstraction built into the orchestration layer

Every six months or so, the question resurfaces: can you switch to an open-weight model and keep the quality you need? In 2023, the honest answer was no. In 2024, it was “depends on the task.” By mid-2026, the strongest open-weight models have closed most of the historical gap: MiniMax M2.5 scores 80.2% on SWE-bench Verified, within a few points of leading closed models such as Gemini 3.1 Pro (80.6%, in preview). But the frontier still leads — Claude Opus 4.8 reaches 88.6%, roughly eight points clear.¹ On everyday tasks the headline gap is largely gone; on the hardest agentic work, it persists.

What hasn’t closed is the operational profile question. Open-weight and closed models make different promises about data isolation, version stability, safety governance, and total cost of ownership. For workflows that run across months — content cadences, research agents, CRM automation — those differences compound. The decision isn’t which model scores higher on a benchmark. It’s which set of operational properties fits the specific workflow you’re running.

What “open” actually means in 2026

The industry uses “open source” to mean several different things, and conflating them creates real legal and operational risk.

Open-weight means the trained model weights are publicly downloadable and can be run locally. Training code, training data, and the compute recipe may or may not be disclosed. All of the major “open source” models — Llama, Qwen, DeepSeek, Gemma, Mistral, Phi — are technically open-weight, not fully open source. You can run the model, but you often cannot audit what it learned from or reproduce its training.

The licensing landscape divides into four practical tiers.² The most permissive are Apache 2.0 (Qwen 3, Gemma 4, Mistral Large 3) and MIT (Phi-4, DeepSeek R1) — unrestricted commercial use, no revenue thresholds, no geographic limits. DeepSeek R1’s MIT weights at near-frontier reasoning quality, benchmarking above OpenAI’s o1 on AIME 2024 mathematical reasoning at approximately 96% lower cost per output token, was the most significant open-weight capability event of 2025.³

Meta’s Llama 4 sits in a separate tier. The Llama 4 Community License permits commercial use, but it carries two constraints that matter at scale: a 700 million monthly active user threshold above which Meta approval is required, and a requirement to contract directly with Meta Platforms Ireland for EU-domiciled entities. A separate product-release decision by Meta — distinct from the license itself — means multimodal Llama 4 variants are not currently available for use in the EU over regulatory uncertainty.⁴ The license is not the same as a prohibition; the availability constraint is. Singapore-based teams building for EU markets need to track both.

License failure in production is not a hypothetical. It’s the clause your legal team finds six months after deployment.

Where the gap remains: agentic complexity

General knowledge, mathematical reasoning, single-turn coding, and instruction following are now effectively equivalent across the leading open-weight and closed models. The remaining gap concentrates in long tool-use chains, production agentic coding, and multimodal agentic tasks — the categories most relevant to sophisticated AI-operated workflows.¹

A practical finding from MindStudio’s agentic coding research makes the gap more nuanced: the same open-weight model can show up to six times performance variation depending solely on harness design — how tool calls are structured, how failures are handled, how context is managed.⁵ Open-weight model performance at the application layer is not a fixed property. It’s an engineering problem. A well-designed harness can close much of the raw performance gap; but building that harness is a non-trivial investment that is separate from choosing the model.

For workflows where the agent must plan a multi-step research task, execute tool calls, interpret ambiguous results, recover from failures, and maintain coherent reasoning across dozens of turns, closed frontier models maintain the most measurable advantage. For simple, repetitive, well-defined tasks — extraction, classification, templated generation — open-weight models are now strong candidates.

The self-hosting math

The economic case for self-hosting open-weight models is real at specific volume thresholds, and not real below them.

At 1 million tokens per day (roughly 30 million tokens per month), using the DeepInfra API for Llama 3.3 70B costs approximately $0.12 per day. Self-hosting the same model on Lambda Labs hardware costs approximately $43 per day — a 358-fold difference at that usage level.⁶ The crossover only begins to appear around 500 million tokens per day, where self-hosting offers roughly 5x savings over managed APIs — but only if GPU utilisation stays above 50%. Below that floor, per-token costs increase tenfold, eliminating the economic case.

The total cost of self-hosting is 3–5x the raw GPU rental price. ML engineering for optimisation (quantisation, sharding, inference containers) averages $145,000 per year in the United States. Model update cycles cost approximately $12,000 in engineering time per six-to-eight week cycle. Infrastructure overhead — networking, load balancing, monitoring, incident response — adds further multipliers.⁶

For context: 500 million tokens per day at an average of 2,000 tokens per workflow run is 250,000 workflow executions per day. That is not a small-to-mid business scale. RTSN-type engagements — hundreds to low thousands of agent calls per day — are solidly in the regime where closed APIs are both cheaper and operationally simpler.

Safety governance: a responsibility that doesn’t transfer

Closed frontier APIs (Claude, GPT-4o, Gemini 2.5 Pro) continuously patch prompt injection vulnerabilities, jailbreak vectors, and capability edge cases. Open-weight deployments carry only the safety properties baked in at training time.

Research published at ACL 2025 found multi-turn attack success rates between 25.86% and 92.78% across eight tested open-weight models, with safety-oriented models showing more resilience and capability-oriented models showing greater vulnerability.⁷ Fine-tuning exacerbates this: a documented failure mode (arXiv:2310.03693) shows that fine-tuning on as few as 10 adversarially designed examples can degrade a model’s safety alignment, and the same principle applies to any open-weight model fine-tuned without adversarial robustness testing.⁸

For AI-operated workflows in long-horizon agentic settings, layered security controls — input filtering, output validation, kill switches — are required, not optional. Operators of open-weight models own that responsibility entirely. Closed providers handle it as a managed service.

Anthropic’s current pricing reflects the range of these governance options. Claude Opus 4.8 is priced at $5 input / $25 output per million tokens; Claude Sonnet 4.6 at $3 / $15 per million tokens.⁹ Data residency options (US-only inference, regional endpoints through AWS Bedrock and Google Vertex AI) provide additional controls for teams with specific data handling requirements who prefer not to self-host.

The four operational variables

The decision isn’t philosophical. It follows four variables.

Volume. Below approximately 5 million tokens per month: closed API, no further analysis required. Between 5 million and 30 million tokens per month: evaluate managed open-weight hosting (Groq, DeepInfra, Together AI) as a middle path — open model economics without self-hosting complexity. Above 500 million tokens per day: self-hosting becomes cost-rational for simple, repetitive tasks.

Data sensitivity. Standard business data: closed API acceptable. Regulated data (financial, healthcare, government-adjacent): evaluate whether Singapore’s PDPA cross-border transfer safeguards or the MAS FEAT Principles create data isolation requirements for your specific client vertical. Data that cannot leave your infrastructure under any circumstances: open-weight on-premise is the only fully compliant option — no managed API, regardless of vendor, provides genuine data isolation because inference happens on vendor hardware.

Operational capacity. No MLOps capability: closed API only. Self-hosting without MLOps is a liability, not an asset. Partial capacity: managed open-weight hosting as a compromise. Full MLOps capability: self-hosting is viable — budget 3–5x the GPU cost for total operational expense.

Task complexity. Simple, repetitive, well-defined tasks: open-weight fine-tuned models are strong candidates. Complex agentic workflows requiring multi-step reasoning, tool orchestration, and recovery from failure: closed frontier models maintain a meaningful advantage. Mixed workflows: hybrid routing, which routes by task type rather than model ideology.

The architecture that holds up

The most defensible architecture for AI-operated workflows at small-to-mid business scale is not a binary choice. The pattern that holds up in 2026:

Use a closed frontier model for complex reasoning and agentic coordination — the steps that require judgment, multi-step planning, and reliable tool use. Route high-volume, repetitive tasks to cheaper models: Claude Haiku, or a managed open-weight API (DeepSeek, Qwen), for classification, extraction, and templated generation. Build model abstraction into the orchestration layer — never hard-code model names into workflow logic. Use adapters that can be swapped when a better model releases or a price changes. For data-sensitive workflows, self-host only the specific inference layer that touches regulated data; the rest of the workflow can use managed APIs.

A Zapier survey published in early 2026 found 81% of enterprise leaders concerned about AI vendor dependency, and 47% report at least one key business function would stop if their primary AI vendor experienced significant downtime.¹⁰ The structural mitigation is model abstraction at the orchestration layer — separate business logic from model API calls before the dependency crystallises. Agentic lock-in is more durable than API lock-in; switching models mid-engagement is possible, but switching agent orchestration frameworks is an architectural rewrite.

Version stability is the under-discussed open-weight advantage for workflows running across months. Closed models deprecate without your consent. For operational workflows where consistency of output is load-bearing — a brand voice agent, an audit workflow with established calibration — the ability to pin a specific open-weight model snapshot indefinitely has genuine value. The structural response for closed-model deployments is to test new model versions on a shadow traffic split before the deprecation date, not to discover the change in production.

The open-vs-closed question in 2026 is not a capability question for most tasks. It’s a question about which set of operational properties — data isolation, version stability, safety governance, total cost of ownership, infrastructure burden — best matches the profile of the specific workflow you’re running and the team running it.

The Knowledge Hub is re-checked as the research moves — each module carries a ‘what changed’ note and its last-reviewed date at the top. Follow the modules you care about; skip the ones you don’t.

Get monthly Notes from RTSN

If you’re choosing a model architecture for a workflow that needs to run across months, that’s where a Discovery Call is most useful.

Book a Discovery Call See how we work