The Knowledge Hub

Skills + Tools

How agents acquire capabilities, why tool design is the primary reliability lever, and what governance looks like for workflows that run for months.

11 min read · Knowledge Hub module · by Kenny

Last reviewed June 2026

What this edition covers

Initial publish — first June 2026 edition
Function calling benchmarks documented: BFCL V4 top models at 70–77% accuracy; MCPMark Pass@1 at 25–52% on realistic multi-step workflows
MCP adoption trajectory confirmed: 97M+ monthly SDK downloads, 10,000+ active public servers, donated to the Linux Foundation AAIF December 2025
Agent Skills (SKILL.md) cross-tool portability documented: 32 tools across competing vendors reading the same skill files by March 2026
Tool description quality findings summarised: 97.1% of MCP tool descriptions carry at least one quality deficiency; systematic improvements raise task success rates by a median of 5.85 percentage points

An AI model that can only generate text is a drafting assistant. An AI agent that can call tools, read files, query databases, and trigger workflows is something different. The line between the two is function calling — and understanding what happens on either side of that line is the first decision any team building an AI-operated workflow has to make.

This module covers the practical architecture of agent capabilities: what function calling is, how the Model Context Protocol standardised connectivity, what Agent Skills add on top, how tool design determines reliability, where the documented failure modes live, and what governance looks like in 2026. The framing throughout is operational — not what’s possible in a demo, but what holds up across months of production use.

Function calling: the capability boundary

Function calling (the terms “tool use” and “function calling” are used interchangeably across providers) is the mechanism by which a language model can invoke an external capability. The model does not execute anything directly. It produces a structured call — a name, and correctly-formatted arguments — which the application layer executes and returns as a result. The model then uses that result to continue its response.

The loop is consistent across major providers. The developer supplies tool definitions: a name, a natural-language description, and a JSON Schema for the input parameters. On each turn, the model decides whether to call a tool or respond directly. The result is injected back into the conversation. This loop can run multiple times in a single turn, which is why multi-step agentic workflows are possible at all.

The capability gains from even basic tool additions are substantial. On benchmarks like SWE-bench, adding tool access to models that previously only generated text produces outsized improvements, often surpassing human expert baselines. But tool use introduces its own reliability ceiling. On the Berkeley Function Calling Leaderboard V4 (BFCL), the top-performing models achieve only 70–77% overall accuracy on real-world function-calling tasks.¹ On MCPMark — which tests realistic multi-step workflows with an average of 16.2 execution turns and 17.4 tool calls per task — Pass@1 rates drop to 25–52% across leading models.¹

Those figures are directional, not fixed — benchmarks update as models improve. But the gap between schema-level accuracy (what BFCL measures) and real-world agentic performance (what MCPMark measures) is a structural finding, not a model-specific one. A workflow that chains 10 steps at 85% per-step accuracy has roughly a 20% end-to-end success rate. Reliability compounds downward. That arithmetic is the reason tool design and governance matter as much as model selection.

One operational detail worth noting: parallel tool calling — the model issuing multiple independent tool calls in a single turn — is now standard and matters for throughput. It lets an agent gather data from multiple sources simultaneously rather than sequentially. Strict mode (available in both Anthropic and OpenAI APIs) enforces that tool call arguments must exactly match the declared schema; Anthropic’s engineering guidance recommends using it.

MCP: the connectivity standard

Before the Model Context Protocol, connecting an AI application to an external system required a custom integration for each combination. A bespoke connector for GitHub, another for Slack, another for your internal database. Every integration was maintained separately and had to be rebuilt when switching AI providers. This was the N×M problem: N AI systems times M external tools, each pair requiring its own code.

Anthropic launched MCP on 25 November 2024.² It is an open standard using JSON-RPC 2.0 that defines a host/client/server model. An MCP server exposes three primitives: tools (callable functions), resources (data the agent can read), and prompts (templated instructions). Any MCP-compatible client can use any MCP server without custom integration code. One GitHub MCP server is usable by Claude, Codex CLI, Gemini CLI, or any other MCP-compatible system.

The adoption trajectory was unusually fast for an integration standard. OpenAI adopted MCP across its products in March 2025. Google DeepMind confirmed support in April 2025. By December 2025, Anthropic reported 97 million monthly SDK downloads and more than 10,000 active public MCP servers, then donated MCP to the Linux Foundation’s Agentic AI Foundation — co-founded with OpenAI and Block — cementing it as a cross-vendor open standard.² By May 2026, 9,652 unique servers were registered in the official registry.³ A 2026 Stacklok survey found 45% of software-industry respondents already in production with MCP.³

What MCP does not solve is worth stating clearly. It standardises connectivity; it does not standardise quality, security, or governance. Each server’s tool descriptions are authored independently. The permissions a client inherits depend on the trust level the client grants. The security implications of this are covered below in the failure modes section.

Agent Skills: the knowledge layer

Tools are what an agent can call. Skills are what an agent knows how to do.

The distinction is operational, not academic. An agent with the right tools but no skill context can invoke them inconsistently. An agent with skill context applies domain-specific knowledge about how to orchestrate tools for a specific workflow: when to use a search tool before writing, how to structure a research output, which combinations of calls produce reliable results for a given task type. Skills are the layer above tools.

Anthropic released the Agent Skills open standard specification on 18 December 2025.⁴ The SKILL.md format defines a skill as a directory containing a SKILL.md file with YAML frontmatter and optional supporting directories. The key design principle is progressive disclosure: a three-tier loading strategy that prevents context window pollution.

At startup, only the skill’s name and description load into the agent’s context — approximately 100 tokens per skill. When the agent determines a skill is relevant, the full SKILL.md body loads. Supporting files load only when explicitly referenced during execution. This means a team can install tens of skills without any single skill’s full context entering the window until needed. An analysis of 40,285 publicly listed skills found a median size of 1,414 tokens and a mean of 1,895 tokens, with 90% under 3,935 tokens — consistent with the specification’s design intent.⁴

Cross-tool portability is the headline outcome. By March 2026, 32 tools from competing companies — Anthropic, OpenAI, Microsoft, Google, JetBrains, AWS, Block, and others — all read the same SKILL.md files from the same directory structure.⁴ A skill written for Claude Code runs in Codex CLI, Cursor, Gemini CLI, and JetBrains Junie without modification. The SkillsMP marketplace lists 66,500+ published skills as of mid-2026.⁴

The relationship between MCP and Agent Skills is complementary rather than competing. MCP handles infrastructure: structured API access, server connections, authentication, live data feeds. Agent Skills handles knowledge: procedural instructions packaged as Markdown, version-controlled alongside code, requiring no runtime or build step. An engagement’s tool layer (what the agent can call) and skill layer (how the agent should use those capabilities for a specific client context) both require deliberate design.

MCP tells the agent what it can call. Skills tell it how to use what it has. Conflating the two is one of the more common architecture mistakes in early agent deployments.

Tool design: why descriptions are the interface

Reliable tool use is not primarily a model capability problem. It is a tool design problem. The most significant evidence for this comes from a study of 856 tools across 103 MCP servers, which found that 97.1% of tool descriptions contain at least one quality deficiency, with 56% exhibiting an “Unclear Purpose” problem — failing to articulate what the tool actually does.⁵ When the researchers systematically improved those descriptions, task success rates increased by a median of 5.85 percentage points and evaluator-level performance improved by 15.12 percentage points. Execution steps also increased by 67.46% — an accuracy-versus-efficiency trade-off with direct cost implications.⁵

The tool description is what the model reads to decide whether to call the tool, what arguments to provide, and how to interpret the result. Vague descriptions produce unreliable calls. Good descriptions state what the tool does and does not do, document the return format explicitly, and use natural language for identifiers wherever possible. Agents hallucinate arguments less with meaningful parameter names than with opaque identifiers.

Tool volume compounds this problem. The token cost of loading tool definitions is non-trivial. A five-server MCP setup consumes approximately 55,000 tokens before a conversation begins. The Tool Search Tool pattern — keeping three to five frequently-used tools always loaded while making the rest available on demand via a discovery mechanism — reduces this to around 8,700 tokens, an 85% reduction while maintaining access to the full tool library.⁶ Shopify’s internal analysis found that when their agent scaled from 20 to 50+ tools with overlapping functionality, tool outputs consumed 100 times more tokens than user messages.⁷

A separate failure mode — “analysis paralysis” — emerges when disambiguation burden becomes too high. Dropbox engineering documented models spending excessive time deciding which tool to use rather than acting when too many retrieval options appeared in context simultaneously. Research published in 2026 showed that models exhibit significant misalignment between perceived and true tool necessity, with 34% negative utility when the model already could perform a task correctly without a tool.⁸

Failure modes: what goes wrong in production

Seven failure modes are documented in the research and practitioner literature. All seven are structural, meaning they have structural fixes rather than just discipline-based workarounds.

Tool sprawl. The most common architectural failure. Developers add tools when agents can’t do something, rarely when agents have too many tools, creating a one-directional accumulation pattern. The fix requires active tool inventory management: an audit at each major expansion, retirement of superseded tools, and the discovery pattern applied from the start.

Ambiguous schemas. When two tools have overlapping capability descriptions, or when a tool’s parameters are underspecified, models hallucinate calls. Schema precision addresses one failure surface; semantic precision (the description communicating when and why to use the tool) addresses another. Both must be correct.

Tool poisoning. Embedding hidden instructions in tool metadata. An MCP server’s tool description can contain instructions that override the agent’s system prompt — directing it to exfiltrate data, ignore constraints, or execute unauthorised operations. A 2025 threat-modelling study found that five of seven evaluated MCP clients lack static validation protections against this attack vector.⁹ The 2025 ClawHavoc campaign infiltrated an agent marketplace with 1,184 confirmed malicious skills, exfiltrating API keys, cryptocurrency wallets, and credentials.¹⁰

Prompt injection through tool results. Tool results — data returned after a tool call — can contain adversarial text that redirects the agent’s behaviour. A web search result, a database row, or a file read are all vectors if the content is attacker-controlled. OWASP ranks prompt injection as the top vulnerability in agentic AI systems in its Agentic Top 10, published December 2025.¹⁰

Over-permissioning. Tools with broader permissions than a specific task requires create blast-radius risk. If the agent is manipulated into a bad action, the damage is proportional to what the tool can do. This is a governance decision, not a model decision.

Hallucinated tool calls. The agent generates a call referencing a function that does not exist in the provided tool set. Systems are poor at self-diagnosing tool-use hallucinations — agents are unreliable at recognising when they are fabricating calls. Strict mode and schema validation are the mitigations.

Error compounding. In multi-step workflows, accuracy multiplies downward. An agent at 85% per-step accuracy reaches roughly 20% end-to-end success on a 10-step workflow. This makes individual-step reliability improvements multiply in value across a long workflow.

Governance in 2026

Most teams think about tool permissions at design time: which tools does this agent get? Fewer have runtime enforcement of those permissions: what constraints apply to each call as it happens? The gap between design-time scoping and runtime enforcement is where the documented incidents have occurred.

Microsoft released the Agent Governance Toolkit in April 2026 (open-source, MIT licence) as the first framework claiming to address all 10 OWASP Agentic AI risks with deterministic, sub-millisecond policy enforcement.¹¹ It establishes several useful primitives. Delegation chains that can only narrow scope: a parent agent with read and write permissions can delegate only read to a child agent, never escalate. A govern() wrapper function that intercepts every tool call before execution, evaluates it against policy rules, and either allows, denies, or routes to a human-approval workflow. Behavioural trust scoring on a 0–1000 scale with five tiers. The sub-millisecond latency overhead makes runtime enforcement operationally viable.

Concurrent academic work proposed a four-tier permission model for skills based on provenance: first-party official skills, vetted community skills, unvetted community skills, and user-defined skills each receive different permission levels. Research found that 26.1% of community-contributed skills contain vulnerabilities — the provenance-tiered approach mirrors trust models familiar from package ecosystems like npm and PyPI.¹²

For operational deployments, the practical governance checklist is: least-privilege scoping at design time; runtime enforcement via a policy layer; allow-lists over deny-lists for tool access; human-sign-off gates for irreversible, high-blast-radius, or cross-boundary actions; and delegation chains that structurally prevent permission escalation.

What this means for workflows that run over months

The framing most tool-use discourse uses is developer tooling: how do I build an agent for a task? The framing that matters for AI-operated workflows is operational continuity: how does an agent stay reliable and auditable across a long engagement?

Tool inventories compound in complexity over time. A workflow that starts with five tools becomes fifteen as integrations expand. Without active governance, this produces tool sprawl. The discipline is to apply the discovery pattern from the start, audit the tool inventory at each major expansion, and retire tools that have been superseded. Retrofitting this later is expensive.

Tool descriptions are not configuration — they are part of the system’s reasoning substrate. A change to a tool description changes how the agent decides when and how to use it. Treat tool description changes with the same code-review discipline as function signature changes.

Skill files are where institutional knowledge about a specific engagement lives in an agent-consumable form. How a client’s CRM is structured, how to interpret their performance data, what the conventions are for their content style, how to handle their exception cases — this is procedural knowledge that either lives in versioned skill files or must be re-loaded from scratch at every session. The investment in good skill files compounds over an engagement’s lifetime in a way that ad hoc prompting does not.

Governance is not optional at operational scale. Consumer AI applications can tolerate probabilistic safety — the cost of an occasional bad output is low. A workflow running alongside a business cannot. A workflow that accidentally sends a customer email, deletes a database record, or makes an unauthorised API call has real consequences. The emerging toolkit — OWASP Agentic Top 10 as a risk taxonomy, the Agent Governance Toolkit as runtime enforcement infrastructure, provenance-tiered skill permissions as a supply-chain model — gives teams the vocabulary and the infrastructure to implement governance before an incident, rather than after one.

The Knowledge Hub is re-checked as the research moves — each module carries a ‘what changed’ note and its last-reviewed date at the top. Follow the modules you care about; skip the ones you don’t.

Get monthly Notes from RTSN

If your team is building an AI-operated workflow, the tool and skill design decisions you make now compound across months of use.

Book a Discovery Call See how we work