Why does an LLM skip some MCP tools?

An agent chooses a tool by reading its name and description against the user's request. If the name is opaque (customer_v2_handler) and the description is internal jargon (internal endpoint), the model has nothing to match on. It skips the tool and either picks a worse one or answers from memory, which reads as a hallucination. A verb-first name and a one-sentence description that says when to call the tool fix this.

What makes a tool description callable?

A callable description answers one question for the agent: when would I want to call this? It uses a verb-first name (lookup_customer, create_ticket), a single concrete sentence with domain nouns, enums instead of free-text where the value set is finite, a description on every field, and a worked example argument. The MCP spec carries all of this in the tool and inputSchema fields, so the agent reads it before it ever calls.

How does AppElixir help write tool descriptions?

The engine treats the description as product, not advice. It lints every tool contract for callability, flagging vague names, missing field descriptions, free-text where an enum belongs, and overlapping tools the agent will confuse. It also auto-drafts a callable description from the schema and the bound collection names, which the builder then edits rather than writing from a blank page.

Should I use an enum or a free-text field for tool inputs?

If the field has a finite, known set of valid values, use an enum. Agents respect enums because the valid values are in the schema the model reads. Free-text invites hallucinated values and extra round-trips. The AppElixir compiler maps a dropdown form field to a JSON Schema enum automatically, and the linter flags a free-text field whose stored values are obviously a small fixed set.

Tool Ergonomics: Descriptions an LLM Will Actually Call

This is chapter 5 of the AppElixir engine teardown, a walk through how a no-code schema becomes a live MCP server. Start at chapter one.

The previous chapter built the runtime: transports, the handshake, tools/list, tools/call, structured results. That runtime is the body. This chapter is about the face. When an agent connects, the first thing it does is read the tool list, and from that point on every decision it makes about your server is downstream of what it read. The names and descriptions are not metadata around the real interface. They are the interface the model sees.

That turns what teams treat as a writing chore into an engineering concern. A tool with a perfect implementation and a bad description is a broken tool, because the agent never calls it. So the engine owns the description the way it owns validation: it lints it, it drafts it, it versions it. "The description is product" is not a slogan we put in a blog post. It is a compiler pass.

Why the agent skips a tool

An agent does not introspect your code. It has a budget of context and a list of tools, each reduced to a name, a sentence, and an input schema. When the user says "what plan is this customer on," the model scans that list and matches intent to the most plausible tool. Matching is the whole game. Consider two tools that do exactly the same thing:

{
  "name": "customer_v2_handler",
  "description": "internal endpoint"
}

{
  "name": "lookup_customer",
  "description": "Returns the customer record (plan, signup date, status) for a given email address."
}

The first is dead weight. customer_v2_handler tells the model nothing about when to reach for it, and "internal endpoint" actively warns the model off. Faced with a question it cannot map to a tool, the agent does the worst available thing: it answers from its own weights. To the user that looks like a hallucination. It is really a UX failure in the tool contract. The second tool is callable because the name is a verb plus a noun and the sentence answers the only question the agent is actually asking: would calling this help me right now?

The rules the engine bakes in

The MCP specification gives every tool a name, a description, and an inputSchema in JSON Schema, and the spec is explicit that these fields are model-facing. (See the protocol definition for the tool shape.) Those slots are where ergonomics live. The engine enforces a small set of rules across all of them:

Verb-first names. lookup_customer, create_ticket, list_invoices. The leading verb tells the model the action class before it reads a word of description. A noun-only name like customer makes the model guess whether it reads, writes, or searches.
A one-sentence "when to call me." The description answers, in one concrete sentence, "when would I, an agent, want to call this?" using nouns from your domain. Not "handles customer data." Instead: "Returns the customer record for a given email." Specific beats complete.
Enums over free-text. If a field has a finite set of valid values, it is an enum. Agents respect enums because the legal values are right there in the schema. Free-text fields invite invented values and retry loops.
A description on every field. JSON Schema lets each property carry its own description, and the agent reads it. "Customer email, lowercase, no whitespace" removes a class of malformed calls before they happen.
Worked examples. One example argument object per tool. Agents pattern-match hard on examples; a single realistic example does more for call quality than a paragraph of prose.
No overlapping tools. Two tools the agent cannot tell apart are worse than one, because the model wastes a turn picking wrong. The engine flags near-duplicate contracts and pushes you to merge or differentiate them.
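
Taken together, the rules above produce a contract shaped roughly like the one below. This is a sketch for illustration, not engine output. The optional status filter, its description text, and the placement of the worked example under the JSON Schema examples keyword are assumptions; only name, description, and inputSchema come from the MCP tool shape. The helper fields_missing_description is hypothetical and shows how small the "description on every field" check can be.

```python
import json

# Hypothetical contract for illustration. Only name, description, and
# inputSchema are part of the MCP tool shape; the rest is standard JSON Schema.
lookup_customer = {
    "name": "lookup_customer",  # verb-first: the action class comes first
    "description": (
        "Returns the customer record (plan, signup date, status) "
        "for a given email address."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "email": {
                "type": "string",
                "description": "Customer email, lowercase, no whitespace",
            },
            "status": {
                # Assumed optional filter, included to show an enum field.
                "type": "string",
                "enum": ["active", "paused", "cancelled"],
                "description": "Only match a customer with this status",
            },
        },
        "required": ["email"],
        # One worked example argument object (assumed placement).
        "examples": [{"email": "ada@example.com"}],
    },
}


def fields_missing_description(tool):
    """Return the names of input fields that carry no description."""
    props = tool["inputSchema"]["properties"]
    return [name for name, spec in props.items() if not spec.get("description")]


print(json.dumps(lookup_customer, indent=2))
print(fields_missing_description(lookup_customer))  # → []
```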

None of these are novel. The Anthropic reference servers in modelcontextprotocol/servers already write tools this way, and they are the bar the linter measures against: read their list_files or search_issues descriptions and you see the pattern, a verb, a sentence, a tight schema. What the engine adds is that you cannot ship a contract that ignores the pattern by accident.

By hand

When you hand-roll a server, the contract is whatever you typed into the SDK's registerTool call. Nothing checks that the name leads with a verb, that every field has a description, or that the third tool you added last month is a near-clone of the first. The contract drifts as the team grows, and the only signal that it drifted is an agent quietly choosing the wrong tool in production, which nobody notices until a customer complains.

With the engine

The contract is derived from the same visual schema that compiled the inputs and outputs (chapter two). Because the engine owns that schema, it can hold the contract to the rules above before the server is ever served. The linter runs at build time. The draft generator runs at design time. The contract is a checked artifact, not a hand-typed string.

Linting a tool for callability

The callability linter is a pass over the compiled contract, not over your prose. It reads the schema and the bound collection and emits findings the builder sees in the editor. The findings map directly to the rules:

tool: customer_v2_handler

  warn  name is not verb-first
        suggest: lookup_customer

  error description is non-specific ("internal endpoint")
        a callable description answers "when would an agent call this?"

  warn  field "status" is free-text but stored values look enumerable
        seen: active, paused, cancelled
        suggest: enum ["active","paused","cancelled"]

  warn  field "email" has no description
        suggest: "Customer email, lowercase, no whitespace"

  warn  overlaps with tool "get_customer" (0.91 input/output similarity)
        the agent will confuse these; merge or differentiate

The free-text-where-an-enum-belongs check is the one that surprises people. The engine already inferred the collection's column types when it bound the data source (chapter three), so it can sample the distinct values in a column. When a "free-text" field turns out to hold three distinct values across every row, that is an enum wearing the wrong type, and the linter says so. The overlap check compares input and output schemas across tools and flags pairs above a similarity threshold, because two tools the model cannot distinguish are a tax on every turn.
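
Both checks are simple enough to sketch. The version below is a guess at the shape, not the engine's implementation: the thresholds (max_distinct, min_rows), the use of Jaccard similarity over (field, type) pairs, the equal weighting of input and output schemas, and the presence of an outputSchema on each tool are all assumptions made for the example.

```python
def suggest_enum(values, max_distinct=5, min_rows=20):
    """Suggest an enum if a free-text column holds only a few distinct values.

    Thresholds are assumed; a real linter would tune them.
    """
    distinct = {v for v in values if v is not None}
    if len(values) >= min_rows and 0 < len(distinct) <= max_distinct:
        return sorted(distinct)
    return None


def schema_keys(schema):
    """Flatten a JSON Schema object into a set of (field, type) pairs."""
    props = schema.get("properties", {})
    return {(name, str(spec.get("type"))) for name, spec in props.items()}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def overlap(tool_a, tool_b):
    """Average the input-schema and output-schema similarity of two tools."""
    sim_in = jaccard(schema_keys(tool_a["inputSchema"]),
                     schema_keys(tool_b["inputSchema"]))
    sim_out = jaccard(schema_keys(tool_a["outputSchema"]),
                      schema_keys(tool_b["outputSchema"]))
    return (sim_in + sim_out) / 2


# A "free-text" status column with three distinct values across 30 rows.
statuses = ["active", "paused", "cancelled"] * 10
print(suggest_enum(statuses))  # → ['active', 'cancelled', 'paused']

# A genuinely free-text column is left alone.
emails = [f"user{i}@example.com" for i in range(30)]
print(suggest_enum(emails))  # → None

# Two hypothetical tools with the same input and nearly the same output.
string = {"type": "string"}
lookup_customer = {
    "inputSchema": {"properties": {"email": string}},
    "outputSchema": {"properties": {"plan": string, "status": string,
                                    "signup_date": string}},
}
get_customer = {
    "inputSchema": {"properties": {"email": string}},
    "outputSchema": {"properties": {"plan": string, "status": string}},
}
print(overlap(lookup_customer, get_customer) > 0.8)  # → True
```

Comparing only field names and types is deliberately crude. It cannot see that two tools with different schemas do the same job, so a score like this is best read as a prompt to look, not a verdict.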

Auto-drafting a description

The harder ergonomic problem is the blank page. Most weak descriptions are weak because someone typed them under time pressure with no starting point. So the engine refuses to hand you a blank field. From the tool's verb, its input fields, and the name of the bound collection, it drafts a first description:

verb:        lookup
inputs:      email (string)
collection:  customers  (fields: email, plan, signup_date, status)

drafted description:
  "Look up a customer in the customers collection by email.
   Returns their plan, signup date, and status."

drafted example:
  { "email": "ada@example.com" }

The draft is deliberately editable, not authoritative. It gets the verb, the lookup key, and the returned fields right because those come straight from the schema, which means the builder's job shrinks to making it sound like their domain. Drafting from structure also keeps the description honest: it can only mention fields that actually exist in the collection, so it cannot promise data the tool does not return. The contract and the description cannot drift apart, because they are generated from the same source.
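
A template-based drafter is enough to reproduce the draft above. The sketch below is illustrative only and covers lookup-style tools: the function names, the VERB_PHRASES table, and the naive singularization (strip a trailing "s") are assumptions, not the engine's generator. The property that matters is the one described above, that the returned-fields sentence is built from the collection's actual field list.

```python
# Assumed mapping from tool verb to a readable phrase.
VERB_PHRASES = {"lookup": "Look up", "get": "Get", "find": "Find"}


def humanize(field):
    """Turn a column name like signup_date into 'signup date'."""
    return field.replace("_", " ")


def join_list(items):
    """Join words as 'a', 'a and b', or 'a, b, and c'."""
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def draft_lookup_description(verb, inputs, collection, fields):
    """Draft a description for a lookup-style tool from schema facts only."""
    # Naive singularization; irregular plurals would need a real inflector.
    singular = collection[:-1] if collection.endswith("s") else collection
    phrase = VERB_PHRASES.get(verb, verb.capitalize())
    keys = join_list([humanize(f) for f in inputs])
    # Only fields that exist in the collection, minus the lookup key itself.
    returned = join_list([humanize(f) for f in fields if f not in inputs])
    return (f"{phrase} a {singular} in the {collection} collection by {keys}. "
            f"Returns their {returned}.")


draft = draft_lookup_description(
    verb="lookup",
    inputs=["email"],
    collection="customers",
    fields=["email", "plan", "signup_date", "status"],
)
print(draft)
```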

Seeing the contract the way the agent does

Rules and drafts get you a clean contract on paper. The last step is to look at it the way the model will. The engine's test harness, built on the open MCP Inspector, lets you read the exact tools/list payload the agent receives and fire test calls against it. Reading your own tool list cold, with no knowledge of the implementation, is the fastest way to catch a name that made sense to you and nobody else. If you cannot tell from the list alone when to call a tool, neither can the model. Chapter seven returns to the harness as the gate before ship; here it is the mirror you hold up to the contract.
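
To make "the way the agent sees it" concrete, here is a sketch of a tools/list exchange reduced to what the model matches on. The request is the standard JSON-RPC shape for the method; the response content is hypothetical, and agent_view is an illustrative helper, not part of the Inspector or the engine.

```python
# JSON-RPC request an MCP client sends to ask for the tool list.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Hypothetical response, shaped like a tools/list result.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "customer_v2_handler",
                "description": "internal endpoint",
                "inputSchema": {"type": "object", "properties": {}},
            },
            {
                "name": "lookup_customer",
                "description": "Returns the customer record (plan, signup "
                               "date, status) for a given email address.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"email": {"type": "string"}},
                },
            },
        ]
    },
}


def agent_view(response):
    """Reduce each tool to a name, its field names, and its one sentence."""
    lines = []
    for tool in response["result"]["tools"]:
        fields = ", ".join(tool.get("inputSchema", {}).get("properties", {}))
        lines.append(f'{tool["name"]}({fields}): {tool.get("description", "")}')
    return lines


for line in agent_view(response):
    print(line)
```

Printed this way, the first tool reads as customer_v2_handler(): internal endpoint, which is all the model has to go on.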

Why this belongs in the engine

You could keep all of this as a style guide and trust people to follow it. Style guides lose to deadlines. The reason callability is a compiler concern and not a wiki page is the same reason validation is: the cost of getting it wrong is invisible at build time and expensive at run time. A bad description does not throw an error. It just means the agent quietly stops calling a tool that works, and the failure surfaces as a vague complaint that "the AI does not know about our customers."

Treating the description as product means the engine carries the rules, the linter enforces them, and the draft generator removes the excuse to skip them. The same move that made the runtime correct for free, identical plumbing for every server, makes the contract callable for free: identical ergonomics, checked once, applied to every tool. The data and the tool design are yours. The discipline that keeps them callable is the engine's.