# Direct Inference — full reference

> Direct Inference is a zero-knowledge inference endpoint for AI products. You
> swap one base URL, keep the model id your app already sends, and each request is
> classified by its shape. Behind the endpoint, model orchestration and smart
> routing scale each call to its task complexity. The response echoes your model
> id back and omits every serving internal: which model, candidate, provider, or
> version ran is never exposed. The only signal returned about routing is the
> request type. The endpoint keeps optimizing on its own — simple work served
> cheap, the best available model served for hard work, new models absorbed as
> they ship — so costs fall and quality rises with no change to your code, and
> there is no routing layer, gateway, or proxy to stand up. This is the
> extended companion to https://directinference.com/llms.txt.

Base URL: https://api.directinference.com/di/v1
Auth: send your Direct Inference API key as the SDK's API key / Bearer token.

## What it is

Direct Inference is an inference endpoint with model orchestration inside. It
uses smart routing and task-complexity scaling without making you configure a
router or live in a model picker. Transparent routers return the chosen model id
and leave you managing model slugs or task configs; Direct Inference returns
only the request type and echoes your own id back, so your integration never
tracks the model market. There is no routing layer, gateway, or proxy for you to
run — providers and pricing can change behind the endpoint with zero change to
your code.

## Drop-in SDK surfaces

One endpoint speaks three native SDK shapes. Change only the base URL (and key).

OpenAI (Python):

    from openai import OpenAI
    client = OpenAI(base_url="https://api.directinference.com/di/v1", api_key="YOUR_DIRECT_INFERENCE_KEY")
    resp = client.chat.completions.create(model="gpt-5.5-mini", messages=messages)
    # resp.model == "gpt-5.5-mini"  (your id, echoed back)

Anthropic (Python):

    from anthropic import Anthropic
    client = Anthropic(base_url="https://api.directinference.com/di/v1", api_key="YOUR_DIRECT_INFERENCE_KEY")
    msg = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024, messages=messages)

Gemini (google-genai, Python):

    from google import genai
    client = genai.Client(api_key="YOUR_DIRECT_INFERENCE_KEY", http_options={"base_url": "https://api.directinference.com/di/v1"})
    resp = client.models.generate_content(model="gemini-2.5-flash", contents=prompt)

Streaming, tool/function calling, vision, PDFs, and structured output pass
through on all three.

## Request types

Every call is classified by its shape. Capability always outranks the model name.

- vision     - image content in the request; handled by a vision-capable model.
- document   - PDF or file input; document-capable processing.
- long       - input beyond the standard context window; long-context path.
- code       - tool definitions, diffs, stack traces, repo paths; coding/tool strength.
- json       - a response/output JSON schema is set; a schema-reliable model.
- reason     - multi-step reasoning in the prompt; a reasoning model.
- flash      - simple request at low effort; fast and cheap.
- pro        - everything else (default); a strong all-rounder.

The detected request type is the only routing signal returned, via the
X-DI-Request-Type response header.

## Effort ladder (optional)

Send X-DI-Effort: <level> as a header, ?effort=<level> as a query param, or your
SDK's native reasoning field (OpenAI reasoning_effort, Gemini/Anthropic thinking
budgets). Omitted means auto: the level is inferred per request. Effort biases
the serving choice; request shape still decides the needed capability, and
capability handling (vision/document/long) can promote a call above its effort
level.

- fast    - lowest latency and cost (none is accepted as an alias).
- minimal - minimal spend, trims optional steps.
- low     - light reasoning, concise answers.
- medium  - balanced behavior.
- high    - deeper reasoning and more careful synthesis.
- xhigh   - stronger quality bias, more repair budget.
- max     - maximum quality bias regardless of cost.

Where only a model id fits (model pickers, fast/smart model slots, compare
features), GET /di/v1/models also lists di-saver and di-max beside di — the same
model with effort pinned low/high, not separate models; every entry carries
root: "di".

## Model ids

Keep sending the model ids your app already uses - current, legacy, renamed, or
not-yet-released. They are treated as compatibility and intent signals; the
serving model stays hidden, so the id list is a sample, not a menu. Unknown ids
resolve to a capable model instead of erroring.

## Compatibility guarantees

- Your model id is echoed back unchanged; logging/dashboards/evals keyed on it keep working.
- Unknown, legacy, and future ids resolve to a capable model instead of failing.
- Three SDK shapes (OpenAI, Anthropic, Gemini), one base URL.
- Capability outranks the name: a PDF to a "mini" id still gets a document-capable model.
- More than load balancing: model orchestration weighs capability, quality, cost, latency, health, and error behavior.
- Failure handling (rate limits, transient errors, unhealthy paths) is handled inside the endpoint.

## Pricing

Pay per token at the rate of whichever model serves a request; no subscription,
no per-seat fees. A single effort hint biases any call toward latency, cost, or
quality. Published per 1M tokens: DI Saver $0.25 in / $1.50 out and DI Max $3 in /
$10 out (flat rates); the default DI tier is variable — priced per request, you
pay for what each one needs. Cached input is billed at a reduced rate automatically. Hard per-key and
per-account spend caps are enforced in the request path.

## Observability (your usage, not the model)

The portal shows usage by request type, per-application attribution, per-request
traces (tokens, latency, cost, detected request type), live request-type
classification in the playground, and hard spend caps. The serving model is the
only thing held back.

## FAQ

Q: Which model am I paying for?
A: The tokens of whichever model serves each request. You see the request type, never the specific model.

Q: Do I need to pick an effort level per call?
A: No. Medium is the default; the hint is optional and per-request.

Q: What happens to images, PDFs, or long context?
A: Capability outranks effort. Such requests are promoted to a model that can handle them, even on the Fast preset.

Q: Will a renamed or unknown model id cause an outage?
A: No. Unknown, legacy, and future ids resolve to a capable model.

## Links

- Product: https://directinference.com/
- Why Direct Inference: https://directinference.com/why
- Developers: https://directinference.com/developers
- Documentation: https://docs.directinference.com
- Pricing: https://directinference.com/pricing
- Security: https://directinference.com/security
- Portal: https://app.directinference.com
- Agent skill (automated migration): https://github.com/Direct-Inference/skills
- Concise index: https://directinference.com/llms.txt