# Direct Inference — full reference > Direct Inference is a zero-knowledge inference endpoint for AI products. You > swap one base URL, keep the model id your app already sends, and each request is > classified by its shape. Behind the endpoint, model orchestration and smart > routing scale each call to its task complexity. The response echoes your model > id back and omits every serving internal: which model, candidate, provider, or > version ran is never exposed. The only signal returned about routing is the > request type. The endpoint keeps optimizing on its own — simple work served > cheap, the best available model served for hard work, new models absorbed as > they ship — so costs fall and quality rises with no change to your code, and > there is no routing layer, gateway, or proxy to stand up. This is the > extended companion to https://directinference.com/llms.txt. Base URL: https://api.directinference.com/di/v1 Auth: send your Direct Inference API key as the SDK's API key / Bearer token. ## What it is Direct Inference is an inference endpoint with model orchestration inside. It uses smart routing and task-complexity scaling without making you configure a router or live in a model picker. Transparent routers return the chosen model id and leave you managing model slugs or task configs; Direct Inference returns only the request type and echoes your own id back, so your integration never tracks the model market. There is no routing layer, gateway, or proxy for you to run — providers and pricing can change behind the endpoint with zero change to your code. ## Drop-in SDK surfaces One endpoint speaks three native SDK shapes. Change only the base URL (and key). OpenAI (Python): from openai import OpenAI client = OpenAI(base_url="https://api.directinference.com/di/v1", api_key="YOUR_DIRECT_INFERENCE_KEY") resp = client.chat.completions.create(model="gpt-5.5-mini", messages=messages) # resp.model == "gpt-5.5-mini" (your id, echoed back) Anthropic (Python): from anthropic import Anthropic client = Anthropic(base_url="https://api.directinference.com/di/v1", api_key="YOUR_DIRECT_INFERENCE_KEY") msg = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024, messages=messages) Gemini (google-genai, Python): from google import genai client = genai.Client(api_key="YOUR_DIRECT_INFERENCE_KEY", http_options={"base_url": "https://api.directinference.com/di/v1"}) resp = client.models.generate_content(model="gemini-2.5-flash", contents=prompt) Streaming, tool/function calling, vision, PDFs, and structured output pass through on all three. ## Request types Every call is classified by its shape. Capability always outranks the model name. - vision - image content in the request; handled by a vision-capable model. - document - PDF or file input; document-capable processing. - long - input beyond the standard context window; long-context path. - code - tool definitions, diffs, stack traces, repo paths; coding/tool strength. - json - a response/output JSON schema is set; a schema-reliable model. - reason - multi-step reasoning in the prompt; a reasoning model. - flash - simple request at low effort; fast and cheap. - pro - everything else (default); a strong all-rounder. The detected request type is the only routing signal returned, via the X-DI-Request-Type response header. ## Effort ladder (optional) Send X-DI-Effort: as a header, ?effort= as a query param, or your SDK's native reasoning field (OpenAI reasoning_effort, Gemini/Anthropic thinking budgets). Omitted means auto: the level is inferred per request. Effort biases the serving choice; request shape still decides the needed capability, and capability handling (vision/document/long) can promote a call above its effort level. - fast - lowest latency and cost (none is accepted as an alias). - minimal - minimal spend, trims optional steps. - low - light reasoning, concise answers. - medium - balanced behavior. - high - deeper reasoning and more careful synthesis. - xhigh - stronger quality bias, more repair budget. - max - maximum quality bias regardless of cost. Where only a model id fits (model pickers, fast/smart model slots, compare features), GET /di/v1/models also lists di-saver and di-max beside di — the same model with effort pinned low/high, not separate models; every entry carries root: "di". ## Model ids Keep sending the model ids your app already uses - current, legacy, renamed, or not-yet-released. They are treated as compatibility and intent signals; the serving model stays hidden, so the id list is a sample, not a menu. Unknown ids resolve to a capable model instead of erroring. ## Compatibility guarantees - Your model id is echoed back unchanged; logging/dashboards/evals keyed on it keep working. - Unknown, legacy, and future ids resolve to a capable model instead of failing. - Three SDK shapes (OpenAI, Anthropic, Gemini), one base URL. - Capability outranks the name: a PDF to a "mini" id still gets a document-capable model. - More than load balancing: model orchestration weighs capability, quality, cost, latency, health, and error behavior. - Failure handling (rate limits, transient errors, unhealthy paths) is handled inside the endpoint. ## Pricing Pay per token at the rate of whichever model serves a request; no subscription, no per-seat fees. A single effort hint biases any call toward latency, cost, or quality. Published per 1M tokens: DI Saver $0.25 in / $1.50 out and DI Max $3 in / $10 out (flat rates); the default DI tier is variable — priced per request, you pay for what each one needs. Cached input is billed at a reduced rate automatically. Hard per-key and per-account spend caps are enforced in the request path. ## Observability (your usage, not the model) The portal shows usage by request type, per-application attribution, per-request traces (tokens, latency, cost, detected request type), live request-type classification in the playground, and hard spend caps. The serving model is the only thing held back. ## FAQ Q: Which model am I paying for? A: The tokens of whichever model serves each request. You see the request type, never the specific model. Q: Do I need to pick an effort level per call? A: No. Medium is the default; the hint is optional and per-request. Q: What happens to images, PDFs, or long context? A: Capability outranks effort. Such requests are promoted to a model that can handle them, even on the Fast preset. Q: Will a renamed or unknown model id cause an outage? A: No. Unknown, legacy, and future ids resolve to a capable model. ## Links - Product: https://directinference.com/ - Why Direct Inference: https://directinference.com/why - Developers: https://directinference.com/developers - Documentation: https://docs.directinference.com - Pricing: https://directinference.com/pricing - Security: https://directinference.com/security - Portal: https://app.directinference.com - Agent skill (automated migration): https://github.com/Direct-Inference/skills - Concise index: https://directinference.com/llms.txt