The endpoint that just works

Stop paying frontier
prices for AI workloads.

The durable, cost-effective inference endpoint that just works no matter your needs. Every AI workload is continuously and automatically optimized for performance and cost — with one base URL, one key, and no code rewrite.

Get your API key See how it works

Zero-knowledgeEncrypted in transit & at restHard spend capsNo training on your data

Works with the AI stack your team already trusts.

OpenAI SDKAnthropic SDKGemini SDKCursorClaude CodeLibreChat

Native SDKs

Use the OpenAI, Anthropic, or Gemini client your product already ships.

Existing model ids

Keep your current model string. Direct Inference echoes it back unchanged.

Capability-first

Simple work stays cheap; documents, vision, code, long context, and reasoning scale up automatically.

Cost receipt

Every response itemizes tokens, price, and the request type, so you always see what each call cost.

Frontier-lab capability.
A fraction of the cost.

endpoint

rewrites

API shapes

No model selection

No choosing a model per task and re-evaluating it every time the market moves.

No smart router to build

No standing up routing in-house or juggling models across requests yourself.

No complex rewrite

Change one base URL and one key, and start seeing cost savings immediately.

No proxy tax

No AI proxy taking a percentage of your traffic. You pay per token, nothing more.

Stop overpaying for AI inference.
Keep every frontier capability.

One endpoint absorbs model churn, capability gaps, and cost drift, while each request gets only the capability it actually needs.

See the setup

Setup

Swap the URL.
Keep your code.

Point your existing SDK at the DI base URL.

Keep sending the model id your app already sends.

Let each request scale to the capability it actually needs.

After — Direct Inference handles the model work

# one endpoint; keep your model string
client = OpenAI(
    base_url="https://api.directinference.com/di/v1",
    api_key=DI_API_KEY,
)
client.chat.completions.create(model="gpt-5.5-mini", ...)

Cost transparency

Full transparency on cost
and performance, always.

Every response ships with a receipt: tokens, price, the detected request type, and what the same call would have cost at a flat top-tier rate — per request and in your dashboard. We continuously optimize each request for performance and cost, and the receipt shows exactly what you paid. No surprises on your token spend.

Response receipt

id: "gen_3f9c0a…"
request_type: "flash"
input_tokens: 2048
output_tokens: 256
cost_usd: 0.0009
cost_if_top_tier_usd: 0.0087

Capability

Cheap-first.
Capability when it counts.

Simple summaries, extraction, and JSON edges stay fast and inexpensive. Documents, images, long context, code, and reasoning promote themselves automatically. The request decides how much capability it needs, not the model name your app happens to send.

vision

document

long

code

json

reason

flash

pro

Private by construction.

Encrypted in transit and at rest, hard per-key and per-account spend caps, and never trained on your data — with the attestations to back it.

SOC 2 Type IIISO/IEC 27001HIPAAGDPRCCPA

Security & trust

Stop overpaying for the simple tail.
One endpoint, forever.

Change one base URL, keep your SDK, and let the endpoint handle cost, capability, and model churn.

Get your key Talk to an engineer

Stop paying frontierprices for AI workloads.