Pricing

Pay by the token. Keep the experience simple.

No subscription, no per-seat fees. Direct Inference keeps the simple majority of requests economical while preserving frontier-class paths for harder work. A single effort hint biases any call toward latency, cost, or quality — no code rewrite, no model swap — and capability handling promotes a request when it needs more.

DI Saver

effort: low

From $0.25 / 1M input · $1.50 / 1M output

Target latency ~1.2s

Summaries, classification, extraction, and rewrites — the simple tail of traffic.

DI (Auto)

Default

effort: auto

Variable

priced per request — pay for what each one needs

Target latency ~2s

The default tier. Each request is billed at the rate of the model that serves it, so you pay for what that request needs rather than a flat top-tier rate.

DI Max

effort: high

From $3.00 / 1M input · $10.00 / 1M output

Target latency ~6s

Hard reasoning, long context, and answers that need frontier-grade quality.

Figures are representative starting rates per 1M tokens. You’re billed per token at the rate of whichever model serves a given request; the effort hint biases that choice, and capability handling (vision, document, long context) can promote a call above its effort level when the request requires it.
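The per-token arithmetic can be sketched as follows. This is a minimal illustration that uses only the representative Saver and Max starting rates shown on the cards; the function name, the tier keys, and the example token counts are assumptions made for the example, and a real request is billed at the rate of whichever model serves it.

```python
# Representative starting rates from the tier cards, in USD per 1M tokens.
# Real requests are billed at the rate of whichever model serves them.
RATES_PER_MILLION = {
    "saver": {"input": 0.25, "output": 1.50},
    "max": {"input": 3.00, "output": 10.00},
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request at a tier's representative rate."""
    rate = RATES_PER_MILLION[tier]
    total = input_tokens * rate["input"] + output_tokens * rate["output"]
    return total / 1_000_000


# Hypothetical request: 2,000 input tokens and 500 output tokens.
saver_cost = request_cost("saver", 2_000, 500)
max_cost = request_cost("max", 2_000, 500)
print(f"Saver: ${saver_cost:.5f}  Max: ${max_cost:.5f}")
```

At these representative rates the same request costs about $0.00125 on Saver and $0.011 on Max.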

Enterprise

Custom / Contact us

For platform and engineering teams running production AI at scale — and the security, finance, and procurement partners who sign off on it.

SAML / OIDC single sign-on & SCIM provisioning
99.95% uptime SLA with financially-backed credits
Dedicated capacity & reserved throughput
Private, dedicated, or VPC deployment
Org-wide audit logs & SIEM export
Account-wide & per-application spend caps
Volume-based & committed-use pricing
Annual invoicing, POs, and net terms
Signed BAA, MSA, and DPA; SOC 2 report access
Named technical account manager & priority support

What you actually pay

Your cost follows your traffic, not your worst-case model.

Most production traffic is simple. The simple tail is served at Saver rates while only the hard tail spends Max — so your bill tracks the work, not a top-tier rate on every call. Here is an illustrative mix.

| Request mix | Share | In /1M | Out /1M |
| --- | --- | --- | --- |
| DI Saver: simple tail (classification, extraction, short chat) | 70% | $0.25 | $1.50 |
| DI (Auto): everyday assistant and product traffic | 20% | Variable | Variable |
| DI Max: hard reasoning, long context, frontier-grade answers | 10% | $3.00 | $10.00 |

Saver vs Max rate: 92% lower input cost · 85% lower output cost

Illustrative only. Shares are an example traffic mix; the simple majority runs at Saver — a fraction of an all-Max bill — while the default DI tier handles the middle at a per-request rate. Your actual mix depends on your traffic, and capability handling can still promote a request when it needs more, so a cheap simple tail never means a hard request gets shortchanged.
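Because the Auto share is priced per request, a blended rate for this mix can only be bracketed, not pinned down. The sketch below computes a share-weighted rate under two stated assumptions: the shares are treated as shares of token volume, and Auto traffic is assumed to fall somewhere between the representative Saver and Max rates. Neither assumption comes from DI's published pricing.

```python
# Illustrative traffic mix and representative rates (USD per 1M tokens).
MIX = {"saver": 0.70, "auto": 0.20, "max": 0.10}
SAVER = {"input": 0.25, "output": 1.50}
MAX = {"input": 3.00, "output": 10.00}


def blended_rate(auto_rate: dict) -> dict:
    """Share-weighted rate per 1M tokens for an assumed Auto rate."""
    return {
        kind: MIX["saver"] * SAVER[kind]
        + MIX["auto"] * auto_rate[kind]
        + MIX["max"] * MAX[kind]
        for kind in ("input", "output")
    }


low = blended_rate(SAVER)  # assumption: all Auto traffic lands at Saver rates
high = blended_rate(MAX)   # assumption: all Auto traffic lands at Max rates
for kind in ("input", "output"):
    print(f"{kind}: ${low[kind]:.3f} to ${high[kind]:.3f} vs all-Max ${MAX[kind]:.2f}")
```

Under those assumptions the blended input rate falls between $0.525 and $1.075 per 1M tokens, against $3.00 for an all-Max bill, and the blended output rate between $2.35 and $4.05, against $10.00.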

Versus frontier models

Published rates roughly 40–75% below the listed models.

A straight list-price comparison: DI's published per-1M effort-tier rates against the published list prices for models such as Claude Haiku, Claude Opus, and the current GPT-5.x family. The workload is the same and the rate on the meter is lower. DI never discloses which model serves a given request.

| Frontier model (list price) | Vendor | Their /1M (in / out) | DI tier /1M (in / out) | DI tier | DI rate is lower by |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | Anthropic | $1.00 / $5.00 | $0.25 / $1.50 | DI Saver | 75% in · 70% out |
| Claude Opus 4.x | Anthropic | $5.00 / $25.00 | $3.00 / $10.00 | DI Max | 40% in · 60% out |
| gpt-5.5 | OpenAI | $5.00 / $30.00 | $3.00 / $10.00 | DI Max | 40% in · 67% out |

Illustrative. Figures compare DI's published per-1M effort-tier rates against each vendor's published list price, accessed 2026-06-14 and subject to change. DI bills per token at the rate of whichever model serves a request; we never disclose which model that is.
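The "lower by" percentages follow directly from the two price columns. As a quick check, the sketch below recomputes them from the figures in the comparison; the function and variable names are illustrative, and the prices are copied from the table as published here rather than fetched from any vendor.

```python
def percent_lower(list_price: float, di_rate: float) -> int:
    """Whole-number percentage by which di_rate sits below list_price."""
    return round((list_price - di_rate) / list_price * 100)


# (model, vendor list price in/out, DI tier rate in/out), all USD per 1M tokens.
ROWS = [
    ("Claude Haiku 4.5", (1.00, 5.00), (0.25, 1.50)),
    ("Claude Opus 4.x", (5.00, 25.00), (3.00, 10.00)),
    ("gpt-5.5", (5.00, 30.00), (3.00, 10.00)),
]

savings = {
    model: (percent_lower(their[0], di[0]), percent_lower(their[1], di[1]))
    for model, their, di in ROWS
}
for model, (pct_in, pct_out) in savings.items():
    print(f"{model}: {pct_in}% in, {pct_out}% out")
```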

Lower published rates

DI's per-token rates are published below the frontier labs' list prices for comparable work. You don't trade rate for capability — the rate is simply lower on the meter.

Frontier strength only when it's needed

Most production traffic is simple, and the simple tail is served fast and cheap. Only the requests that actually need frontier-grade reasoning are served at the top tier — so you pay frontier rates per request, not across your whole bill.

How billing works

Pay-as-you-go, by the token

Top up with a card

Add credit when you need it and draw it down per request. No monthly minimum, no contract to negotiate.

Savings on the simple tail

Trivial requests are recognized and served fast and cheap — so your blended cost drops while the same endpoint stays capable.

Cached input costs less

When a request reuses a prompt prefix, cached input is billed at a reduced rate automatically — no cache plumbing on your side.
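The effect of a reduced cached rate on input cost can be sketched as below. The page does not state the cached rate, so the multiplier here is a placeholder chosen purely for illustration, as are the function name and the token counts.

```python
def input_cost(input_tokens: int, cached_tokens: int,
               rate_per_million: float, cached_multiplier: float) -> float:
    """Input cost in USD when cached tokens are billed at a reduced rate."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    # Cached tokens count as a fraction of a full-price token.
    billable_tokens = fresh_tokens + cached_tokens * cached_multiplier
    return billable_tokens * rate_per_million / 1_000_000


# Placeholder discount, NOT a published DI figure.
HYPOTHETICAL_CACHED_MULTIPLIER = 0.5

# Hypothetical 10,000-token prompt at the representative Max input rate,
# first with no reuse, then with an 8,000-token prefix served from cache.
uncached = input_cost(10_000, 0, 3.00, HYPOTHETICAL_CACHED_MULTIPLIER)
cached = input_cost(10_000, 8_000, 3.00, HYPOTHETICAL_CACHED_MULTIPLIER)
print(f"uncached: ${uncached:.3f}  with cached prefix: ${cached:.3f}")
```

With that placeholder multiplier, the input cost in this example drops from $0.030 to $0.018; the actual saving depends on the cached rate DI applies.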

Questions

Good to know

Which model am I paying for?

You pay for the tokens of whichever model serves each request. The DI Model is zero-knowledge by design: you see the request type, never the specific model. That lets us keep the developer experience stable while selecting capable, economical models on your behalf.

Do I need to pick an effort level per call?

No. DI (Auto) is the default. The effort hint is optional and per-request, so you can dial a single call toward Saver (effort: low) or Max (effort: high) without touching the rest of your integration.
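As a rough sketch of what an optional, per-request hint looks like in client code: the effort values below are the ones shown on the tier cards, while the request shape, the `input` field name, and the `build_request` helper are assumptions for illustration, not DI's documented API.

```python
from typing import Optional

VALID_EFFORTS = {"low", "auto", "high"}  # values shown on the tier cards


def build_request(prompt: str, effort: Optional[str] = None) -> dict:
    """Build a hypothetical request body; the hint is added only when given."""
    body = {"input": prompt}  # "input" is an assumed field name
    if effort is not None:
        if effort not in VALID_EFFORTS:
            raise ValueError(f"effort must be one of {sorted(VALID_EFFORTS)}")
        body["effort"] = effort
    return body


# Most calls omit the hint and take the default behaviour.
default_call = build_request("Summarize this ticket.")
# One call opts into the top tier without changing any other code.
hard_call = build_request("Check this proof for gaps.", effort="high")
print(default_call)
print(hard_call)
```

Keeping the hint optional means existing call sites stay untouched, and only the calls that need a different trade-off carry the extra field.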

What happens to images, PDFs, or long context?

Capability outranks the effort level. A request with an image, a document, or oversized context is promoted to a model that can handle it, even on DI Saver (effort: low), so nothing silently fails or gets truncated.

Will a renamed or unknown model id cost me an outage?

No. Unknown, legacy, and future ids resolve to a capable model instead of erroring, so a provider renaming a model does not break your code or your billing.

Ready when you are.

Start building