Pricing
Pay by the token. Keep the experience simple.
No subscription, no per-seat fees. Direct Inference keeps the simple majority of requests economical while preserving frontier-class paths for harder work. A single effort hint biases any call toward latency, cost, or quality — no code rewrite, no model swap — and capability handling promotes a request when it needs more.
DI Saver
effort: low
From $1.50 / 1M output tokens
Summaries, classification, extraction, and rewrites — the simple tail of traffic.
DI (Auto)
Default
effort: auto
priced per request — pay for what each one needs
The default tier. Each request is billed at the rate of the model that serves it, so you never pay a top-tier rate for a request that didn't need one.
DI Max
effort: high
From $10.00 / 1M output tokens
Hard reasoning, long context, and answers that need frontier-grade quality.
Figures are representative starting rates per 1M tokens. You're billed per token at the rate of whichever model serves a given request; the effort hint biases that choice, and capability handling (vision, document, long context) can promote a call above its effort level when the request requires it.
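As a rough sketch of the per-token billing described above, the cost of one request is its input and output token counts multiplied by the per-1M rates of whichever model served it. The `Rate` class and `request_cost` helper are illustrative names, not part of any DI SDK. The Saver and Max figures are the representative starting rates on this page, with input rates taken from the request-mix table; the rate actually billed may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per 1M tokens (hypothetical container, not a DI type)."""
    input_per_m: float
    output_per_m: float


# Representative starting rates quoted on this page; actual billing uses the
# rate of whichever model serves the request.
SAVER = Rate(input_per_m=0.25, output_per_m=1.50)
MAX = Rate(input_per_m=3.00, output_per_m=10.00)


def request_cost(rate: Rate, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request billed per token at `rate`."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * rate.input_per_m + output_tokens * rate.output_per_m
    return total / 1_000_000


if __name__ == "__main__":
    print(f"Saver: ${request_cost(SAVER, 2_000, 500):.5f}")
    print(f"Max:   ${request_cost(MAX, 2_000, 500):.5f}")
```

Under these assumed rates, a request with 2,000 input tokens and 500 output tokens costs about $0.00125 at Saver and about $0.011 at Max.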
Enterprise
For platform and engineering teams running production AI at scale — and the security, finance, and procurement partners who sign off on it.
What you actually pay
Your cost follows your traffic, not your worst-case model.
Most production traffic is simple. The simple tail is served at Saver rates while only the hard tail spends Max — so your bill tracks the work, not a top-tier rate on every call. Here is an illustrative mix.
| Tier | Request mix | Share | In /1M | Out /1M |
|---|---|---|---|---|
| DI Saver | Simple tail: classification, extraction, short chat | 70% | $0.25 | $1.50 |
| DI (Auto) | Everyday assistant and product traffic | 20% | Variable | Variable |
| DI Max | Hard reasoning, long context, frontier-grade answers | 10% | $3.00 | $10.00 |
92% lower input cost (Saver vs. Max)
85% lower output cost (Saver vs. Max)
Illustrative only. Shares are an example traffic mix; the simple majority runs at Saver — a fraction of an all-Max bill — while the default DI tier handles the middle at a per-request rate. Your actual mix depends on your traffic, and capability handling can still promote a request when it needs more, so a cheap simple tail never means a hard request gets shortchanged.
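To make the illustrative mix concrete, a blended per-1M rate can be estimated as a share-weighted average of the tier rates. The sketch below rests on two assumptions that the page does not state: the shares are read as shares of tokens, and the Auto rate (listed as "Variable") is replaced by a purely hypothetical placeholder of $5.00 per 1M output tokens. `blended_rate` is an illustrative helper name.

```python
def blended_rate(rates_per_m: dict, shares: dict) -> float:
    """Share-weighted average rate in USD per 1M tokens."""
    if set(rates_per_m) != set(shares):
        raise ValueError("rates and shares must cover the same tiers")
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(shares[tier] * rates_per_m[tier] for tier in shares)


# Example mix from the table, read as a share of tokens (assumption).
SHARES = {"saver": 0.70, "auto": 0.20, "max": 0.10}

# Saver and Max output rates come from the table. The Auto value is a
# hypothetical placeholder because the page lists it as "Variable".
OUTPUT_RATES = {"saver": 1.50, "auto": 5.00, "max": 10.00}

if __name__ == "__main__":
    blended = blended_rate(OUTPUT_RATES, SHARES)
    saving = 1 - blended / OUTPUT_RATES["max"]
    print(f"blended output rate: ${blended:.2f} per 1M tokens")
    print(f"below an all-Max bill by {saving:.1%}")
```

With that placeholder, the blended output rate works out to about $3.05 per 1M tokens, roughly 70% below an all-Max output rate. Substituting your own observed Auto rate and token mix changes the result accordingly.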
Versus frontier models
Published rates roughly 40–75% below the frontier labs' list prices.
A straight list-price comparison: DI's published per-1M effort-tier rates against the published list prices of Claude Haiku 4.5, Claude Opus 4.x, and gpt-5.5. Same workload, lower rate on the meter. DI never discloses which model serves a given request.
| Frontier model (list price) | Vendor | Their /1M (in / out) | DI tier | DI /1M (in / out) | DI rate is lower by |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | Anthropic | $1.00 / $5.00 | DI Saver | $0.25 / $1.50 | 75% in · 70% out |
| Claude Opus 4.x | Anthropic | $5.00 / $25.00 | DI Max | $3.00 / $10.00 | 40% in · 60% out |
| gpt-5.5 | OpenAI | $5.00 / $30.00 | DI Max | $3.00 / $10.00 | 40% in · 67% out |
Illustrative. Figures compare DI's published per-1M effort-tier rates against each vendor's published list price, accessed 2026-06-14 and subject to change. DI bills per token at the rate of whichever model serves a request; we never disclose which model that is.
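The "lower by" column follows directly from the two published rates in each row: one minus the DI rate divided by the list price, rounded to a whole percent. The short check below reproduces the table's figures; `percent_lower` is an illustrative helper name, and the rates are copied from the table as published on the stated access date.

```python
def percent_lower(di_rate: float, list_rate: float) -> int:
    """Whole-percent amount by which di_rate undercuts list_rate."""
    if list_rate <= 0:
        raise ValueError("list_rate must be positive")
    return round((1 - di_rate / list_rate) * 100)


# (model, (list input, list output), (DI input, DI output)), USD per 1M tokens.
ROWS = [
    ("Claude Haiku 4.5", (1.00, 5.00), (0.25, 1.50)),
    ("Claude Opus 4.x", (5.00, 25.00), (3.00, 10.00)),
    ("gpt-5.5", (5.00, 30.00), (3.00, 10.00)),
]

if __name__ == "__main__":
    for name, (list_in, list_out), (di_in, di_out) in ROWS:
        pct_in = percent_lower(di_in, list_in)
        pct_out = percent_lower(di_out, list_out)
        print(f"{name}: {pct_in}% in, {pct_out}% out")
```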
Lower published rates
DI's per-token rates are published below the frontier labs' list prices for comparable work. You don't trade rate for capability — the rate is simply lower on the meter.
Frontier strength only when it's needed
Most production traffic is simple, and the simple tail is served fast and cheap. Only the requests that actually need frontier-grade reasoning are served at the top tier, so you pay top-tier rates on those requests alone, not across your whole bill.
How billing works
Pay-as-you-go, by the token
Top up with a card
Add credit when you need it and draw it down per request. No monthly minimum, no contract to negotiate.
Savings on the simple tail
Trivial requests are recognized and served fast and cheap — so your blended cost drops while the same endpoint stays capable.
Cached input costs less
When a request reuses a prompt prefix, cached input is billed at a reduced rate automatically — no cache plumbing on your side.
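The effect of cached input on a bill can be sketched by splitting input tokens into a cached portion and an uncached portion, each billed at its own rate. The page does not publish the reduced cached rate, so `cached_rate_per_m` below is a parameter you would have to supply, and the $0.30 used in the example is a hypothetical placeholder. `input_cost` is an illustrative helper name.

```python
def input_cost(
    input_tokens: int,
    cached_tokens: int,
    input_rate_per_m: float,
    cached_rate_per_m: float,
) -> float:
    """USD cost of a request's input when part of the prompt is cached."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * input_rate_per_m
        + cached_tokens * cached_rate_per_m
    )
    return total / 1_000_000


if __name__ == "__main__":
    # 10,000 input tokens, 8,000 of them a reused prefix, at the Max input
    # rate of $3.00 per 1M. The cached rate of $0.30 is hypothetical.
    with_cache = input_cost(10_000, 8_000, 3.00, 0.30)
    without_cache = input_cost(10_000, 0, 3.00, 0.30)
    print(f"with cache:    ${with_cache:.4f}")
    print(f"without cache: ${without_cache:.4f}")
```

Under those assumed numbers the input cost falls from about $0.0300 to about $0.0084 for the request.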
Questions
Good to know
Which model am I paying for?
You pay for the tokens of whichever model serves each request. The DI Model is zero-knowledge by design: you see the request type, never the specific model. That lets us keep the developer experience stable while selecting capable, economical models on your behalf.
Do I need to pick an effort level per call?
No. DI (Auto), `effort: auto`, is the default. The effort hint is optional and per-request, so you can dial a single call toward `low` or `high` without touching the rest of your integration.
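A per-request hint could look something like the sketch below, which builds a request body and adds the effort value only when the caller sets one. The three values (`low`, `auto`, `high`) are the ones shown on the tier cards; the field names `input` and `effort` and the `build_request` helper are assumptions for illustration, so check the actual API reference for the real request shape.

```python
from typing import Optional

# Effort values shown on the tier cards of this page.
VALID_EFFORTS = {"low", "auto", "high"}


def build_request(prompt: str, effort: Optional[str] = None) -> dict:
    """Build a request body; field names here are hypothetical."""
    if effort is not None and effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {sorted(VALID_EFFORTS)}")
    body = {"input": prompt}
    if effort is not None:
        # Omitting the field leaves the call on the default (auto).
        body["effort"] = effort
    return body


if __name__ == "__main__":
    print(build_request("Summarize this ticket."))
    print(build_request("Prove this lemma step by step.", effort="high"))
```

Keeping the hint optional means existing calls stay unchanged, and only the calls that need a different trade-off carry the extra field.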
What happens to images, PDFs, or long context?
Capability outranks the effort level. A request with an image, a document, or oversized context is promoted to a model that can handle it, even at `effort: low`, so nothing silently fails or gets truncated.
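One way to picture "capability outranks effort" is a two-step selection: filter candidates by what the request requires, then apply the effort preference among those that remain. The sketch below is not DI's actual routing logic, which the page does not describe; the catalog, the candidate names, and the `select` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                 # placeholder label, not a real model id
    effort: str               # "low" or "high"
    capabilities: frozenset   # what the candidate can handle


# Hypothetical two-entry catalog for illustration only.
CATALOG = [
    Candidate("small-text", "low", frozenset()),
    Candidate(
        "large-multimodal",
        "high",
        frozenset({"vision", "document", "long_context"}),
    ),
]


def select(effort: str, needs: frozenset) -> Candidate:
    """Pick a candidate: capability is a hard filter, effort a preference."""
    capable = [c for c in CATALOG if needs <= c.capabilities]
    if not capable:
        raise LookupError("no candidate supports the required capabilities")
    preferred = [c for c in capable if c.effort == effort]
    # Fall back to any capable candidate when the preferred effort level
    # has none, which is the "promotion" case.
    return (preferred or capable)[0]


if __name__ == "__main__":
    print(select("low", frozenset()).name)
    print(select("low", frozenset({"vision"})).name)
```

In this toy catalog, a plain text request at low effort stays on the small candidate, while the same effort level with an image attached is promoted to the multimodal one.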
Will a renamed or unknown model id cost me an outage?
No. Unknown, legacy, and future ids resolve to a capable model instead of erroring, so a provider renaming a model does not break your code or your billing.
Ready when you are.
Start building