Why Direct Inference
The endpoint that tracks the model market so you don't.
The old way means choosing a model for every task, building retries and failover, and reworking your code each time the backend changes. Direct Inference puts model orchestration, smart routing, and task-complexity scaling behind one endpoint: every request is served by the best available model, and your existing code keeps working unchanged. That simplicity is the product: your integration stays trivial while the model market churns.
Three ways to get a model
Smart routing should not become your product surface.
Choosing a model used to mean wiring it yourself. The newest tools choose for you — but only after you turn on a router, write routing rules, and live in a model picker. Direct Inference is the step past that: smart routing is built in, with nothing to choose, enable, or configure.
The old way
Wire it yourself
Pick a model for every task, build your own retries and failover, and rework your code each time the market moves.
You own the model matrix, the plumbing, and every migration.
The current wave
Add a smart router
A router picks a model for you — once you enable it, configure routing rules, and select it from a model picker.
Choosing is faster, but it's still a router to turn on, rules to maintain, and a picker to live in.
Direct Inference
Stop choosing entirely
One endpoint covers all your use cases, the way a single frontier lab's API does. Smart routing and model orchestration are built in: no picker to live in, no rules to write.
Nothing to configure. Change one line and one key, and from then on task-complexity scaling keeps optimizing every request for you.
The old way, in detail
Wire it yourself vs. one endpoint
Same outcome you’d hand-build — without building, configuring, or maintaining any of it.
The advantage
Not choosing is the feature, not a limitation.
Letting go of the model decision isn't a sacrifice; it's the gain: less to maintain, more we can optimize on your behalf, and one endpoint that doesn't drift out from under you.
An integration that can't drift
There are no model names in your code to go stale, so a rename or retirement upstream can't quietly break a branch you forgot you wrote.
We optimize so you don't have to
Because we choose per request, we continually shift traffic to balance quality, latency, price, and availability on your behalf, and keep doing it as models and prices change. No slider to tune, no migration to run, ever.
One endpoint, not a shopping list
You commit to one durable endpoint instead of to any single lab's release cycle. Keeping up with the model market stays our job, not yours.
Operate with confidence
You still see everything that's yours.
You never have to track which model served a request — and everything else is fully visible: usage, costs, request mix, and per-application breakdowns, with hard caps you control.
Usage by workload
See how your traffic splits across the kinds of work you send — chat, documents, vision, code, reasoning — so cost and volume break down by what you're actually doing.
Per-application attribution
Traffic segments by application automatically from your request headers, so one key can power many surfaces and still break down cleanly.
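Attribution of this kind amounts to grouping usage records by an application identifier read from each request's headers. The sketch below illustrates the idea under stated assumptions and is not the service's implementation: the header name `x-app-name`, the `RequestRecord` shape, and the `attribute_by_app` helper are all hypothetical, since the page does not say which header carries the application name.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

# Hypothetical header name: the page does not document which header is read.
APP_HEADER = "x-app-name"


@dataclass
class RequestRecord:
    headers: dict[str, str]
    tokens: int
    cost_cents: int


def attribute_by_app(records: list[RequestRecord]) -> dict[str, dict[str, int]]:
    """Sum requests, tokens, and cost per application named in the request headers."""
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"requests": 0, "tokens": 0, "cost_cents": 0}
    )
    for rec in records:
        # HTTP header names are case-insensitive, so normalize before the lookup.
        normalized = {name.lower(): value for name, value in rec.headers.items()}
        app = normalized.get(APP_HEADER, "unattributed")
        bucket = totals[app]
        bucket["requests"] += 1
        bucket["tokens"] += rec.tokens
        bucket["cost_cents"] += rec.cost_cents
    return dict(totals)


# Illustrative records: one key, several applications, one request without the header.
records = [
    RequestRecord({"X-App-Name": "support-bot"}, tokens=120, cost_cents=3),
    RequestRecord({"x-app-name": "search"}, tokens=80, cost_cents=2),
    RequestRecord({"X-App-Name": "support-bot"}, tokens=30, cost_cents=1),
    RequestRecord({}, tokens=10, cost_cents=1),  # no header: counted as "unattributed"
]
report = attribute_by_app(records)
```

With these sample records, `report["support-bot"]` totals 2 requests, 150 tokens, and 4 cents, while the header-less request falls into an `"unattributed"` bucket rather than being dropped.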
Request traces
Inspect individual requests — tokens, latency, cost, and the detected request type — for the debugging visibility production actually needs.
Hard spend caps
Per-key and account-level ceilings are enforced in the request path. Once a cap is reached, further requests fail closed: they are refused instead of running up a bill.
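Failing closed means the cap check runs before a request is admitted, and a request that would cross either ceiling is refused rather than billed. A minimal sketch, assuming costs are estimated up front in integer cents; `SpendLedger`, `charge`, and `SpendCapExceeded` are hypothetical names, and treating a key with no configured cap as denied is this sketch's own fail-closed choice, not documented behavior.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class SpendCapExceeded(Exception):
    """Raised when admitting a request would cross a spend ceiling."""


@dataclass
class SpendLedger:
    account_cap_cents: int
    key_caps_cents: dict[str, int]
    account_spent_cents: int = 0
    key_spent_cents: dict[str, int] = field(default_factory=dict)

    def charge(self, api_key: str, estimated_cost_cents: int) -> None:
        """Admit a request only if both ceilings have room; otherwise refuse it."""
        key_cap = self.key_caps_cents.get(api_key)
        if key_cap is None:
            # Fail closed (assumption): a key without a configured cap gets no spend.
            raise SpendCapExceeded(f"no cap configured for key {api_key!r}")
        key_spent = self.key_spent_cents.get(api_key, 0)
        if key_spent + estimated_cost_cents > key_cap:
            raise SpendCapExceeded(f"key {api_key!r} would exceed its cap")
        if self.account_spent_cents + estimated_cost_cents > self.account_cap_cents:
            raise SpendCapExceeded("account would exceed its cap")
        # Both checks passed: record the spend against both ceilings.
        self.key_spent_cents[api_key] = key_spent + estimated_cost_cents
        self.account_spent_cents += estimated_cost_cents
```

Checking both ceilings before recording anything keeps a refused request from leaving partial spend behind: either the charge lands on the key and the account together, or on neither.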
See what each call needs
The playground shows, in real time, how each request is handled — so you can watch the endpoint do the work you no longer have to.
Pay-as-you-go balance
Top up with a card and draw it down per request, with a low-balance signal before anything stalls. No seats, no minimum, no contract.
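The drawdown model is simple arithmetic: each request subtracts its cost from a prepaid balance, and a threshold decides when to raise the low-balance signal. A minimal sketch with hypothetical names (`PrepaidBalance`, `low_balance_threshold_cents`, `InsufficientBalance`); the page does not say where the threshold sits or how the signal is delivered, so here `draw` simply returns a flag, and refusing a request that costs more than the remaining balance is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


class InsufficientBalance(Exception):
    """Raised when a request costs more than the remaining balance."""


@dataclass
class PrepaidBalance:
    balance_cents: int
    low_balance_threshold_cents: int  # hypothetical alert level

    def top_up(self, amount_cents: int) -> None:
        """Add funds, e.g. after a card payment succeeds."""
        if amount_cents <= 0:
            raise ValueError("top-up amount must be positive")
        self.balance_cents += amount_cents

    def draw(self, cost_cents: int) -> bool:
        """Deduct one request's cost and report whether the balance is now low."""
        if cost_cents > self.balance_cents:
            raise InsufficientBalance(
                f"request costs {cost_cents} cents, balance is {self.balance_cents}"
            )
        self.balance_cents -= cost_cents
        return self.balance_cents <= self.low_balance_threshold_cents
```

For example, starting from `PrepaidBalance(balance_cents=1000, low_balance_threshold_cents=200)`, drawing 500 returns `False`, and a further draw of 350 leaves 150 cents and returns `True`, which is the point where a top-up prompt would fire before requests start stalling.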
The engine never stops
And it keeps getting cheaper and smarter on its own.
Removing the model layer isn't a one-time win. The engine inside the endpoint keeps tuning every request for cost, quality, and availability — so the integration you ship today improves without you touching it.
Costs keep falling
Every request is served on the most cost-efficient capable path, and repeated context is discounted. As the model market gets cheaper, so does your bill — no renegotiation, no migration, nothing to switch on.
Quality keeps rising
Each request is served by the best available model for its shape. When a stronger model ships, your hardest traffic is already on it — no prompt rewrite, no model id to bump.
New models, absorbed for you
Releases, renames, and retirements happen behind the endpoint. You never track the model market, maintain routing rules, or run another migration.
Durability
A surface that outlasts the model market.
Improves without your involvement
Each new capable model can be folded in behind the endpoint. You inherit the upgrade without a migration, a model swap, or a release-note review.
Absorbs churn instead of forwarding it
Renames, retirements, price changes, and outages are ours to absorb — not new branches in your application code.
No lock-in to any one lab
One endpoint speaks the OpenAI, Anthropic, and Gemini SDK shapes, so your product never rides a single vendor's release cycle.
Stop integrating against the model market.
Point one client at one endpoint and let the backend stay our problem. Your existing code keeps working untouched; the churn stays on our side of the line.