Senior Inference Engineer
Own and extend the serving engine: the quality, latency, health, and price signals that decide how every request is served.
You will own the engine that decides how Direct Inference serves each request while keeping that decision invisible to the customer.
About Direct Inference
Direct Inference is the endpoint that does everything frontier models can do. Customers bring the SDK and model id they already use; Direct Inference handles capability, quality, cost, latency, failover, and provider churn behind the scenes.
The important product constraint is zero-knowledge: customers never see which model, provider, or version served a request. That lets them build on a stable surface while the model market keeps moving underneath it.
What you'll own
- Design and improve serving decisions across request types such as code, JSON, document, vision, long context, reasoning, flash, and pro traffic.
- Build quality and regression evaluation loops that catch when a serving path gets worse before customers feel it.
- Improve latency and cost behavior for routine traffic without degrading quality on hard or high-stakes requests.
- Strengthen fallback, promotion, and provider-health logic so the endpoint keeps serving through upstream change.
- Work across observability and product surfaces so operators can inspect internal serving decisions without those decisions leaking to customers.
- Turn production traces and evals into concrete engine improvements.
Projects you might ship
- Ship an eval-backed improvement to one request-type path, such as code, structured output, long-context, or document handling.
- Design a provider-health and quality signal that improves serving decisions during upstream drift or degradation.
- Build a regression harness that catches a serving-policy change before it hurts latency, quality, or cost for a real workload.
What we're looking for
- You have built or operated production ML, search, ranking, model-serving, or latency-sensitive distributed systems.
- You are comfortable with evaluation design: offline datasets, online traces, quality scoring, and failure analysis.
- You can reason about cost, latency, and quality as one system rather than as isolated metrics.
- You write careful code and value tests for behavior that is easy to regress.
- You like owning hard tradeoffs and explaining them simply.
Nice to have
- Experience with LLM evals, model-serving platforms, multi-provider inference, ranking systems, or production experimentation.
- Comfort reading production traces and turning ambiguous quality failures into measurable fixes.
- Enough product sense to know when a serving improvement needs better operator visibility, not only better code.
Your first 90 days
- Map the current serving decision path and identify the highest-leverage quality or latency improvement.
- Ship a measurable improvement to one request-type path or fallback behavior.
- Add or strengthen an eval or observability loop that future engine changes can rely on.
Benefits & support
Built for people doing serious work in a small team.
Interview process
A direct loop with the people doing the work.
Intro
A focused conversation about your background, what you want to build, and where this role should create leverage.
Technical
A practical working session around the kind of problem this role owns. We prefer realistic systems over puzzle interviews.
Team
Meet the people you would work with across product, engineering, reliability, and customer-facing work.
Offer
We align on scope, compensation, start timing, and the first problems you would take on.
Application
Apply for Senior Inference Engineer.
Share the practical context we should know before the first conversation. We read applications for ownership, clarity, and evidence of shipped work.