Platform Reliability Engineer (SRE)
Keep one endpoint dependable across a churning set of upstream providers: failover, rate-limit absorption, and the spend caps that fail closed.
You will keep Direct Inference dependable while models, providers, customer traffic, and production workloads keep changing around it.
About Direct Inference
Direct Inference is the endpoint that does everything frontier models can do. Customers bring the SDK and model ID they already use; Direct Inference handles capability, quality, cost, latency, failover, and provider churn behind the scenes.
The important product constraint is zero-knowledge: customers never see which model, provider, or version served a request. That lets them build on a stable surface while the model market keeps moving underneath it.
What you'll own
- Own production reliability across deploys, health checks, alerting, incident response, and post-incident hardening.
- Build provider-health automation and capacity controls that keep customer workloads served through upstream instability.
- Improve fail-closed spend controls, abuse handling, and runaway workload protection.
- Harden observability for latency, error rate, cost, saturation, and customer-impact analysis.
- Partner with inference engineering on fallback behavior and with product engineering on operator-facing reliability views.
- Write runbooks, tests, and automation that make production operations repeatable.
Projects you might ship
- Improve provider-health monitoring so the system can react faster to degraded upstream paths.
- Harden the deploy and verification loop so each release can be verified end to end: workflow run, running image, health check, and production state.
- Build a capacity or spend-control guardrail that prevents one customer, provider event, or runaway workload from affecting everyone else.
What we're looking for
- You have operated production systems with real uptime, latency, or financial consequences.
- You are comfortable debugging across application code, infrastructure, networking, deploy tooling, logs, and metrics.
- You know how to design alerts and runbooks that help during incidents instead of adding noise.
- You can automate recurring operational work without losing sight of customer impact.
- You care about reliability as product quality, not just infrastructure hygiene.
Nice to have
- Experience operating API infrastructure, model-serving systems, payment/billing paths, or high-traffic developer platforms.
- Comfort with incident command, postmortems, SLOs, tracing, metrics, and alert design.
- A bias toward simple operational tools that make the right thing easy under pressure.
Your first 90 days
- Own one reliability improvement from production signal to deployed fix.
- Improve a runbook, alert, or health check so incidents get clearer and shorter.
- Ship a concrete hardening project in capacity, provider health, or spend control.
Benefits & support
Built for people doing serious work in a small team.
Interview process
A direct loop with the people doing the work.
Intro
A focused conversation about your background, what you want to build, and where this role should create leverage.
Technical
A practical working session around the kind of problem this role owns. We prefer realistic systems over puzzle interviews.
Team
Meet the people you would work with across product, engineering, reliability, and customer-facing work.
Offer
We align on scope, compensation, start timing, and the first problems you would take on.
Application
Apply for Platform Reliability Engineer (SRE).
Share the practical context we should know before the first conversation. We read applications for ownership, clarity, and evidence of shipped work.