AI/ML · Full-time

Machine Learning Engineer

Advance the systems that classify requests, evaluate answer quality, and keep code, reasoning, vision, document, and structured-output traffic served by the right capability.

You will improve the intelligence layer that understands request shape and measures whether the endpoint served it well.

About Direct Inference

Direct Inference is the endpoint that does everything frontier models can do. Customers bring the SDK and model ID they already use; Direct Inference handles capability, quality, cost, latency, failover, and provider churn behind the scenes.

The important product constraint is zero-knowledge: customers never see which model, provider, or version served a request. That lets them build on a stable surface while the model market keeps moving underneath it.

What you'll own

  • Improve request classification across code, JSON, document, vision, long-context, reasoning, flash, and pro traffic.
  • Build evaluation datasets, scoring workflows, and regression checks for answer quality.
  • Analyze production traces and failure cases while preserving privacy and the zero-knowledge contract.
  • Partner with inference engineering to translate eval findings into serving improvements.
  • Develop tools that compare quality, cost, latency, and capability across candidate serving paths.

Projects you might ship

  • Build a durable eval set for one high-value request slice and wire it into a repeatable scoring workflow.
  • Improve request-type classification by finding ambiguous production examples and turning them into tested policy behavior.
  • Create analysis tooling that compares frontier baselines, serving candidates, and Direct Inference outcomes without exposing private customer data.

What we're looking for

  • You have experience with ML systems, evals, classifiers, LLM applications, ranking, or model-quality measurement.
  • You can design practical evaluations that are useful to engineering, not just nice-looking benchmarks.
  • You are comfortable with Python and production engineering constraints.
  • You can reason about model behavior, product outcomes, and data quality together.
  • You communicate uncertainty clearly and know when a metric is lying.

Nice to have

  • Experience with Langfuse, prompt/eval workflows, human review pipelines, or model-comparison tooling.
  • Practical familiarity with coding-agent, document, vision, structured-output, or reasoning workloads.
  • A taste for evals that are boring, reusable, and operationally useful.

Your first 90 days

  • Improve or expand one request-type quality evaluation suite.
  • Trace a quality regression from example failures to a concrete serving or classifier fix.
  • Add an eval workflow that helps future model or policy changes ship with confidence.

Benefits & support

Built for people doing serious work in a small team.

  • Competitive salary and meaningful equity
  • Remote-first work with San Francisco collaboration space
  • Medical, dental, and vision coverage
  • Home-office and developer-tooling budget
  • Quarterly team working sessions and offsites
  • Direct access to customers and production product decisions

Interview process

A direct loop with the people doing the work.

01 Intro

A focused conversation about your background, what you want to build, and where this role should create leverage.

02 Technical

A practical working session around the kind of problem this role owns. We prefer realistic systems over puzzle interviews.

03 Team

Meet the people you would work with across product, engineering, reliability, and customer-facing work.

04 Offer

We align on scope, compensation, start timing, and the first problems you would take on.

Application

Apply for Machine Learning Engineer.

Share the practical context we should know before the first conversation. We read applications for ownership, clarity, and evidence of shipped work.
