Epic! for Kids · February 2023 – May 2024

ML Infrastructure Rescue

Production ML platform ownership across cost, search, recommendations, and reliability.

Role: Senior Research Engineer

ML infra · Kubernetes · Search · Recommendations · Cost optimization

Executive summary

Took ownership of production ML systems after layoffs and reduced cost, complexity, and operational risk.

  • 10x ML platform cost reduction
  • 100x Kubernetes pod usage reduction
  • 99% spot instance error reduction
  • 50% Docker build-time reduction

Problem and constraints

The ML platform needed ownership across discovery, recommendations, search, Docker builds, Kubernetes usage, spot instance stability, and product experiments.

  • Keep production systems running
  • Reduce operational cost
  • Improve search relevance
  • Support backend/frontend/analytics needs

Architecture

01 ML services
02 Docker build pipeline
03 Kubernetes deployment
04 Spot compute
05 Elasticsearch autocomplete
06 Recommendations
07 A/B testing

Decision Theater

Decision fork

Scale existing infrastructure vs simplify it

Cost and reliability problems were symptoms of complexity, not just capacity.

Scale existing pattern

Pros
  • Less migration work
Cons
  • Preserves cost and fragility

Simplify usage

Pros
  • Lower cost
  • Smaller failure surface
Cons
  • Requires deeper investigation

Chosen: Simplify infrastructure. Reducing unnecessary infrastructure can be more powerful than tuning it.

Evaluation and reliability

  • Measured search relevance against the prior autocomplete solution.
  • Tracked infrastructure cost and operational error reduction.
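One way to compare a new autocomplete against a prior one is an offline ranking metric over a fixed set of prefixes. The sketch below uses mean reciprocal rank (MRR) as an assumed metric; the case study does not say which metric was used. The function names, prefixes, and suggestion lists are hypothetical illustrations, not production data.

```python
from typing import Dict, List


def reciprocal_rank(suggestions: List[str], relevant: str) -> float:
    """Return 1/rank of the relevant suggestion, or 0.0 if it is absent."""
    for rank, suggestion in enumerate(suggestions, start=1):
        if suggestion == relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    results: Dict[str, List[str]], expected: Dict[str, str]
) -> float:
    """Average reciprocal rank over every prefix in the labeled set."""
    if not expected:
        return 0.0
    total = sum(
        reciprocal_rank(results.get(prefix, []), target)
        for prefix, target in expected.items()
    )
    return total / len(expected)


# Hypothetical labeled prefixes: typed prefix -> title the user wanted.
expected = {"dino": "dinosaurs", "spa": "space"}

# Hypothetical ranked suggestions from the prior and the new system.
baseline = {"dino": ["dinner", "dinosaurs"], "spa": ["spain", "spark", "space"]}
candidate = {"dino": ["dinosaurs", "dinner"], "spa": ["space", "spain"]}

baseline_mrr = mean_reciprocal_rank(baseline, expected)
candidate_mrr = mean_reciprocal_rank(candidate, expected)
print(f"baseline MRR={baseline_mrr:.3f}, candidate MRR={candidate_mrr:.3f}")
```

Holding the labeled prefix set fixed across both systems keeps the comparison fair: any difference in the score comes from ranking quality rather than from a change in the queries.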

Observability and debugging

  • Used production metrics and error behavior to prioritize high-impact infrastructure fixes.
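Prioritizing by error behavior can be sketched as a simple Pareto ranking: count errors per source and see how few sources account for most of the volume. This is an illustrative assumption about the approach, not the actual tooling. The function name, error labels, and counts below are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_error_sources(events: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Rank error sources by frequency.

    Returns (source, count, cumulative share of all errors) tuples,
    most frequent first.
    """
    counts = Counter(events)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    for source, count in counts.most_common():
        cumulative += count
        ranked.append((source, count, cumulative / total))
    return ranked


# Hypothetical error labels, as they might be extracted from production logs.
events = ["spot_preemption"] * 6 + ["image_pull_timeout"] * 3 + ["oom_kill"]

ranked = rank_error_sources(events)
for source, count, share in ranked:
    print(f"{source:20s} count={count:2d} cumulative={share:.0%}")
```

In this made-up sample, the top two sources cover 90% of errors, which is the kind of signal that would justify fixing them before anything else.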

Reflection

Senior ML ownership often means cleaning up cost, reliability, and product feedback loops, not only training models.

This case study uses sanitized architecture and representative examples. It excludes confidential prompts, customer data, proprietary datasets, private implementation details, and internal traces.