Overview

We're building an AI super-app: one interface where users compare the world's top models side by side instead of juggling five subscriptions. Now we're adding the AI layer that makes it smarter, faster, and more useful than any single model alone.

What you'll build

Design and ship AI-powered features end-to-end — RAG pipelines, agentic workflows, tool-use systems, and structured outputs in production
Build and maintain evaluation harnesses, datasets, and quality metrics — latency, cost, accuracy, and hallucination rates across every model we serve
Own the full lifecycle of AI features from prototype to hardened, evaluated, production-grade system
Drive technical decisions on prompt design, model selection, retrieval architecture, and AI system safety

Who you are

4+ years of engineering experience with significant recent work on LLM-powered systems in production
Deep hands-on expertise in prompt engineering, RAG, function calling, and structured outputs — not just familiarity
Strong in Python and the modern AI stack — Hugging Face, LangChain/LlamaIndex, vector DBs, model APIs
You've shipped AI features that had no playbook — evaluated them, iterated on them, and owned the results

Why this role

You're building the intelligence layer of a product that puts every major AI model in one place — the eval and retrieval problems here are uniquely complex
Early stage, real users — your work ships to people immediately, not into a staging environment
If you want to own AI architecture end-to-end rather than prompt-tune someone else's pipeline, this is that role