Overview
We're building an AI super-app: one interface where users compare the world's top models side by side instead of juggling five subscriptions. The next step is the AI layer that makes it smarter, faster, and more useful than any single model alone.
What you'll build
- Design and ship AI-powered features end-to-end — RAG pipelines, agentic workflows, tool-use systems, and structured outputs in production
- Build and maintain evaluation harnesses, datasets, and quality metrics — latency, cost, accuracy, and hallucination rates across every model we serve
- Own the full lifecycle of AI features, from prototype to a hardened, evaluated, production-grade system
- Drive technical decisions on prompt design, model selection, retrieval architecture, and AI system safety
Who you are
- 4+ years of engineering experience with significant recent work on LLM-powered systems in production
- Deep hands-on expertise in prompt engineering, RAG, function calling, and structured outputs — not just familiarity
- Strong in Python and the modern AI stack — Hugging Face, LangChain/LlamaIndex, vector DBs, model APIs
- You've shipped AI features that had no playbook — evaluated them, iterated on them, and owned the results
Why this
- You're building the intelligence layer of a product that puts every major AI model in one place, which makes the eval and retrieval problems unusually complex
- Early stage, real users — your work ships to people immediately, not into a staging environment
- If you want to own AI architecture end-to-end rather than prompt-tune someone else's pipeline, this is that role