<aside>
A practical framework for building AI products from zero to production
Seven stages. Every decision, what to look for, and why it matters.
I built and refined this framework through hands-on AI product work and deep research, borrowing from the best practices of industry leaders.
When to Use This
- Starting a new AI product or feature from scratch.
- Evaluating whether AI is the right solution for a problem.
- Structuring an AI product team's workflow end-to-end.
- Preparing for AI product leadership interviews.
- Auditing an existing AI product for gaps in eval, trust, or safety.
---
config:
layout: elk
elk:
nodePlacementStrategy: NETWORK_SIMPLEX
---
flowchart TD
subgraph Frame["<b>1. Frame</b>"]
direction LR
F1("Problem")
F2("Who Hurts")
F3("AI Durability Check")
F4("Smallest Proof")
F1-->F2-->F3-->F4
end
subgraph Spec["<b>2. Spec</b>"]
direction LR
S1("Inputs")
S2("Outputs")
S3("Data Strategy")
S4("Success Criteria")
S5("Model Eval")
S6("Product Metrics")
S1-->S2-->S3-->S4-->S5-->S6
end
subgraph Arch["<b>3. Architect & Design</b>"]
direction LR
A1("Agent Shape")
A2("Context Strategy")
A3("Tools")
A4("HITL Boundaries")
A5("Safety Boundaries")
A6("UX Flows")
A7("Design System")
A1-->A2-->A3-->A4-->A5-->A6-->A7
end
subgraph Build["<b>4. Build</b>"]
direction LR
B1("AI Scaffolds")
B2("Prompt Engineering")
B3("Tool Integration")
B4("Human Steers Spec")
B1-->B2-->B3-->B4
end
subgraph Eval["<b>5. Eval</b>"]
direction LR
E1("Real Inputs")
E2("Failure Modes")
E3("Prompt Gaps")
E4("Tool Gaps")
E5("Cost / Latency Gates")
E1-->E2-->E3-->E4-->E5
end
subgraph Polish["<b>6. Polish</b>"]
direction LR
P1("Explainability")
P2("Trust Signals")
P3("Edge Cases")
P4("Error States")
P5("Undo")
P1-->P2-->P3-->P4-->P5
end
subgraph Ship["<b>7. Ship & Measure</b>"]
direction LR
M1("Leading: Usage")
M2("Lagging: Outcomes")
M3("Feedback Capture")
M1-->M2-->M3
end
Frame e1@==> Spec
Spec e2@==> Arch
Arch e3@==> Build
Build e4@==> Eval
Eval e5@==> Polish
Polish e6@==> Ship
Ship e7@==> Eval
e1@{ animation: slow }
e2@{ animation: slow }
e3@{ animation: slow }
e4@{ animation: slow }
e5@{ animation: slow }
e6@{ animation: slow }
e7@{ animation: slow }
</aside>
<aside>
1. Frame
Before writing a single prompt or choosing a model, get ruthlessly clear on what you're solving and for whom. This stage prevents the most expensive mistake in AI: building something impressive that nobody needs.
Problem
- State the problem in one sentence from the user's perspective, not the team's
- A good problem statement is specific and observable: "Sales reps spend 3 hours per deal writing follow-up emails" not "We need AI in our sales workflow"
- If you can't articulate the problem without mentioning AI, you don't have a problem yet
<aside>
<img src="/icons/light-bulb_orange.svg" alt="/icons/light-bulb_orange.svg" width="40px" />
Key Principle
If you can't articulate the problem without mentioning AI, you don't have a problem yet - you have a technology looking for a home.
</aside>
Who Hurts
- Identify the specific person whose day gets better when this is solved
- Map their current workflow step by step - where is the pain, friction, or wasted time?
- Talk to them. The biggest risk in AI products is building for an imagined user instead of a real one
- Prioritize: who feels this pain most acutely and most frequently?
AI Durability Check
- Ask: "Will the next model generation solve this automatically?" If yes, don't build it
- LLMs are improving at an unprecedented rate - context windows, reasoning, multimodal capabilities all expand with each release
- Focus on problems where your value comes from domain-specific context, proprietary data, or workflow integration - not from raw model capability
- Example: building a workaround for short context windows in 2023 was wasted effort by 2024. Building deep meeting-context understanding (like Granola) was not
Smallest Proof
- Define the cheapest, fastest experiment that proves the AI can deliver value for this problem
- This is not an MVP - it's a proof of concept. Can the AI do the core job at all?
- Use the most expensive, cutting-edge model for the proof. Optimize cost later. Right now you're testing feasibility, not economics
- Set a clear pass/fail bar before running the experiment
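The steps above can be sketched as a tiny harness. This is illustrative only: `call_model` is a stand-in for whatever frontier-model API you run the proof with, and the grading function and pass bar are hypothetical examples — the point is that both are fixed before the run.

```python
# Minimal proof-of-concept harness: run the core job on a handful of real
# examples and apply a pass/fail bar that was decided *before* the run.

def call_model(example: str) -> str:
    # Placeholder standing in for your most capable (and most expensive) model.
    return example.upper()

def run_proof(examples, grade, pass_bar: float) -> bool:
    """True if the share of passing outputs meets the pre-set bar."""
    passed = sum(grade(ex, call_model(ex)) for ex in examples)
    return passed / len(examples) >= pass_bar

# Grade by a simple observable check; the 80% bar is fixed up front.
ok = run_proof(
    examples=["draft follow-up", "summarize call"],
    grade=lambda ex, out: out.startswith(ex.split()[0].upper()),
    pass_bar=0.8,
)
```

If `ok` is false, the AI can't do the core job yet — stop before investing in an MVP.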
</aside>
<aside>
2. Spec
Translate the validated problem into a precise technical and product specification. This is where most AI projects silently fail - vague specs produce vague AI behaviour.
Inputs
- Define exactly what data the AI receives for each interaction
- Map every input source: user text, uploaded files, database records, API responses, conversation history
- Specify format, quality, and volume expectations for each input
- Identify what's missing - the inputs you wish you had but don't. These gaps shape your data strategy
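One way to force this precision is to write the input contract as a typed structure. A minimal sketch, assuming Python dataclasses — the field names and limits are illustrative, not prescriptive:

```python
# Every input source the AI receives per interaction, made explicit.
# Anything you can't fill in here is a gap that shapes your data strategy.
from dataclasses import dataclass, field

@dataclass
class InteractionInputs:
    user_text: str                                            # required, e.g. <= 4k chars
    uploaded_files: list[str] = field(default_factory=list)   # file paths
    db_records: list[dict] = field(default_factory=list)      # database lookups
    history: list[str] = field(default_factory=list)          # prior conversation turns

inputs = InteractionInputs(user_text="Summarize my last call with Acme.")
```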
<aside>
<img src="/icons/light-bulb_orange.svg" alt="/icons/light-bulb_orange.svg" width="40px" />
Key Principle
- Model eval asks: "Is the AI technically performing well?"
- Product metrics ask: "Is the user's life actually better?" These are different questions.
</aside>
Outputs
- Define what the AI produces and in what format
- Be specific: "a 3-bullet summary with action items tagged by owner" not "a summary"
- Specify output constraints: length limits, required fields, tone, language
- Define what a bad output looks like - this is as important as defining a good one
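A spec this concrete can be checked mechanically. Here is a hedged sketch of a validator for the example spec above ("a 3-bullet summary with action items tagged by owner"); the schema and checks are illustrative assumptions:

```python
# Validate an output against the spec. Returning the violations, rather than
# a bare pass/fail, makes "what a bad output looks like" explicit.

def validate_output(summary: list[str], action_items: list[dict]) -> list[str]:
    """Return a list of spec violations; empty means the output conforms."""
    violations = []
    if len(summary) != 3:
        violations.append("summary must have exactly 3 bullets")
    for item in action_items:
        if not item.get("owner"):
            violations.append(f"action item missing owner: {item.get('text')}")
    return violations

bad = validate_output(["one bullet"], [{"text": "send deck"}])
```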
Data Strategy
- Audit what data you have, what you need, and the gap between them
- Address data quality: AI output quality is bounded by input data quality. Garbage in, garbage out applies more to AI than to any other technology
- Plan for data labeling, cleaning, and enrichment. Budget for it - this is often 60-80% of the work in AI projects
- Consider data freshness: how often does the data change, and how does stale data affect output quality?
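Freshness can be enforced as a simple gate before data reaches the model. An illustrative sketch — the weekly staleness budget is an assumption, not a recommendation:

```python
# Flag records older than the agreed staleness budget so stale data
# never silently degrades output quality.
from datetime import datetime, timedelta, timezone

STALENESS_BUDGET = timedelta(days=7)  # assumption: weekly refresh cadence

def is_fresh(updated_at: datetime, now: datetime) -> bool:
    """True if the record is within the agreed staleness budget."""
    return now - updated_at <= STALENESS_BUDGET

now = datetime(2024, 6, 15, tzinfo=timezone.utc)
fresh = is_fresh(datetime(2024, 6, 10, tzinfo=timezone.utc), now)
stale = is_fresh(datetime(2024, 5, 1, tzinfo=timezone.utc), now)
```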
Success Criteria
- Define "done" in measurable terms before building anything
- Set thresholds: "The AI must correctly extract the right action items in 85%+ of cases" not "The AI should be good at extracting action items"
- Align success criteria with the user's definition of value, not the team's definition of technically interesting
- Include failure tolerance: what error rate is acceptable, and what types of errors are unacceptable?
Model Eval
- Design how you'll evaluate the AI model's raw performance, separate from product metrics
- Build eval datasets: curated sets of inputs with known-good outputs that you run against every change
- Eval types to consider: accuracy, consistency (same input should produce similar output), latency, hallucination rate, instruction-following
- Automate evals in CI/CD - manual spot-checking doesn't scale and misses regressions
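An automated eval gate can be as small as this sketch. Assumptions: `EVAL_SET` stands in for your curated dataset of inputs with known-good outputs, `model` is a stub for the real call, and the 85% threshold mirrors the success criteria above:

```python
# Run the eval set against every change and fail the build on regression.

EVAL_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
]

def model(prompt: str) -> str:
    # Stub standing in for the model under test.
    return {"2 + 2": "4", "capital of France": "Paris"}[prompt]

def run_evals(threshold: float = 0.85) -> float:
    """Exit non-zero if accuracy drops below the threshold (fails CI)."""
    correct = sum(model(p) == expected for p, expected in EVAL_SET)
    accuracy = correct / len(EVAL_SET)
    if accuracy < threshold:
        raise SystemExit(f"eval regression: {accuracy:.0%} < {threshold:.0%}")
    return accuracy

acc = run_evals()
```

In practice you would run consistency, latency, and hallucination evals the same way — one gate per eval type, all wired into CI.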
Product Metrics
- Define the business outcomes the AI product must move - these are different from model eval
- Model eval asks: "Is the AI technically performing well?" Product metrics ask: "Is the user's life actually better?"
</aside>