<aside>
A practical framework for building AI products from zero to production
Seven stages. Every decision, what to look for, and why it matters.
I built and refined this framework through hands-on AI product work and deep research, borrowing from the best practices of industry leaders.
When to Use This
- Starting a new AI product or feature from scratch.
- Evaluating whether AI is the right solution for a problem.
- Structuring an AI product team's workflow end-to-end.
- Preparing for AI product leadership interviews.
- Auditing an existing AI product for gaps in eval, trust, or safety.
---
config:
layout: elk
elk:
nodePlacementStrategy: NETWORK_SIMPLEX
---
flowchart TD
subgraph Frame["<b>1. Frame</b>"]
direction LR
F1("Problem")
F2("Who Hurts")
F3("AI Durability Check")
F4("Smallest Proof")
F1-->F2-->F3-->F4
end
subgraph Spec["<b>2. Spec</b>"]
direction LR
S1("Inputs")
S2("Outputs")
S3("Data Strategy")
S4("Success Criteria")
S5("Model Eval")
S6("Product Metrics")
S1-->S2-->S3-->S4-->S5-->S6
end
subgraph Arch["<b>3. Architect & Design</b>"]
direction LR
A1("Agent Shape")
A2("Context Strategy")
A3("Tools")
A4("HITL Boundaries")
A5("Safety Boundaries")
A6("UX Flows")
A7("Design System")
A1-->A2-->A3-->A4-->A5-->A6-->A7
end
subgraph Build["<b>4. Build</b>"]
direction LR
B1("AI Scaffolds")
B2("Prompt Engineering")
B3("Tool Integration")
B4("Human Steers Spec")
B1-->B2-->B3-->B4
end
subgraph Eval["<b>5. Eval</b>"]
direction LR
E1("Real Inputs")
E2("Failure Modes")
E3("Prompt Gaps")
E4("Tool Gaps")
E5("Cost / Latency Gates")
E1-->E2-->E3-->E4-->E5
end
subgraph Polish["<b>6. Polish</b>"]
direction LR
P1("Explainability")
P2("Trust Signals")
P3("Edge Cases")
P4("Error States")
P5("Undo")
P1-->P2-->P3-->P4-->P5
end
subgraph Ship["<b>7. Ship & Measure</b>"]
direction LR
M1("Leading: Usage")
M2("Lagging: Outcomes")
M3("Feedback Capture")
M1-->M2-->M3
end
Frame e1@==> Spec
Spec e2@==> Arch
Arch e3@==> Build
Build e4@==> Eval
Eval e5@==> Polish
Polish e6@==> Ship
Ship e7@==> Eval
e1@{ animation: slow }
e2@{ animation: slow }
e3@{ animation: slow }
e4@{ animation: slow }
e5@{ animation: slow }
e6@{ animation: slow }
e7@{ animation: slow }
</aside>
<aside>
1. Frame
Before writing a single prompt or choosing a model, get ruthlessly clear on what you're solving and for whom. This stage prevents the most expensive mistake in AI: building something impressive that nobody needs.
Problem
- State the problem in one sentence from the user's perspective, not the team's
- A good problem statement is specific and observable: "Sales reps spend 3 hours per deal writing follow-up emails" not "We need AI in our sales workflow"
- If you can't articulate the problem without mentioning AI, you don't have a problem yet
<aside>
<img src="/icons/light-bulb_orange.svg" alt="/icons/light-bulb_orange.svg" width="40px" />
Key Principle
If you can't articulate the problem without mentioning AI, you don't have a problem yet - you have a technology looking for a home.
</aside>
Who Hurts
- Identify the specific person whose day gets better when this is solved
- Map their current workflow step by step - where is the pain, friction, or wasted time?
- Talk to them. The biggest risk in AI products is building for an imagined user instead of a real one
- Prioritize: who feels this pain most acutely and most frequently?
AI Durability Check
- Ask: "Will the next model generation solve this automatically?" If yes, don't build it
- LLMs are improving at an unprecedented rate - context windows, reasoning, multimodal capabilities all expand with each release
- Focus on problems where your value comes from domain-specific context, proprietary data, or workflow integration - not from raw model capability
- Example: building a workaround for short context windows in 2023 was wasted effort by 2024. Building deep meeting-context understanding (like Granola) was not
Smallest Proof
- Define the cheapest, fastest experiment that proves the AI can deliver value for this problem
- This is not an MVP - it's a proof of concept. Can the AI do the core job at all?
- Use the most expensive, cutting-edge model for the proof. Optimize cost later. Right now you're testing feasibility, not economics
- Set a clear pass/fail bar before running the experiment
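The steps above can be sketched as a tiny harness. This is illustrative only: `call_model` is a stand-in for whatever frontier-model API you run the proof with, and the grading function and pass bar are hypothetical examples — the point is that both are fixed before the run.

```python
# Minimal proof-of-concept harness: run the core job on a handful of real
# examples and apply a pass/fail bar that was decided *before* the run.

def call_model(example: str) -> str:
    # Placeholder standing in for your most capable (and most expensive) model.
    return example.upper()

def run_proof(examples, grade, pass_bar: float) -> bool:
    """True if the share of passing outputs meets the pre-set bar."""
    passed = sum(grade(ex, call_model(ex)) for ex in examples)
    return passed / len(examples) >= pass_bar

# Grade by a simple observable check; the 80% bar is fixed up front.
ok = run_proof(
    examples=["draft follow-up", "summarize call"],
    grade=lambda ex, out: out.startswith(ex.split()[0].upper()),
    pass_bar=0.8,
)
```

If `ok` is false, the AI can't do the core job yet — stop before investing in an MVP.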
</aside>
<aside>
2. Spec
Translate the validated problem into a precise technical and product specification. This is where most AI projects silently fail - vague specs produce vague AI behaviour.
Inputs
- Define exactly what data the AI receives for each interaction
- Map every input source: user text, uploaded files, database records, API responses, conversation history
- Specify format, quality, and volume expectations for each input
- Identify what's missing - the inputs you wish you had but don't. These gaps shape your data strategy
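One way to force this precision is to write the input contract as a typed structure. A minimal sketch, assuming Python dataclasses — the field names and limits are illustrative, not prescriptive:

```python
# Every input source the AI receives per interaction, made explicit.
# Anything you can't fill in here is a gap that shapes your data strategy.
from dataclasses import dataclass, field

@dataclass
class InteractionInputs:
    user_text: str                                            # required, e.g. <= 4k chars
    uploaded_files: list[str] = field(default_factory=list)   # file paths
    db_records: list[dict] = field(default_factory=list)      # database lookups
    history: list[str] = field(default_factory=list)          # prior conversation turns

inputs = InteractionInputs(user_text="Summarize my last call with Acme.")
```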
<aside>
<img src="/icons/light-bulb_orange.svg" alt="/icons/light-bulb_orange.svg" width="40px" />
Key Principle
- Model eval asks: "Is the AI technically performing well?"
- Product metrics ask: "Is the user's life actually better?" These are different questions.
</aside>
Outputs
- Define what the AI produces and in what format
- Be specific: "a 3-bullet summary with action items tagged by owner" not "a summary"
- Specify output constraints: length limits, required fields, tone, language
- Define what a bad output looks like - this is as important as defining a good one
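A spec this concrete can be checked mechanically. Here is a hedged sketch of a validator for the example spec above ("a 3-bullet summary with action items tagged by owner"); the schema and checks are illustrative assumptions:

```python
# Validate an output against the spec. Returning the violations, rather than
# a bare pass/fail, makes "what a bad output looks like" explicit.

def validate_output(summary: list[str], action_items: list[dict]) -> list[str]:
    """Return a list of spec violations; empty means the output conforms."""
    violations = []
    if len(summary) != 3:
        violations.append("summary must have exactly 3 bullets")
    for item in action_items:
        if not item.get("owner"):
            violations.append(f"action item missing owner: {item.get('text')}")
    return violations

bad = validate_output(["one bullet"], [{"text": "send deck"}])
```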
Data Strategy
- Audit what data you have, what you need, and the gap between them
- Address data quality: AI output quality is bounded by input data quality. Garbage in, garbage out applies more to AI than to any other technology
- Plan for data labeling, cleaning, and enrichment. Budget for it - this is often 60-80% of the work in AI projects
- Consider data freshness: how often does the data change, and how does stale data affect output quality?
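Freshness can be enforced as a simple gate before data reaches the model. An illustrative sketch — the weekly staleness budget is an assumption, not a recommendation:

```python
# Flag records older than the agreed staleness budget so stale data
# never silently degrades output quality.
from datetime import datetime, timedelta, timezone

STALENESS_BUDGET = timedelta(days=7)  # assumption: weekly refresh cadence

def is_fresh(updated_at: datetime, now: datetime) -> bool:
    """True if the record is within the agreed staleness budget."""
    return now - updated_at <= STALENESS_BUDGET

now = datetime(2024, 6, 15, tzinfo=timezone.utc)
fresh = is_fresh(datetime(2024, 6, 10, tzinfo=timezone.utc), now)
stale = is_fresh(datetime(2024, 5, 1, tzinfo=timezone.utc), now)
```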
Success Criteria
- Define "done" in measurable terms before building anything
- Set thresholds: "The AI must correctly extract the right action items in 85%+ of cases" not "The AI should be good at extracting action items"
- Align success criteria with the user's definition of value, not the team's definition of technically interesting
- Include failure tolerance: what error rate is acceptable, and what types of errors are unacceptable?
Model Eval
- Design how you'll evaluate the AI model's raw performance, separate from product metrics
- Build eval datasets: curated sets of inputs with known-good outputs that you run against every change
- Eval types to consider: accuracy, consistency (same input should produce similar output), latency, hallucination rate, instruction-following
- Automate evals in CI/CD - manual spot-checking doesn't scale and misses regressions
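An automated eval gate can be as small as this sketch. Assumptions: `EVAL_SET` stands in for your curated dataset of inputs with known-good outputs, `model` is a stub for the real call, and the 85% threshold mirrors the success criteria above:

```python
# Run the eval set against every change and fail the build on regression.

EVAL_SET = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
]

def model(prompt: str) -> str:
    # Stub standing in for the model under test.
    return {"2 + 2": "4", "capital of France": "Paris"}[prompt]

def run_evals(threshold: float = 0.85) -> float:
    """Exit non-zero if accuracy drops below the threshold (fails CI)."""
    correct = sum(model(p) == expected for p, expected in EVAL_SET)
    accuracy = correct / len(EVAL_SET)
    if accuracy < threshold:
        raise SystemExit(f"eval regression: {accuracy:.0%} < {threshold:.0%}")
    return accuracy

acc = run_evals()
```

In practice you would run consistency, latency, and hallucination evals the same way — one gate per eval type, all wired into CI.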
Product Metrics
- Define the business outcomes the AI product must move - these are different from model eval
- Model eval asks: "Is the AI technically performing well?" Product metrics ask: "Is the user's life actually better?"
</aside>