
PART 10: The Developer Mindset Shift

From the Google ADK whitepaper: the most important mental model.


The Fundamental Shift

Old paradigm (bricklayer): You define every logical step precisely. Code is deterministic.

New paradigm (director): You set the scene (system prompt), cast the roles (tools and agents), and provide the context (data). Your job is to guide an autonomous actor to deliver the performance.

BRICKLAYER                          DIRECTOR
"Step 1: do X"                      "You are a helpful agent.
"Step 2: if Y, do Z"                 Your goal is X.
"Step 3: else do W"                  You have these tools.
"Step 4: format as..."               Make good decisions."
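The director column can be sketched as a plain data structure. Nothing below is ADK-specific; the class and tool names are illustrative, showing only that the developer supplies a scene, a cast, and context rather than a step-by-step script.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative sketch: a "director" defines the scene, cast, and context,
# then hands control to the model instead of scripting each step.
@dataclass
class AgentSpec:
    system_prompt: str                                         # set the scene
    tools: dict[str, Callable] = field(default_factory=dict)   # cast the roles
    context: dict = field(default_factory=dict)                # provide the data

def lookup_order(order_id: str) -> str:
    """A hypothetical tool the agent may choose to call."""
    return f"order {order_id}: shipped"

agent = AgentSpec(
    system_prompt=(
        "You are a helpful support agent. Your goal is to resolve "
        "order questions. You have these tools. Make good decisions."
    ),
    tools={"lookup_order": lookup_order},
    context={"user_id": "u-123"},
)
```

Note what is missing: there is no `if Y, do Z` branch. The control flow lives in the model, and the developer's leverage is in the prompt, the tools, and the context.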

The Determinism Table

| Traditional Software | AI Systems |
| --- | --- |
| Deterministic output | Probabilistic output |
| Tests pass or fail | Evals score on a spectrum |
| Fix bugs by reading code | Fix failures by improving prompts + data |
| Version control = code | Version control = code + prompts + evals |
| Deploy once, monitor for errors | Deploy continuously, monitor for quality drift |
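The "pass or fail" versus "score on a spectrum" row can be made concrete. The keyword-overlap scorer below is a toy stand-in for a real grader (an LLM judge or task-specific metric); the point is that partial credit is meaningful.

```python
def eval_score(response: str, required_facts: list[str]) -> float:
    """Score on a spectrum: the fraction of required facts the response
    mentions. A toy grader, not a production metric."""
    text = response.lower()
    hits = sum(1 for fact in required_facts if fact.lower() in text)
    return hits / len(required_facts)

# A traditional test is binary; an eval gives partial credit.
score = eval_score(
    "Your order shipped on Tuesday.",
    ["shipped", "Tuesday", "tracking"],
)
# score is 2/3: two of the three facts are present.
```

A spectrum score is what lets you say "this prompt change moved us from 0.71 to 0.78" instead of "some tests flipped".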

Prompts Are Code

Treat prompts with the same rigour as code:

  • Version control them
  • Review them in PRs
  • Test them against eval datasets
  • Document why they changed

A prompt change is a behaviour change. Merge it like one.
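One minimal sketch of this discipline: keep the prompt in a tracked file and pin it by content hash, so every eval run records exactly which prompt version it measured. The names and record shape here are illustrative, not from the whitepaper.

```python
import hashlib

# Illustrative: a versioned prompt, reviewed in PRs like any other source file.
PROMPT_V2 = """You are a helpful support agent.
Your goal is to resolve order questions."""

def prompt_fingerprint(text: str) -> str:
    """Short, stable id for a prompt version; log it with every eval run
    so a quality change can be traced to a specific prompt diff."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

# e.g. an eval log line might record:
# {"prompt_id": prompt_fingerprint(PROMPT_V2), "eval_score": 0.87}
```

With the fingerprint in the eval logs, "what changed?" has a mechanical answer: diff the prompt whose hash the score is attached to.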


Comprehensive Evaluations Outweigh the Prompt

"The caveat: Comprehensive evaluations outweigh the prompt. You can't just write a good prompt and ship. You must measure, evaluate, and iterate on agent behavior systematically." (Google ADK Whitepaper)

A good prompt with no evals is a guess. A mediocre prompt with rigorous evals is a system you can improve.

The eval dataset is the most important asset in an AI product. It is the ground truth for all decisions. Build it from day one, grow it with every production failure.
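"Grow it with every production failure" can be a one-function habit. The sketch below stores the golden dataset as JSONL; the schema and field names are illustrative assumptions, not a prescribed format.

```python
import json
from pathlib import Path

def add_failure_case(path: Path, user_input: str, expected_behaviour: str) -> None:
    """Turn a production failure into a permanent regression case by
    appending it to the golden dataset (one JSON object per line)."""
    record = {
        "input": user_input,
        "expected": expected_behaviour,
        "source": "production",  # provenance: this case cost us something real
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_golden(path: Path) -> list[dict]:
    """Load every case for the next eval run."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

Append-only JSONL is a deliberate choice here: cases are cheap to add from an incident channel, diff cleanly in version control, and never silently disappear.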


The Non-Determinism Trap

Common Mistake

Building integration tests that assert exact output. "The agent should respond with exactly: 'Your order has been processed.'" This will be brittle and fail constantly. Test semantics, not strings.
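Side by side, the brittle assertion and a semantic check look like this. The keyword checks are a deliberately lightweight stand-in for stronger semantic checks (an LLM-as-judge or embedding similarity); the contrast with exact string matching is the point.

```python
# Brittle: the model will rarely reproduce this string verbatim.
# assert response == "Your order has been processed."

def semantically_ok(response: str) -> bool:
    """Test the meaning we care about, not the exact wording.
    A toy semantic check; real systems use judges or similarity metrics."""
    text = response.lower()
    return "order" in text and ("processed" in text or "complete" in text)

assert semantically_ok("Great news! Your order has been processed.")
assert semantically_ok("Your order is complete and on its way.")
assert not semantically_ok("I cannot help with that.")
```

All three phrasings above would flip an exact-match test; the semantic check survives rewording and still catches a genuine failure.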


What Good Looks Like

A mature AI engineering team:

  • Ships eval improvements before shipping prompt improvements
  • Has a golden dataset that grows with every production failure
  • Treats a score drop in CI as a blocker, not a warning
  • Can answer "what changed?" when quality drops in production
  • Has a CI/CD pipeline that can swap models without architectural overhaul (because models are superseded every 6 months)
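"A score drop in CI is a blocker" is a few lines of pipeline code. This is a sketch: the baseline source, tolerance value, and failure mechanism are illustrative choices, not a prescribed setup.

```python
# Illustrative CI gate: a drop beyond the tolerance fails the build
# instead of printing a warning someone can ignore.
BASELINE_SCORE = 0.90   # from the last accepted eval run
TOLERANCE = 0.02        # how much regression the team will absorb

def ci_gate(current_score: float) -> None:
    """Raise (non-zero exit in CI) if quality regressed past tolerance."""
    drop = BASELINE_SCORE - current_score
    if drop > TOLERANCE:
        raise SystemExit(
            f"Eval score dropped {drop:.3f} below baseline: blocking merge"
        )

ci_gate(0.89)  # within tolerance: the build proceeds
```

The same gate runs unchanged when the model is swapped: if the new model clears the eval bar, the swap ships; if not, the pipeline blocks it, with no architectural change required.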

Recall Hook

You are not writing code that does things. You are writing specifications that guide a probabilistic system. The discipline is different. The rigour is the same.



Built from real deployments. Not theory.