PART 9 - Agent Ops: Production Lifecycle
Agents in production are not static software. They are living systems that need continuous management.
Why Traditional DevOps Fails
Traditional software test: assert output == expected_output -> PASS/FAIL
Agent test: "Is this response good?" -> it depends. Agents are stochastic: the same input produces different outputs on different runs. You cannot unit-test them like a pure function.
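A minimal sketch of the contrast; call_agent and judge_score are toy stand-ins for a real agent runtime and a real LLM judge (see section 3).

import random

def call_agent(prompt: str) -> str:
    """Toy stand-in for an agent runtime; output varies from run to run."""
    return random.choice(["Order #123 cancelled.", "I cancelled order 123 for you."])

def judge_score(response: str) -> int:
    """Toy stand-in for an LLM judge; returns a 1-5 quality score."""
    return 5 if "cancel" in response.lower() else 1

# Traditional software: deterministic, exact-match assertion.
assert 2 + 2 == 4  # same input, same output, every run

# Agent: stochastic, so sample several runs and assert on the score distribution.
scores = [judge_score(call_agent("Cancel order #123")) for _ in range(10)]
assert sum(scores) / len(scores) >= 4.0  # threshold on a 1-5 scale, not exact match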
The Agent Ops Stack
1. Define Metrics (Business KPIs, Not Just Technical)
- Goal completion rate, user satisfaction, task latency, cost per interaction
- Revenue impact, conversion, retention
- NOT just: tokens per second, uptime
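A minimal sketch of computing these KPIs from interaction logs; the log schema (goal_completed, cost_usd, latency_s, csat) is illustrative, not a standard.

interactions = [  # sampled from production logs (values illustrative)
    {"goal_completed": True,  "cost_usd": 0.042, "latency_s": 3.1, "csat": 5},
    {"goal_completed": False, "cost_usd": 0.118, "latency_s": 9.4, "csat": 2},
    {"goal_completed": True,  "cost_usd": 0.037, "latency_s": 2.8, "csat": 4},
]

n = len(interactions)
kpis = {
    "goal_completion_rate": sum(i["goal_completed"] for i in interactions) / n,
    "cost_per_interaction_usd": sum(i["cost_usd"] for i in interactions) / n,
    "avg_task_latency_s": sum(i["latency_s"] for i in interactions) / n,
    "avg_user_satisfaction": sum(i["csat"] for i in interactions) / n,
}
print(kpis)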
2. Build Evaluation Datasets ("Golden Sets")
- Sample from real production interactions
- Cover the full range of use cases + edge cases
- Domain expert review before using as ground truth
- Treat the eval dataset as your most important asset: it is the ground truth for all decisions
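One common shape for a golden set is JSONL, one case per line, so it grows indefinitely and diffs cleanly in version control. A minimal sketch; the field names (id, input, expected_behavior, tags) are illustrative. Note that expected_behavior is a reviewed description of correct behavior, not an exact output string.

import json

golden_cases = [
    {"id": "refund-001",
     "input": "I want a refund for order #881",
     "expected_behavior": "Looks up the order, confirms eligibility, files the refund",
     "tags": ["refunds", "happy-path"]},
    {"id": "refund-017",
     "input": "refund??? u charged me twice!!!",
     "expected_behavior": "Stays calm, checks for duplicate charges, escalates if confirmed",
     "tags": ["refunds", "edge-case", "frustrated-user"]},
]

with open("golden_set.jsonl", "w") as f:
    for case in golden_cases:
        f.write(json.dumps(case) + "\n")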
3. Use an LLM-as-Judge
You cannot assert an exact match, so use a model to score quality:
"Does this answer correctly resolve the user's intent? Score 1-5 with reasoning."Automated scoring at scale. No human reviewer needed for every response.
4. Metrics-Driven Deployment
Deploy new model/prompt version
|
Run against full eval set
|
Compare scores to production version
|
Go/No-Go decision (score regression = blocker)
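A minimal sketch of that gate, reusing judge() from section 3 and golden_set.jsonl from section 2; run_agent is a hypothetical stand-in for invoking a deployed agent version.

import json

golden_set = [json.loads(line) for line in open("golden_set.jsonl")]

def run_agent(version: str, user_input: str) -> str:
    raise NotImplementedError  # placeholder: route the input to that agent version

def mean_judge_score(version: str) -> float:
    scores = [judge(c["input"], run_agent(version, c["input"]))["score"]
              for c in golden_set]
    return sum(scores) / len(scores)

candidate = mean_judge_score("candidate")
production = mean_judge_score("production")
if candidate < production:  # score regression = deployment blocker
    raise SystemExit(f"NO-GO: {candidate:.2f} < {production:.2f}")
print(f"GO: {candidate:.2f} >= {production:.2f}")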
5. Trace with OpenTelemetry
For debugging (not performance dashboards). A trace shows:
- Exact prompt sent to the model
- Model reasoning steps
- Tool chosen + parameters passed
- Raw tool results
- Where in the loop the failure happened
Platforms: Google Cloud Trace, LangFuse, Datadog
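A minimal sketch of wrapping each tool call in an OpenTelemetry span, so a failed run shows the tool chosen, the parameters passed, and the raw result. Exporter setup is omitted (LangFuse, Cloud Trace, and Datadog can all consume OpenTelemetry data); the TOOLS registry is illustrative.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("agent")

TOOLS = {"rag_query": lambda query: f"top documents for: {query}"}  # illustrative

def call_tool(name: str, **params):
    with tracer.start_as_current_span(f"tool.{name}") as span:
        span.set_attribute("tool.params", str(params))
        result = TOOLS[name](**params)
        span.set_attribute("tool.result", str(result)[:500])  # truncate raw output
        return result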
6. Close the Feedback Loop
User reports bad answer
|
Replicate the failure
|
Add to eval dataset
|
Every bug becomes a test case
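A minimal sketch of the last step: once the failure is replicated, append it to the golden set so every future deployment re-checks it. Field names match the illustrative schema from section 2.

import json
from datetime import date

def add_failure_to_golden_set(user_input: str, expected_behavior: str, bug_id: str):
    case = {
        "id": f"bug-{bug_id}",
        "input": user_input,                     # verbatim from the bad interaction
        "expected_behavior": expected_behavior,  # written during failure triage
        "tags": ["regression", f"reported-{date.today().isoformat()}"],
    }
    with open("golden_set.jsonl", "a") as f:
        f.write(json.dumps(case) + "\n")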
The CI/CD for Agents
Code change / New model / Prompt update
|
Run against eval dataset
|
LLM-as-Judge scores comparison
New version vs. Production version
|
Latency + Cost + Quality all pass?
|
Deploy to production
|
Monitor -> Collect feedback -> Update eval dataset
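A minimal sketch of the combined gate. The 10% latency and cost headroom is an illustrative policy, and eval_run / prod_baseline are hypothetical summaries of an eval run.

def ci_gate(eval_run: dict, prod_baseline: dict) -> bool:
    checks = {
        "quality": eval_run["judge_score"] >= prod_baseline["judge_score"],
        "latency": eval_run["p95_latency_s"] <= prod_baseline["p95_latency_s"] * 1.1,
        "cost": eval_run["cost_per_interaction"] <= prod_baseline["cost_per_interaction"] * 1.1,
    }
    for name, passed in checks.items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
    return all(checks.values())  # any single failure blocks the deploy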
The Complete Production Architecture
PRODUCTION AI AGENT SYSTEM
+-------------------------------------------------------------------+
| USER / OTHER SYSTEMS |
| Web UI Mobile A2A client API |
+----------------------------+--------------------------------------+
|
+----------------------------v--------------------------------------+
| AGENT RUNTIME (ADK / Vertex AI Agent Engine) |
| |
| +------------------+ +---------------------------------------+ |
| | RUNNER | | SERVICES | |
| | (Orchestrator) | | SessionService -> PostgreSQL/Vertex | |
| | Event Loop | | ArtifactService -> GCS/S3 | |
| | Context mgmt | | MemoryService -> Mem0/Vector DB | |
| +-------+----------+ +---------------------------------------+ |
| | |
| +-------v-------------------------------------------------+ |
| | AGENT PIPELINE | |
| | OrchestratorAgent (Coordinator) | |
| | +-- ResearchAgent (SequentialAgent) | |
| | | +-- before_tool_callback (auth check) | |
| | | +-- Tool: RAG query | |
| | | +-- Tool: NL2SQL | |
| | +-- DraftAgent (LlmAgent) | |
| | | +-- before_model_callback (PII scrub) | |
| | | +-- Tool: document_writer | |
| | +-- ReviewAgent (LlmAgent) | |
| | +-- after_model_callback (content filter) | |
| +---------------------------------------------------------+ |
+----------------------------+--------------------------------------+
|
+----------------------------v--------------------------------------+
| INFERENCE LAYER |
| +----------------------+ +-------------------------------+ |
| | CACHING STACK | | MODEL ROUTING | |
| | Semantic Cache | | Complex -> Gemini 2.5 Pro | |
| | Prompt Cache | | Simple -> Gemini 2.5 Flash | |
| | KV Cache (GPU) | | Images -> Specialized APIs | |
| +----------------------+ +-------------------------------+ |
+----------------------------+--------------------------------------+
|
+----------------------------v--------------------------------------+
| AGENT OPS |
| Eval datasets LLM judges OpenTelemetry traces |
| CI/CD pipeline Model upgrade automation |
| Human feedback loop -> new test cases |
+-------------------------------------------------------------------+
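To make the inference layer concrete, a minimal sketch of the model-routing idea: a cheap model for simple requests, a stronger model for complex ones. The complexity heuristic here is deliberately naive; production routers often use a small classifier model.

def route_model(request: str, has_images: bool = False) -> str:
    if has_images:
        return "specialized-image-api"  # placeholder name
    is_complex = len(request) > 500 or "step by step" in request.lower()
    return "gemini-2.5-pro" if is_complex else "gemini-2.5-flash"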
Key Ops Rules
- Models rotate roughly every 6 months. You need a CI/CD pipeline that can swap the model without architectural changes (a config sketch follows this list).
- Every production failure becomes a test case. The eval dataset grows with every bug.
- Score regression = deployment blocker. Treat it like a failing build.
- Monitor cost per interaction, not just quality. Quality at 10× cost is not production-ready.
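A minimal sketch of rule 1: keep the model ID in configuration, not code, so a model upgrade is a config change plus a full eval run, never an architectural change.

import os

# CI/CD sets AGENT_MODEL_ID; swapping models never touches agent code.
MODEL_ID = os.environ.get("AGENT_MODEL_ID", "gemini-2.5-flash")

def build_agent() -> dict:
    # Illustrative: every call site reads the model from config.
    return {"model": MODEL_ID, "tools": ["rag_query", "nl2sql"]}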
Recall Hook
For agents, "testing" means evaluation datasets + LLM judges, not assert statements. Every bug is a new test case.
Sources
- Google ADK Whitepaper: Introduction to Agents (Agent Ops section)
- Google ADK Documentation: Evaluation, Tracing
- LangFuse: langfuse.com
Have an eval strategy that actually works in production? Share it here; specific frameworks and metrics welcome.