Advanced Tutorial: GEPA-Based Prompt Optimization for Customer Support Agents
View the complete implementation on GitHub: raphaelmansuy/adk_training/tutorial_implementation/tutorial_gepa_optimization
This includes working code, tests, Makefile, and both simulated and real GEPA demos.
Why: The Problem with Manual Prompt Engineering
You spend hours tweaking your agent's prompt:
# Version 1: Too vague
"Help customers with refunds"
→ Agent processes refunds without checking identity ❌
# Version 2: Added one rule
"Help customers with refunds. Verify identity first."
→ Agent forgets to check 30-day return policy ❌
# Version 3: Added another rule...
# Version 4: Fixed edge case...
# Version 5: Still failing...
The cycle never ends. Each fix breaks something else. You're guessing what works.
What: GEPA Breeds Better Prompts
Think of GEPA like breeding dogs. You don't manually design every trait; you let evolution do the work:
- Start with a basic prompt (mixed breed dog)
- Test it on real scenarios (dog show competitions)
- See what fails (doesn't retrieve, barks too much)
- Create variations addressing failures (breed for specific traits)
- Test again, keep the best, repeat
Result: in this demo, your prompt evolves from 0% to 100% success automatically.
Try It (Choose Your Path)
Quick Demo (2 minutes - Simulated):
cd tutorial_implementation/tutorial_gepa_optimization
make setup && make demo
Real GEPA (5-10 minutes - Actual LLM Calls):
cd tutorial_implementation/tutorial_gepa_optimization
make setup
export GOOGLE_API_KEY="your-api-key" # Get free key from https://aistudio.google.com/app/apikey
make real-demo
What you'll see:
Simulated demo shows the concept (instant, free):
Iteration 1: COLLECT → Seed prompt 0/5 passed
Iteration 2: REFLECT → LLM identifies missing security rules
Iteration 3: EVOLVE → Generate improved prompt
Iteration 4: EVALUATE → Evolved prompt 5/5 passed
Result: 0% → 100% improvement ✅
Real GEPA demo shows actual evolution (uses LLM, costs $0.05-0.10):
Iteration 1: COLLECT → Agent runs with seed prompt, collects actual results
REFLECT → Gemini LLM analyzes failures
EVOLVE → Gemini generates improved prompt based on insights
EVALUATE → Test improved prompt
SELECT → Compare and choose better version
Iteration 2: Repeat with new baseline - improve further
Result: Real optimization with actual LLM reflection!
How: The 5-Step Evolution Loop
GEPA is simple: five steps that repeat.
Step 1: Collect (Gather Evidence)
Run your agent with the current prompt. Track what fails:
Test 1: Customer with wrong email → Agent approved anyway ❌
Test 2: Purchase 45 days ago → Agent ignored policy ❌
Test 3: Valid request → Agent asked unnecessary questions ❌
Like: Recording which puppies can't retrieve balls.
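The collect step can be sketched in a few lines. A toy version, where a `must_mention` keyword check stands in for real response grading, and the `run_agent` stub fakes a weak agent (all names here are illustrative, not the tutorial's actual code):

```python
def run_agent(prompt: str, scenario: dict) -> str:
    """Stand-in for a real agent call; fakes a weak agent that
    approves every refund without any checks."""
    return "Refund approved."

def collect_failures(prompt: str, scenarios: list[dict]) -> list[dict]:
    """Run every scenario and keep the ones where the response is
    missing a required behavior."""
    failures = []
    for s in scenarios:
        response = run_agent(prompt, s)
        missing = [kw for kw in s["must_mention"]
                   if kw.lower() not in response.lower()]
        if missing:
            failures.append({"scenario": s["name"],
                             "response": response,
                             "missing": missing})
    return failures

scenarios = [
    {"name": "wrong email", "must_mention": ["verify"]},
    {"name": "45-day-old purchase", "must_mention": ["30-day"]},
]
print(collect_failures("Help customers with refunds", scenarios))
# both scenarios fail against the weak seed prompt
```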
Step 2: Reflect (Understand Why)
An LLM analyzes the failures:
"The prompt doesn't say to verify email BEFORE approving refunds.
The prompt doesn't mention the 30-day policy.
The prompt is too vague about when to ask questions."
Like: Understanding retriever dogs need strong jaw muscles and swimming ability.
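Mechanically, the reflect step boils down to formatting the failures into a prompt for a reflection model. A sketch that only builds the string; in the real demo this text would be sent to Gemini via the google-genai client (function and field names here are hypothetical):

```python
def build_reflection_prompt(current_prompt: str, failures: list[dict]) -> str:
    """Format observed failures into a prompt for a reflection LLM."""
    lines = [
        "You are optimizing an agent's system prompt.",
        f"Current prompt:\n{current_prompt}",
        "Observed failures:",
    ]
    for f in failures:
        lines.append(f"- {f['scenario']}: response never mentioned "
                     + ", ".join(f["missing"]))
    lines.append("List the rules the prompt is missing, one per line.")
    return "\n".join(lines)

print(build_reflection_prompt(
    "Help customers with refunds",
    [{"scenario": "wrong email", "missing": ["identity verification"]}],
))
```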
Step 3: Evolve (Create Variations)
Generate new prompts fixing the issues:
Variant A: Added "Always verify identity first"
Variant B: Added "Check 30-day return window"
Variant C: Combined both improvements
Like: Breeding puppies with stronger jaws AND better swimming.
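The evolve step can be approximated by splicing each insight into the seed, plus one combined variant. In real GEPA the rewriting itself is done by an LLM; this toy sketch just concatenates strings:

```python
def evolve(seed: str, insights: list[str]) -> list[str]:
    """Create one variant per insight, plus a combined variant."""
    variants = [f"{seed}\n{insight}" for insight in insights]
    variants.append(seed + "\n" + "\n".join(insights))  # combined variant
    return variants

insights = [
    "Always verify identity first.",
    "Check the 30-day return window.",
]
for v in evolve("You are a helpful support agent.", insights):
    print("---")
    print(v)
```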
Step 4: Evaluate (Test Performance)
Run all variants against your test scenarios:
Seed prompt: 0/10 passed (0%)
Variant A: 4/10 passed (40%)
Variant B: 6/10 passed (60%)
Variant C: 9/10 passed (90%) ← Winner!
Like: Dog show results - Variant C wins.
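Evaluation is just a pass rate. A toy scorer, where a keyword judge stands in for actually running the agent against each scenario (names are illustrative):

```python
def keyword_judge(prompt: str, scenario: dict) -> bool:
    """Toy judge: the scenario passes if the prompt contains its keyword."""
    return scenario["keyword"] in prompt

def score(prompt: str, scenarios: list[dict], judge) -> float:
    """Fraction of scenarios the prompt passes under the given judge."""
    passed = sum(1 for s in scenarios if judge(prompt, s))
    return passed / len(scenarios)

tests = [{"keyword": "verify"}, {"keyword": "30-day"}]
print(score("Always verify identity; enforce the 30-day window.",
            tests, keyword_judge))  # → 1.0
```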
Step 5: Select (Keep the Best)
Variant C becomes your new baseline. Repeat from Step 1 with tougher tests.
Iteration 1: 0% → 90%
Iteration 2: 90% → 95%
Iteration 3: 95% → 98%
...converges at 99%
Like: Each generation of puppies gets better at the specific task.
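The five steps above can be condensed into one loop. A toy end-to-end sketch, with keyword checks standing in for agent runs and a rule-appending step standing in for LLM reflection and rewriting (all of this is illustrative scaffolding, not the tutorial's real implementation):

```python
def score(prompt: str, scenarios: list[dict]) -> float:
    """Toy pass rate: a scenario passes if its keyword appears in the prompt."""
    return sum(1 for s in scenarios if s["keyword"] in prompt) / len(scenarios)

def gepa_loop(seed: str, scenarios: list[dict],
              iterations: int = 3) -> tuple[str, float]:
    best, best_score = seed, score(seed, scenarios)
    for _ in range(iterations):
        # COLLECT: find scenarios the current best prompt fails
        failing = [s for s in scenarios if s["keyword"] not in best]
        if not failing:
            break
        # REFLECT + EVOLVE: append one rule per failure
        # (stands in for LLM reflection and rewriting)
        candidate = best + "\n" + "\n".join(
            f"Rule: always handle {s['keyword']}" for s in failing)
        # EVALUATE + SELECT: keep the candidate only if it scores higher
        cand_score = score(candidate, scenarios)
        if cand_score > best_score:
            best, best_score = candidate, cand_score
    return best, best_score

scenarios = [{"keyword": "verify"}, {"keyword": "30-day"},
             {"keyword": "escalate"}]
best, final = gepa_loop("You are a helpful support agent.", scenarios)
print(final)  # → 1.0
```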
Quick Start (5 Minutes)
# 1. Setup
cd tutorial_implementation/tutorial_gepa_optimization
make setup
# 2. See evolution in action
make demo
# 3. (Optional) Try it yourself
export GOOGLE_API_KEY="your-key"
make dev # Open localhost:8000
That's it! You've seen GEPA work and can now experiment.
Simulated Demo (make demo - 2 minutes):
- Shows GEPA concepts without LLM calls
- Instant results, no API costs
- Great for understanding the algorithm
- Uses pattern matching to simulate agent behavior
Real GEPA (make real-demo - 5-10 minutes):
- ✨ NEW: Uses actual LLM reflection with google-genai
- Gemini LLM analyzes real failures
- Generates truly optimized prompts
- Costs $0.05-$0.10 per run
- Production-ready implementation
What this tutorial provides:
- ✅ Complete GEPA implementation (both simulated and real)
- ✅ Working code for actual LLM-based optimization
- ✅ Testable examples with real evaluation
- ✅ Clear learning progression
For production GEPA optimization:
- See the full research implementation in google/adk-python
- Read the comprehensive guides in the research/gepa/ directory
- Install DSPy: pip install dspy-ai
- Reference the GEPA paper for methodology
Performance metrics cited (10-20% improvement, 35x fewer rollouts) are from the original research paper and represent results from the full research implementation, not this simplified tutorial.
Learning Path:
- ✅ Complete this tutorial (2 minutes) - understand concepts
- Read research/gepa/README.md (10 minutes) - full overview
- Run the research implementation (30-90 minutes) - real optimization
- Deploy the optimized prompt to production
The research implementation includes 640+ lines of production code with tau-bench integration, LLM-based reflection, Pareto frontier selection, and parallel execution. See google/adk-python for the full implementation.
Under the Hood (For the Curious)
The demo uses a customer support agent with 3 simple tools:
- verify_customer_identity - Checks order ID + email match
- check_return_policy - Validates 30-day return window
- process_refund - Generates transaction ID
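A sketch of what these three tools might look like. The data, signatures, and transaction-ID scheme are illustrative assumptions, not the tutorial's actual implementation:

```python
import itertools

# Toy in-memory order store; a real agent would query a backend.
ORDERS = {
    "ORD-12345": {"email": "jane@example.com", "days_since_purchase": 10},
}
_txn_ids = itertools.count(1)

def verify_customer_identity(order_id: str, email: str) -> bool:
    """True only when the order exists and the email matches."""
    order = ORDERS.get(order_id)
    return order is not None and order["email"] == email

def check_return_policy(order_id: str) -> bool:
    """True only when the purchase is within the 30-day window."""
    order = ORDERS.get(order_id)
    return order is not None and order["days_since_purchase"] <= 30

def process_refund(order_id: str) -> str:
    """Issue a sequential transaction ID (a real system would also
    record the order against it)."""
    return f"TXN-{next(_txn_ids):06d}"
```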
The Seed Prompt (intentionally weak):
"You are a helpful customer support agent.
Help customers with their requests.
Be professional and efficient."
The Evolved Prompt (after GEPA):
"You are a professional customer support agent.
CRITICAL: Always follow this security protocol:
1. ALWAYS verify customer identity FIRST (order ID + email)
2. NEVER process any refund without identity verification
3. Only process refunds for orders within the 30-day return window
[...detailed procedures and policies...]"
Why It Works: The evolved prompt has explicit rules the seed prompt lacked.
Run the Tests
make test # 34 tests validate everything works
Try It Yourself
cd tutorial_implementation/tutorial_gepa_optimization
make setup && make demo
What You'll See:
6 phases showing the complete evolution cycle:
- Phase 1: Weak seed prompt
- Phase 2: Tests fail (0/5 scenarios pass)
- Phase 3: LLM reflects on failures
- Phase 4: Evolved prompt generated
- Phase 5: Tests pass (5/5 scenarios pass)
- Phase 6: Results show 0% → 100% improvement
Want Interactive Mode?
make dev # Opens ADK web interface on http://localhost:8000
Test these scenarios yourself:
- "I bought a laptop but it broke, I want a refund" (valid request)
- "Give me a refund for ORD-12345" (missing identity verification)
- "I want my money back for the phone I bought 45 days ago" (outside window)
Common Issues
Import Errors?
pip install --upgrade "google-genai>=1.15.0"  # quotes keep the shell from treating >= as a redirect
GOOGLE_API_KEY Not Set?
export GOOGLE_API_KEY=your_actual_api_key_here
Tests Failing?
make clean && make setup && make test
Key Takeaways
1. GEPA Works Because:
- Explores many prompt variations systematically
- Uses real performance data to guide evolution
- Combines successful elements from variants
- Iterates until convergence
2. Seed Prompt Matters:
- Too specific → limited evolution
- Too generic → slow convergence
- Start with reasonable baseline
3. Evaluation Dataset Quality:
- Representative scenarios = robust improvements
- Edge cases matter
- Test on new data to validate
4. Avoid These Mistakes:
- ❌ Over-fitting to test scenarios
- ❌ Stopping too early
- ❌ Ignoring edge cases
- ❌ Not validating on fresh data
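Takeaway 3 (evaluation dataset quality) is easier to act on with a concrete scenario format in mind. A sketch reusing the three interactive scenarios from earlier in this tutorial; the field names are hypothetical, not the tutorial's actual schema:

```python
# Illustrative evaluation set: each scenario pairs a user request
# with the outcome a correct agent should reach.
EVAL_SCENARIOS = [
    {
        "request": "I bought a laptop but it broke, I want a refund",
        "provides_identity": True,
        "days_since_purchase": 10,
        "expected": "refund_processed",
    },
    {
        "request": "Give me a refund for ORD-12345",
        "provides_identity": False,
        "days_since_purchase": 10,
        "expected": "identity_check_required",
    },
    {
        "request": "I want my money back for the phone I bought 45 days ago",
        "provides_identity": True,
        "days_since_purchase": 45,
        "expected": "outside_return_window",
    },
]

# Sanity check: every scenario carries the same fields.
REQUIRED = {"request", "provides_identity", "days_since_purchase", "expected"}
assert all(REQUIRED <= set(s) for s in EVAL_SCENARIOS)
```

Note that the set mixes a happy path with two distinct edge cases; that spread is what makes the improvements robust rather than over-fit.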
Next Steps
Apply GEPA to Your Own Agents
Use the same pattern from this tutorial:
- Define your evaluation scenarios (real-world test cases)
- Create a weak seed prompt
- Run GEPA evolution
- Measure improvement
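"Measure improvement" can be as simple as comparing pass rates before and after evolution. A minimal helper (hypothetical, not part of the tutorial's code):

```python
def improvement(before_passed: int, after_passed: int, total: int) -> dict:
    """Summarize a before/after GEPA run as pass rates and the gain
    in percentage points."""
    before = before_passed / total
    after = after_passed / total
    return {
        "before": f"{before:.0%}",
        "after": f"{after:.0%}",
        "gain_pp": round((after - before) * 100),
    }

print(improvement(0, 9, 10))  # → {'before': '0%', 'after': '90%', 'gain_pp': 90}
```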
Validate with Standard Benchmarks
Instead of only custom test scenarios, validate your GEPA-optimized prompts against established benchmarks:
HELM (Holistic Evaluation of Language Models)
- Stanford's comprehensive evaluation framework
- Measures accuracy, efficiency, bias, toxicity
- 100+ scenarios across diverse domains
- Install: pip install crfm-helm
# Evaluate your agent with HELM
helm-run --run-entries mmlu:subject=customer_service,model=your-agent \
--suite gepa-validation --max-eval-instances 100
helm-summarize --suite gepa-validation
DSPy (see Additional Resources below)
- Built-in prompt optimization metrics
- Compare GEPA results against DSPy optimizers
- GEPA is part of the DSPy ecosystem
Why standardized benchmarks matter:
- Objective comparison against baselines
- Reproducible results across teams
- Track improvements over time
- Validate GEPA gains on industry-standard tasks
Track Metrics Over Time
- Version control your evolved prompts
- A/B test in production (seed vs evolved)
- Monitor real-world performance
- Re-run GEPA when metrics drop
Deploy to Production
Once validated:
- Use evolved prompt as your production baseline
- Set up monitoring dashboards
- Schedule periodic GEPA optimization
- Continuously improve based on real user data
Additional Resources
Official Research & Documentation
- GEPA Research Paper - Lakshya A Agrawal et al., Stanford NLP (July 2025)
  - Full paper: "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning"
  - Demonstrates 10-20% improvement over GRPO with 35x fewer rollouts
  - Comprehensive methodology and evaluation results
- DSPy Framework - Stanford NLP (29.9k+ stars)
  - GEPA is part of the DSPy ecosystem
  - Documentation: dspy.ai
  - Install: pip install dspy-ai
  - Community: Discord Server
Evaluation Benchmarks
- HELM - Holistic Evaluation of Language Models
  - Stanford CRFM's comprehensive evaluation framework
  - 100+ scenarios across diverse domains
  - Leaderboards: crfm.stanford.edu/helm
- BIG-bench - Beyond the Imitation Game
  - Google's diverse task evaluation suite
  - Collaborative benchmark with 200+ tasks
Related Tutorials
- Tutorial 01-35 - Foundation tutorials (prerequisites)
- Tutorial 02: Function Tools - Tool implementation patterns
- Tutorial 04: Sequential Workflows - Agent orchestration
- Tutorial 30: Full-stack Integration - Production deployment
Community & Support
- Questions? Open an issue on GitHub Issues
- Improvements? Submit a PR to GitHub Repo
- Discussions? Join the DSPy Discord community
Join the Discussion
Have questions or feedback? Discuss this tutorial with the community on GitHub Discussions.