Week 08: Observability, Evals & Capstone
What You'll Learn
The final differentiator. You'll instrument your system with Langfuse and LangSmith, write evals with pytest, set up alerting, and ship your capstone project — a complete production AI system that demonstrates real engineering judgment.
Session Schedule
| Day | Time | Focus |
|---|---|---|
| Saturday | 8:00 - 11:00 PM WAT | Observability & Evaluation |
| Sunday | 8:00 - 11:00 PM WAT | Capstone Review & Presentations |
Prerequisites
- ALL weeks completed
- CORTEX project at M2+
- Deployed API from Week 07
Topics Covered
Langfuse & LangSmith Tracing
Trace setup, span tracking, cost attribution, latency analysis. See exactly what your agents are doing and how much it costs.
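The Langfuse SDK handles tracing for you; as a mental model of what a trace captures, here is a minimal hand-rolled sketch (all class and field names are hypothetical, not Langfuse's API) that records spans with latency, token counts, and cost:

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

@dataclass
class Trace:
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, input_tokens=0, output_tokens=0, cost_usd=0.0):
        """Time a unit of work and attach token/cost metadata to it."""
        s = Span(name, 0.0, input_tokens, output_tokens, cost_usd)
        start = time.perf_counter()
        try:
            yield s
        finally:
            s.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(s)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

trace = Trace()
with trace.span("retrieve", input_tokens=120):
    pass  # retrieval call would go here
with trace.span("generate", input_tokens=900, output_tokens=250, cost_usd=0.0031):
    pass  # LLM call would go here
```

Real Langfuse spans nest into a tree and ship to a dashboard; the point here is just which fields per-span cost attribution needs.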
LLM Evaluation Frameworks
LLM-as-judge, pairwise comparison, rubric-based scoring. Measure quality systematically instead of relying on vibes.
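For rubric-based scoring, the judge model is usually prompted to return one score per criterion as JSON, which your eval code then parses and collapses into a single number. A sketch, with a hypothetical rubric and weights:

```python
import json

# Hypothetical rubric: per-criterion weights, summing to 1.
RUBRIC = {"groundedness": 0.5, "completeness": 0.3, "clarity": 0.2}

def parse_judge_response(raw: str) -> dict:
    """The judge is prompted to return JSON with one 1-5 score per criterion."""
    scores = json.loads(raw)
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"judge omitted criteria: {missing}")
    return scores

def weighted_score(scores: dict) -> float:
    """Collapse per-criterion 1-5 scores into a single 0-1 quality score."""
    raw = sum(RUBRIC[c] * scores[c] for c in RUBRIC)
    return (raw - 1) / 4  # map the [1, 5] scale onto [0, 1]

judge_output = '{"groundedness": 5, "completeness": 4, "clarity": 4}'
score = weighted_score(parse_judge_response(judge_output))  # 0.875
```

Validating the judge's JSON before scoring matters in practice: judges drift off-format, and a silent `KeyError` in an eval run is worse than a loud one.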
pytest for AI Systems
Deterministic tests, golden dataset tests, flaky test handling, fixtures. Build a test suite that catches regressions before users do.
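A golden dataset test pins known-good queries to expected properties of the answer, so regressions surface in CI. A minimal sketch; the dataset entries and the `answer` stub are hypothetical stand-ins for `tests/evaluation/golden_dataset.json` and your real pipeline:

```python
import pytest

# In the real suite these cases load from golden_dataset.json.
GOLDEN = [
    {"query": "What is our refund window?",
     "expected_keywords": ["30 days", "receipt"]},
    {"query": "How do I reset my password?",
     "expected_keywords": ["reset link", "email"]},
]

def answer(query: str) -> str:
    """Stand-in for the real RAG pipeline call."""
    canned = {
        "What is our refund window?":
            "Refunds are accepted within 30 days with a receipt.",
        "How do I reset my password?":
            "We email you a reset link right away.",
    }
    return canned[query]

@pytest.mark.parametrize("case", GOLDEN, ids=lambda c: c["query"][:30])
def test_golden(case):
    # Keyword checks are deterministic, so the test is not flaky even
    # though the underlying LLM output is not.
    response = answer(case["query"]).lower()
    for kw in case["expected_keywords"]:
        assert kw.lower() in response, f"missing {kw!r}"
```

Checking for required keywords (or citations, or JSON shape) keeps these tests deterministic; fuzzier quality judgments belong in the LLM-judge suite, not in CI gates.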
Cost Monitoring & Optimization
Token tracking, cost per query, budget alerts, model routing for cost. Keep your AI system profitable, not just functional.
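Cost per query is just token counts times per-token prices, and routing is a policy over that. A sketch with made-up model names and prices (real provider prices vary and change often, so load them from config, not constants):

```python
# Hypothetical per-million-token prices in USD; real prices differ by provider.
PRICES = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 3.00, "output": 15.00},
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call, from token counts and per-million prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def route(query: str, word_threshold: int = 50) -> str:
    """Naive cost router: short queries go to the cheap model.
    Production routers use classifiers or heuristics on query intent."""
    return "small-model" if len(query.split()) < word_threshold else "large-model"

cost = query_cost("large-model", input_tokens=1200, output_tokens=400)  # 0.0096
```

With per-query cost in hand, a budget alert is a threshold check over a rolling sum; the interesting engineering is in attribution (which feature, which user) rather than the arithmetic.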
Capstone Project Review
Architecture review, code review, demo preparation, portfolio packaging. Polish your capstone into something you're proud to show employers.
Weekly Build: Full Production System
Ship your capstone: a complete production AI system with observability, evaluation dashboard, and portfolio-ready documentation.
Architecture
CAPSTONE SYSTEM
|
├── Agent Layer (LangGraph supervisor + specialists)
├── RAG Layer (7-layer pipeline + hybrid search)
├── Memory Layer (Redis + PostgreSQL)
├── API Layer (FastAPI + auth + rate limiting)
├── Async Layer (Celery + webhooks)
|
v
OBSERVABILITY
├── Langfuse: trace every query
├── Cost Dashboard: $/query tracking
├── Eval Suite: P@5, MRR, LLM judge
└── Alerting: latency + error rate
|
v
PORTFOLIO
├── README with architecture diagram
├── Benchmark numbers (P@5 ≥ 0.70)
├── Docker: one-command deploy
└── 3-min video demo (optional)
Key Files
| File | Purpose |
|---|---|
| observability/langfuse_setup.py | Langfuse trace configuration |
| observability/cost_dashboard.py | Cost tracking dashboard |
| tests/evaluation/test_llm_judge.py | LLM-as-judge evaluation tests |
| tests/evaluation/golden_dataset.json | Golden dataset for regression testing |
| README.md | Portfolio writeup |
Resources
Required Reading
- Langfuse Documentation — Tracing & Evaluation
- LangSmith Documentation — Testing & Monitoring
- Hamel Husain — "Your AI Product Needs Evals"
Code Repository
Clone the bootcamp repo and switch to the week-08 branch:
git clone https://github.com/softbricks-academy/agentic-ai-bootcamp.git
cd agentic-ai-bootcamp
git checkout week-08
Session Recording
Recording will be available within 24 hours after the live session. Check the WhatsApp group for the link.
Homework
Final submissions — due by end of bootcamp.
- Complete capstone and push final code — all layers working, tests passing, deployed to cloud
- Record 3-minute demo video — use Loom to walk through your system architecture and live demo
- Write README with architecture diagram — include benchmark numbers (P@5, MRR, latency)
- Submit capstone for review — share repo link and deployed URL in the WhatsApp group