Week 08: Observability, Evals & Capstone

Build: Capstone project — full production system with end-to-end observability

What You'll Learn

The final differentiator. You'll instrument your system with Langfuse and LangSmith, write evals with pytest, set up alerting, and ship your capstone project — a complete production AI system that demonstrates real engineering judgment.

Session Schedule

Day        Time                  Focus
Saturday   8:00 - 11:00 PM WAT   Observability & Evaluation
Sunday     8:00 - 11:00 PM WAT   Capstone Review & Presentations

Pre-Requisites

  • ALL weeks completed
  • CORTEX project at M2+
  • Deployed API from Week 07

Topics Covered

Langfuse & LangSmith Tracing

Trace setup, span tracking, cost attribution, latency analysis. See exactly what your agents are doing and how much it costs.
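The Langfuse SDK handles this wiring for you, but the core idea — wrap each LLM call in a span that records latency and token cost — can be sketched without the SDK. The `traced` decorator, `MODEL_PRICES` table, and in-memory `TRACE_LOG` below are illustrative stand-ins, not Langfuse's API:

```python
import functools
import time

# Illustrative per-1K-token prices; real values come from your provider's price sheet.
MODEL_PRICES = {"gpt-4o-mini": {"input": 0.00015, "output": 0.0006}}

TRACE_LOG = []  # Langfuse would ship these spans to its server, not keep them in memory


def traced(name):
    """Record a span for the wrapped call: name, latency, and token cost."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)  # expected to return a dict with token counts
            prices = MODEL_PRICES[result["model"]]
            cost = (result["input_tokens"] / 1000 * prices["input"]
                    + result["output_tokens"] / 1000 * prices["output"])
            TRACE_LOG.append({
                "span": name,
                "latency_s": time.perf_counter() - start,
                "cost_usd": cost,
            })
            return result
        return wrapper
    return decorator


@traced("answer_query")
def answer_query(question):
    # Stand-in for a real LLM call; returns the fields the tracer needs.
    return {"model": "gpt-4o-mini", "input_tokens": 120,
            "output_tokens": 80, "text": "stub answer"}
```

Summing `cost_usd` across spans in a trace gives you the per-query cost attribution; sorting by `latency_s` points you at the slow spans.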

LLM Evaluation Frameworks

LLM-as-judge, pairwise comparison, rubric-based scoring. Measure quality systematically instead of vibes-based testing.
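A minimal sketch of the rubric-and-pairwise pattern: build a judge prompt, collapse the judge's per-criterion scores into one number, and compare two candidates with a tie margin. The rubric criteria, the `margin` default, and the JSON reply shape are assumptions for illustration; the actual judge call to your LLM provider is omitted:

```python
RUBRIC = """Score the answer 1-5 on each criterion:
- faithfulness: claims are supported by the provided context
- completeness: the question is fully answered
- clarity: the answer is concise and well-organized
Return JSON like {"faithfulness": 4, "completeness": 5, "clarity": 3}."""


def build_judge_prompt(question: str, answer: str, context: str) -> str:
    """Assemble the prompt sent to the judge model (call not shown here)."""
    return (f"{RUBRIC}\n\nQuestion: {question}\n"
            f"Context: {context}\nAnswer: {answer}")


def rubric_score(judge_reply: dict) -> float:
    """Average the 1-5 criterion scores into a single 0-1 quality score."""
    return sum(judge_reply.values()) / (len(judge_reply) * 5)


def pairwise_winner(score_a: float, score_b: float, margin: float = 0.05) -> str:
    """Declare a winner only when the gap exceeds the tie margin,
    so noise in judge scores doesn't flip close comparisons."""
    if abs(score_a - score_b) < margin:
        return "tie"
    return "A" if score_a > score_b else "B"
```

In practice you also randomize A/B order between runs, since judge models show position bias.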

pytest for AI Systems

Deterministic tests, golden dataset tests, flaky test handling, fixtures. Build a test suite that catches regressions before users do.

Cost Monitoring & Optimization

Token tracking, cost per query, budget alerts, model routing for cost. Keep your AI system profitable, not just functional.
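The mechanics are simple enough to sketch: accumulate per-query spend against a budget, and route cheap queries to a cheap model. The class, the length-based routing heuristic, and the model names below are illustrative assumptions, not a library API:

```python
class CostTracker:
    """Track per-query spend and flag when a daily budget is exceeded."""

    def __init__(self, daily_budget_usd: float):
        self.daily_budget_usd = daily_budget_usd
        self.spent_usd = 0.0
        self.queries = 0

    def record(self, cost_usd: float) -> bool:
        """Add one query's cost; return True if the budget alert should fire."""
        self.spent_usd += cost_usd
        self.queries += 1
        return self.spent_usd > self.daily_budget_usd

    @property
    def cost_per_query(self) -> float:
        return self.spent_usd / self.queries if self.queries else 0.0


def route_model(question: str, threshold: int = 200) -> str:
    """Naive length-based routing: short questions go to the cheap model.
    Model names are placeholders for whatever tiers your provider offers."""
    return "cheap-model" if len(question) < threshold else "strong-model"
```

Real routing usually keys on task type or a classifier rather than raw length, but the budget-alert loop is the same: every query's cost flows through one tracker that can page you before the bill does.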

Capstone Project Review

Architecture review, code review, demo preparation, portfolio packaging. Polish your capstone into something you're proud to show employers.

Weekly Build: Full Production System

Ship your capstone: a complete production AI system with observability, evaluation dashboard, and portfolio-ready documentation.

Architecture

CAPSTONE SYSTEM
    |
    ├── Agent Layer (LangGraph supervisor + specialists)
    ├── RAG Layer (7-layer pipeline + hybrid search)
    ├── Memory Layer (Redis + PostgreSQL)
    ├── API Layer (FastAPI + auth + rate limiting)
    ├── Async Layer (Celery + webhooks)
    |
    v
OBSERVABILITY
    ├── Langfuse: trace every query
    ├── Cost Dashboard: $/query tracking
    ├── Eval Suite: P@5, MRR, LLM judge
    └── Alerting: latency + error rate
    |
    v
PORTFOLIO
    ├── README with architecture diagram
    ├── Benchmark numbers (P@5 ≥ 0.70)
    ├── Docker: one-command deploy
    └── 3-min video demo (optional)
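The P@5 and MRR numbers in the eval suite are standard retrieval metrics and fit in a few lines; a minimal sketch (your eval harness will wrap these over the golden dataset):

```python
def precision_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def mrr(ranked_lists: list, relevant_sets: list) -> float:
    """Mean reciprocal rank of the first relevant hit, averaged over queries."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1 / rank
                break
    return total / len(ranked_lists)
```

A P@5 ≥ 0.70 target means at least 3.5 of the top 5 results are relevant on average across your golden queries.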

Key Files

File                                    Purpose
observability/langfuse_setup.py         Langfuse trace configuration
observability/cost_dashboard.py         Cost tracking dashboard
tests/evaluation/test_llm_judge.py      LLM-as-judge evaluation tests
tests/evaluation/golden_dataset.json    Golden dataset for regression testing
README.md                               Portfolio writeup

Resources

Required Reading

  • Langfuse Documentation — Tracing & Evaluation
  • LangSmith Documentation — Testing & Monitoring
  • Hamel Husain — "Your AI Product Needs Evals"

Code Repository

Clone the bootcamp repo and switch to the week-08 branch:

git clone https://github.com/softbricks-academy/agentic-ai-bootcamp.git
cd agentic-ai-bootcamp
git checkout week-08

Session Recording

Recording will be available within 24 hours after the live session. Check the WhatsApp group for the link.

Homework

Final submissions — due by end of bootcamp.

  1. Complete capstone and push final code — all layers working, tests passing, deployed to cloud
  2. Record 3-minute demo video — use Loom to walk through your system architecture and live demo
  3. Write README with architecture diagram — include benchmark numbers (P@5, MRR, latency)
  4. Submit capstone for review — share repo link and deployed URL in the WhatsApp group