CORTEX

Production Enterprise Intelligence Platform

"The intelligent nervous system of your enterprise. Knows your documents. Remembers your employees. Connects to your tools. Never fails silently."

Deliverable: End of Week 4 (W4) · Build time: 2-3 weeks · Skills applied: W1-W4 · Role: portfolio centrepiece

The Problem — It's Real, It's Expensive

Enterprise employees waste 2-4 hours per week searching for information that already exists somewhere in the company. HR policies, IT runbooks, compliance docs, pricing guides — scattered across Confluence, shared drives, email threads, Notion.

Junior staff don't know who to ask. Managers make decisions on stale data. Support teams re-answer the same questions 40 times a day. And when an AI chatbot is finally bought? It's a black box with no memory, no access control, and no way to know if it's actually working.

Real cost estimate: 500-person company x 2.5 hrs/week wasted x $80/hr avg salary x 52 weeks = $5.2M/year in lost productivity.

What Cortex Solves

Cortex is a production-ready multi-agent AI platform that acts as an organisation's intelligent assistant. Employees ask questions in natural language. Cortex routes them to the right specialised agent, retrieves from the company knowledge base, remembers the user across sessions, takes actions through tools, and never returns an empty error.

It is the system you would actually build and deploy at a company — not a demo, not a notebook, not a PoC. A real production system with metrics, tests, memory, access control, and reliability engineering.

Three User Tiers — One System

Standard Employee

HR policies, IT help, onboarding docs, PTO queries, benefit information. Public + internal access tier.

Manager

Team data, budget queries, performance review docs, headcount planning. Internal + confidential access tier.

Executive

Strategic documents, financial reports, M&A briefings, board materials. Full access — all tiers.

Why This Problem? Why This Project?

Every company needs this. Every hiring manager knows this problem. When you say "I built an enterprise AI assistant with production RAG, multi-agent routing, persistent memory, and 18 automated tests" — they immediately understand the complexity and the value. This isn't a contrived exercise. This is the project that gets you the interview.

Why This is Portfolio-Worthy

Most AI portfolio projects are single-file notebooks that call an API. Cortex is an architecture. Here's exactly how it differentiates you.

Typical Portfolio Project

  • Jupyter notebook
  • One LLM call
  • No error handling
  • No tests
  • No memory
  • No metrics
  • "It worked on my machine"
  • Forgotten in 3 weeks

Cortex

  • Multi-agent LangGraph system
  • 7-layer RAG pipeline
  • Rate limiting + 3-tier fallbacks
  • 18+ automated tests
  • Redis + PostgreSQL memory
  • P@5 / MRR / NDCG metrics
  • Docker + one-command deploy
  • README with real benchmark numbers

The Interview Narrative

"Tell me about a project you're proud of."

"I built Cortex — a production-grade enterprise intelligence platform. It's a multi-agent system built on LangGraph: a supervisor agent routes queries to three specialised agents — a knowledge agent backed by a 7-layer RAG pipeline with PGVector and hybrid search, a research agent with web tool integrations, and an action agent for tickets and reports.

The reliability stack includes rate limiting with exponential backoff, 3-tier fallback chains, and per-query cost tracking with budget enforcement. The memory layer uses Redis for session context and PostgreSQL for long-term entity storage — so the system remembers what a user told it last week. I shipped an evaluation dashboard that prints Precision@5 and MRR on every test run. The test suite has 18 automated tests. It runs on Docker with one command."

That answer gets you to the technical round. Every time.

What Hiring Managers See

Signal | What It Proves
7-layer RAG pipeline | You understand production retrieval, not just "call OpenAI and pass the docs"
Multi-agent LangGraph | You understand orchestration and state machines, not just single-agent demos
Rate limiting + fallbacks | You've thought about failure modes — this separates seniors from juniors
Retrieval metrics (P@5, MRR) | You measure quality, not just "it looks right"
18 automated tests | You write tests — most AI engineers don't
Redis + PostgreSQL memory | You understand persistence patterns, not just in-memory state
Docker + .env.example | You understand deployment and security basics
Versioned prompts (YAML) | You treat prompts as code — a rare and valued skill
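The versioned-prompt signal in that last row is cheap to demonstrate. A hypothetical prompts/supervisor/v1.0.0.yaml, where every field name is an illustrative choice rather than a fixed schema:

```yaml
# prompts/supervisor/v1.0.0.yaml — illustrative structure, not a fixed schema
version: "1.0.0"
changelog: "Initial supervisor routing prompt"
model: gpt-4o-mini
temperature: 0.0
template: |
  You are the Cortex supervisor. Classify the user query into exactly one of:
  knowledge (internal docs), research (external web), action (tickets/reports).
  User tier: {user_tier}
  Query: {query}
  Respond with only the route name.
```

Loading a prompt then becomes a plain file read keyed by version, which is what makes A/B testing two variants (v1.0.0 vs v1.1.0) trivial.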

The Differentiator

Every week of this bootcamp teaches a concept in isolation. Cortex is the proof that you can combine them into a coherent system. The integration — making the RAG pipeline feed the knowledge agent which feeds the supervisor which feeds the memory layer — is harder than any individual component. Hiring managers know this.

System Architecture

Cortex has four layers: the Orchestration layer (Week 1/2), the Intelligence layer (Week 3), the Tools layer (Week 4), and the Infrastructure layer (Weeks 2/3/4).

Full System — User Query to Response

User Query + user_id + access_tier (standard / manager / exec)
                            |
                            v
        SUPERVISOR AGENT — LangGraph state machine (W1 + W2)
  Versioned prompts · Cost tracking · Graceful degradation · VIP routing
        /                        |                         \
KNOWLEDGE AGENT (W3)      RESEARCH AGENT (W4)       ACTION AGENT (W4)
7-layer RAG               Web search                Ticket / calendar
PGVector + hybrid search  Rate limiting + backoff   Report generation
RBAC access control       3-tier fallbacks          Budget enforcement
                            |
                            v
MEMORY LAYER (W3)                        OBSERVABILITY (W2 + W3)
Redis: sliding-window session            Structured logging · Cost per query
(last 6 turns)                           P@5 / MRR / NDCG dashboard
PostgreSQL: entity store (cross-session)

Data Flow — One Query

Step | What Happens | Component
1 | User sends "What's our parental leave policy?" with user_id=emp_123 | Entry point
2 | Supervisor classifies intent -> "internal knowledge lookup" | Supervisor LangGraph
3 | Supervisor checks access tier -> "standard employee" | Supervisor + RBAC
4 | Routes to Knowledge Agent | Supervisor conditional edge
5 | Query expanded: "parental leave" -> "maternity leave / paternity leave / family leave policy" | Layer 5 (query understanding)
6 | Hybrid search: vector + BM25 + RRF -> top 5 docs retrieved | Layer 7 (hybrid search)
7 | Access filter: confidential docs removed for standard tier | Layer 6 (RBAC)
8 | Entity memory loaded: "Riya — HR enquiry history" (from PostgreSQL) | Memory layer
9 | LLM generates grounded answer from retrieved docs + entity context | Knowledge Agent LLM
10 | Response logged with cost + P@5 score; session saved to Redis | Observability + Memory
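Step 6's reciprocal rank fusion is worth seeing concretely. A minimal sketch of fusing the vector and BM25 rankings (function and variable names are illustrative):

```python
def rrf_fuse(vector_ranked, bm25_ranked, k=60, top_n=5):
    """Reciprocal Rank Fusion: combine two ranked lists of doc IDs.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# A doc ranked well by both retrievers beats a doc ranked first by only one:
fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"])
```

Note how "d2" (ranked 2nd and 1st) outscores "d1" (ranked 1st and 3rd): agreement between retrievers is rewarded, which is exactly why RRF improves over either list alone.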

Skills Mapping: Weeks 1-4 -> Cortex

Every major concept from every week of Phase 1 is represented in the system. This is what makes Cortex a genuine integration project, not a standalone exercise.

Week | Concept Taught | Where It Lives in Cortex
W1 | LangGraph state machine | Supervisor agent — the routing graph with conditional edges
W1 | @tool decorator | All 5 tools use @tool with proper schemas
W1 | Conditional routing (VIP/Standard) | 3-tier access routing: standard / manager / exec
W1 | AgentExecutor -> LangGraph upgrade | Full LangGraph graph replaces AgentExecutor
W2 | Prompt versioning (YAML) | prompts/supervisor/v1.0.0.yaml — every agent prompt versioned
W2 | Structured logging + cost tracking | observability/logger.py — logs every query with token cost
W2 | Supervisor vs peer-to-peer multi-agent | Supervisor -> Knowledge/Research/Action pattern
W2 | Graceful degradation | If Knowledge Agent fails -> Research Agent fallback
W2 | Prompt injection defense | Input sanitisation before passing to supervisor
W3 | 7-Layer Enterprise RAG (all 7) | Full pipeline in rag/ folder — L1 through L7
W3 | PGVector + hybrid search (RRF) | Layer 4 storage + Layer 7 BM25+vector+RRF
W3 | Redis session memory | Sliding window (last 6 turns) per user session
W3 | PostgreSQL entity store | Long-term facts that survive server restarts
W3 | Retrieval metrics (P@5, MRR, NDCG) | Evaluation dashboard prints on every test run
W3 | RBAC access control | Layer 6 — filters docs by user tier
W4 | Tool schema design | All 5 tools — proper descriptions, types, error feedback
W4 | Rate limiting + exponential backoff | reliability/rate_limiter.py — wraps all external tool calls
W4 | 3-tier fallback chains | reliability/fallback.py — primary -> backup -> default response
W4 | Tool cost tracking + budgets | reliability/cost_tracker.py — per-query budget enforcement
W4 | Unit + integration tests | tests/unit/ and tests/integration/ — 18+ tests
W4 | LLM judge evaluation | tests/evaluation/test_llm_judge.py

The Integration Is the Hard Part

Any individual component above is a 2-hour exercise from one session. Making them work together — the LangGraph supervisor calling the knowledge agent which runs the RAG pipeline which checks RBAC which loads entity context from PostgreSQL and session history from Redis — that is system design. That is what this project proves you can do.

Core Components

Supervisor Agent

LangGraph state machine. Receives every query. Classifies intent. Checks user tier. Routes to one of three agents. Handles graceful degradation if sub-agent fails. Versioned prompts.

W1 W2
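The routing core of the supervisor is one pure function that LangGraph attaches as a conditional edge. A minimal sketch, where the standard-tier rule is an invented example policy, not a requirement of the brief:

```python
# Sketch of the supervisor's routing decision. In LangGraph this function
# would be registered via graph.add_conditional_edges("supervisor", route),
# so each return value names the next node.
ROUTES = {"knowledge", "research", "action"}

def route(state: dict) -> str:
    """Return the next node name from classified intent + user tier."""
    intent = state.get("intent", "unknown")
    tier = state.get("access_tier", "standard")
    if intent not in ROUTES:
        return "fallback"      # graceful degradation, never an exception
    if intent == "action" and tier == "standard":
        return "knowledge"     # illustrative policy: standard tier gets no write actions
    return intent
```

Keeping the router a pure dict-in, string-out function is what makes the supervisor unit-testable without an LLM in the loop.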

Knowledge Agent

Specialist for internal documents. Runs the full 7-layer RAG pipeline. Query understanding -> hybrid search -> RBAC filter -> grounded LLM answer. The core intelligence layer.

W3 (all 7 layers)

Research Agent

Handles queries requiring external information. Web search with DuckDuckGo/SerpAPI. Rate limiting. 3-tier fallback. Returns synthesised external research.

W4

Action Agent

Ticket creation, calendar queries, report generation. Validated tool schemas. Budget-aware. Every action logged. Error feedback to LLM when tools fail.

W4

Memory Layer

Redis: sliding window of last 6 exchanges (session). PostgreSQL: entity store for facts that survive restarts (name, order IDs, preferences). Two-tier architecture.

W3
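The Redis sliding window is just an LPUSH followed by an LTRIM. A sketch, with the client injectable so the same code runs against redis-py or an in-memory fake in tests:

```python
import json

class SessionMemory:
    """Sliding-window session memory: keep only the last `window` turns.

    `client` is anything with Redis-style lpush/ltrim/lrange; a redis-py
    Redis instance fits, and a list-backed fake works for unit tests.
    """
    def __init__(self, client, window=6):
        self.client = client
        self.window = window

    def add_turn(self, session_id, role, text):
        key = f"session:{session_id}"
        self.client.lpush(key, json.dumps({"role": role, "text": text}))
        # LTRIM keeps indices 0..window-1, silently discarding older turns.
        self.client.ltrim(key, 0, self.window - 1)

    def history(self, session_id):
        raw = self.client.lrange(f"session:{session_id}", 0, self.window - 1)
        return [json.loads(t) for t in reversed(raw)]  # oldest first
```

Because LTRIM runs on every write, memory use per session is bounded no matter how long the conversation runs.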

Observability

Structured logging per query: user_id, agent_used, tokens, cost, latency. Retrieval dashboard: P@5, MRR, NDCG printed on test run. Cost budget enforcement.

W2 W3
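A sketch of the per-query record that observability/logger.py might emit; the field names mirror the list above and are otherwise an assumption:

```python
import json
import logging
import time

logger = logging.getLogger("cortex")

def log_query(user_id, agent_used, tokens, cost_usd, latency_ms):
    """Emit one structured JSON record per query.

    One JSON object per line keeps logs greppable and lets the metrics
    dashboard aggregate cost per user or per agent with no parsing pain.
    """
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "agent": agent_used,
        "tokens": tokens,
        "cost_usd": round(cost_usd, 6),
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(record))
    return record
```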

The 5 Tools

Each tool demonstrates a different Week 4 production pattern. Together they form the "tool suite" — analogous to the Week 4 research assistant assignment, but purpose-built for Cortex.

Tool | What It Does | W4 Pattern | Failure Mode
knowledge_base_search | Searches internal PGVector RAG. Returns top-K docs with relevance scores. | Proper tool schema + error feedback to LLM | Returns "no relevant docs found" — an LLM-recoverable message
web_search | DuckDuckGo (free) with SerpAPI fallback. Returns top 5 results with snippets. | Rate limiting + exponential backoff | Primary fails -> SerpAPI fallback -> cached last result
create_support_ticket | Creates a Jira-like ticket. Returns ticket ID and estimated response time. | Input validation + 3-tier fallback | API down -> in-memory queue -> returns queue ID
get_team_calendar | Returns availability for a team or person (mock data with realistic schedule). | Auth pattern + graceful failure | Returns "calendar unavailable, try again in 5 minutes"
generate_report | Formats collected information into a structured report with sections. | Token budget enforcement | If budget exceeded -> returns summary only, not full report

Why These 5 Specifically?

  • knowledge_base_search — connects Week 4 tool design directly to the Week 3 RAG pipeline. The integration between layers is what makes this Cortex, not just a research assistant.
  • web_search — the most failure-prone real-world tool. Rate limits, costs, flaky APIs. Perfect vehicle for teaching rate limiting and fallbacks.
  • create_support_ticket — represents write operations. Different failure contract than read operations. Tests idempotency thinking.
  • get_team_calendar — authentication and token-based auth pattern. Shows secrets management in action.
  • generate_report — output tool. Tests budget enforcement — some queries should produce full reports, others a summary if tokens are expensive.

Folder Structure

Clean, modular, production-aligned. Every folder has a single responsibility. The structure itself communicates that you understand software architecture.

cortex/
├── README.md                    # Portfolio write-up with metrics + architecture diagram
├── docker-compose.yml           # PGVector + Redis — starts with one command
├── requirements.txt
├── .env.example                 # Never commit real keys
├── main.py                      # Entry point: start Cortex
│
├── agents/
│   ├── supervisor.py            # LangGraph state machine (W1 + W2)
│   ├── knowledge_agent.py       # RAG specialist — calls rag/ pipeline (W3)
│   ├── research_agent.py        # Web research + rate limiting (W4)
│   └── action_agent.py          # Tools: ticket, calendar, report (W4)
│
├── rag/
│   ├── pipeline.py              # Orchestrates all 7 layers
│   ├── ingestion/               # L1: PDF/docx/txt processing
│   │   └── document_loader.py
│   ├── chunking.py              # L2: semantic chunking
│   ├── embeddings.py            # L3: OpenAI / HuggingFace
│   ├── vector_store.py          # L4: PGVector CRUD
│   ├── query_understanding.py   # L5: reformulation + expansion + intent
│   ├── access_control.py        # L6: RBAC tier filter
│   ├── hybrid_search.py         # L7: BM25 + vector + RRF
│   └── evaluation.py            # P@5 / MRR / NDCG dashboard
│
├── memory/
│   ├── session_memory.py        # Redis sliding window (W3)
│   └── entity_store.py          # PostgreSQL long-term entities (W3)
│
├── tools/
│   ├── web_search.py            # DuckDuckGo + SerpAPI fallback (W4)
│   ├── ticketing.py             # Ticket creation with validation (W4)
│   ├── calendar.py              # Calendar with auth pattern (W4)
│   └── report_generator.py      # Budget-enforced report tool (W4)
│
├── reliability/
│   ├── rate_limiter.py          # Token bucket + exponential backoff (W4)
│   ├── fallback.py              # 3-tier fallback chains (W4)
│   └── cost_tracker.py          # Per-query budget enforcement (W4)
│
├── prompts/
│   ├── supervisor/
│   │   ├── v1.0.0.yaml          # Version 1 (W2)
│   │   └── v1.1.0.yaml          # A/B test variant
│   └── agents/
│       ├── knowledge_agent/v1.0.0.yaml
│       └── research_agent/v1.0.0.yaml
│
├── observability/
│   ├── logger.py                # Structured logging: user, agent, cost (W2)
│   └── metrics.py               # Aggregated cost + quality dashboard
│
├── data/
│   └── sample_knowledge_base/   # 10 sample docs: HR, IT, policy, finance
│
└── tests/
    ├── unit/
    │   ├── test_tools.py        # Tool schema + error handling (W4)
    │   ├── test_rate_limiter.py
    │   ├── test_fallback.py
    │   ├── test_cost_tracker.py
    │   └── test_rag_pipeline.py
    ├── integration/
    │   ├── test_supervisor_routing.py
    │   ├── test_knowledge_agent.py
    │   └── test_full_pipeline.py
    └── evaluation/
        └── test_llm_judge.py    # LLM-as-judge (W4)

Milestone 1 — Core System

Target: End of Week 4 | "It works."

The foundational plumbing. Supervisor routes. Knowledge Agent retrieves. You can have a multi-turn conversation that returns grounded answers from your document set. Memory holds context for the session. Logging tracks costs.

M1 Checklist

  • Docker Compose running: PGVector + Redis both healthy
  • Sample knowledge base loaded: 10 documents ingested into PGVector
  • Supervisor agent implemented: LangGraph graph with 3 nodes (supervisor, knowledge, fallback)
  • Knowledge Agent working: runs full 7-layer RAG pipeline end-to-end
  • Basic memory: Redis session sliding window stores last 6 messages
  • 2 tools functional: knowledge_base_search + web_search
  • Structured logging: every query logs user_id, agent, tokens, cost
  • Can answer: "What's the PTO policy?" (internal) and "What is RAG?" (web)
  • 3 unit tests passing for core components

Common Mistakes to Avoid

  • Don't skip the Docker setup — PGVector must be running or RAG doesn't work. Run docker-compose up -d as your first step.
  • Don't make the supervisor too complex — start with a simple 3-state graph (route to knowledge, route to research, or say "I don't know"). You can add more states in M2.
  • Don't try to do M2 in M1 — P@5 measurement, all 5 tools, and PostgreSQL entity memory are M2. Get the core working first.

Milestone 2 — Full Production Stack

Target: 1 Week After M1 | "It doesn't break."

The full reliability and memory stack. All 5 tools. Rate limiting wrapping every external call. 3-tier fallbacks. PostgreSQL entity memory persisting across sessions. Versioned prompts. Budget enforcement. The system survives realistic failure scenarios.
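The 3-tier fallback chain can be sketched as a small wrapper; flaky_primary and backup_search below are stand-ins for the live search API and its cheaper mirror:

```python
def with_fallbacks(primary, backup, default):
    """3-tier call chain: try primary, then backup, then a canned default.

    The point of tier 3 is that the caller always gets a usable string,
    never an unhandled exception.
    """
    def call(*args, **kwargs):
        for fn in (primary, backup):
            try:
                return fn(*args, **kwargs)
            except Exception:
                continue
        return default
    return call

def flaky_primary(query):        # stand-in for a rate-limited live API
    raise TimeoutError("429: rate limited")

def backup_search(query):        # stand-in for the backup provider
    return f"backup results for {query!r}"

search = with_fallbacks(flaky_primary, backup_search,
                        default="Search is temporarily unavailable, please retry.")
```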

M2 Checklist

  • All 5 tools implemented with proper @tool schemas
  • Rate limiter wrapping web_search and external API calls
  • 3-tier fallback chain: primary -> backup -> graceful default
  • PostgreSQL entity store: entities extracted and persisted across sessions
  • Versioned prompts: supervisor and all agent prompts in YAML files
  • Token budget enforcement: per-query limit, stops before going over
  • Research Agent and Action Agent fully functional
  • Graceful degradation: if Knowledge Agent fails, system doesn't crash
  • 10+ unit tests passing (tools, rate limiter, fallback, cost tracker)
  • Server restart test: session restored from Redis, entities from PostgreSQL
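The rate-limiting item in the checklist is mostly an exponential-backoff wrapper. A minimal sketch, with the sleep function injectable so tests can run instantly:

```python
import random
import time

def call_with_backoff(fn, max_retries=4, base_delay=0.5, sleep=time.sleep):
    """Retry a flaky external call with exponential backoff plus jitter.

    Delay doubles each attempt (0.5s, 1s, 2s, ...); the random jitter keeps
    concurrent clients from retrying in lockstep against the same API.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the real error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In Cortex this wrapper would sit inside reliability/rate_limiter.py around every web_search call; the fallback chain then catches whatever the final raise surfaces.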

M2 Definition of Done Test

Run this scenario to verify M2 is complete:

  1. Start a conversation as "standard employee Riya, order ORD-789"
  2. Kill the process (Ctrl+C)
  3. Restart the process with the same session_id
  4. Ask "What was my order number?" — it should answer ORD-789
  5. Search for something external with web_search — should not crash even if the primary API is rate-limited
  6. Ask for a report that would exceed your token budget — should return summary instead
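The restart behaviour in steps 2 to 4 hinges on the entity store writing through to disk. A minimal stand-in, using sqlite3 in place of PostgreSQL so the pattern can be run without Docker:

```python
import sqlite3

class EntityStore:
    """Minimal long-term entity store. Cortex uses PostgreSQL; sqlite3 is a
    stand-in here so restart survival can be demonstrated in isolation."""
    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entities "
            "(user_id TEXT, key TEXT, value TEXT, PRIMARY KEY (user_id, key))"
        )

    def remember(self, user_id, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO entities VALUES (?, ?, ?)",
            (user_id, key, value),
        )
        self.db.commit()  # commit on every write so a kill loses nothing

    def recall(self, user_id, key):
        row = self.db.execute(
            "SELECT value FROM entities WHERE user_id=? AND key=?",
            (user_id, key),
        ).fetchone()
        return row[0] if row else None
```

Opening a second EntityStore on the same path after killing the first returns ORD-789, which is exactly what step 4 of the scenario checks.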

Milestone 3 — Portfolio Ready

Target: Before Week 5 | "I can defend every decision."

The evaluation layer, the full test suite, and the portfolio packaging. P@5 and MRR numbers on a test run. LLM judge evaluation scoring response quality. A clean README with real benchmark numbers. An architecture diagram. The project is ready to show to a hiring manager.

M3 Checklist

  • Retrieval evaluation: 20-query golden dataset with labelled relevant docs
  • P@5 >= 0.70 achieved (if below 0.70 — fix chunking or query expansion first)
  • MRR >= 0.65 achieved
  • Evaluation dashboard prints on pytest tests/evaluation/ run
  • LLM judge test: 5 queries evaluated for faithfulness and relevance
  • Full test suite: 18+ tests, all passing
  • Architecture diagram in README (draw.io or mermaid)
  • README sections: Problem, Solution, Architecture, How to Run, Benchmark Numbers
  • Docker: docker-compose up -d && python main.py starts the full system
  • .env.example with all required variables documented
  • Video demo (optional stretch): 3-minute Loom showing a full multi-turn conversation

What Makes M3 Different

M1 proves it works. M2 proves it's reliable. M3 proves you can measure it and communicate it. A hiring manager won't run your code. They'll read your README and look at your numbers. M3 is where you package the work into something that communicates its value without a demo.

Success Metrics

These are the numbers your README should contain and your test suite should produce automatically.

Retrieval Quality

Metric | Target | Notes
Precision@5 | >= 0.70 | Minimum pass gate
MRR | >= 0.65 | First relevant result ranking
NDCG@5 | >= 0.70 | Ranking quality
P@5 (stretch) | 0.84 | 7-layer pipeline with hybrid search
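All three retrieval metrics are small functions. A sketch of how rag/evaluation.py might compute them over one query's ranked results (function names are assumptions):

```python
import math

def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc; 0 if none retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k=5):
    """Binary-relevance NDCG: discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

Averaging these over the 20-query golden dataset gives the numbers the dashboard prints, and the numbers the README reports.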

Reliability Targets

Target | Value
Query success rate | >= 95% (no unhandled exceptions)
Cost per query | <= $0.05 max (budget enforcement)
Automated tests | 18+, all passing
Fallback depth | 3-tier, for all external calls
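Budget enforcement only works if the affordability check happens before the model call, not after. A sketch of the pattern behind reliability/cost_tracker.py, with illustrative prices:

```python
class CostTracker:
    """Per-query budget enforcement. Prices are illustrative assumptions,
    expressed in USD per 1K tokens."""
    def __init__(self, budget_usd=0.05, price_per_1k=0.002):
        self.budget = budget_usd
        self.price = price_per_1k
        self.spent = 0.0

    def charge(self, tokens):
        """Record the cost of a completed call."""
        self.spent += (tokens / 1000) * self.price

    def can_afford(self, tokens):
        """Check *before* the call: stop before going over, not after."""
        return self.spent + (tokens / 1000) * self.price <= self.budget
```

This is the check generate_report consults: if the full report would blow the remaining budget, it falls back to a summary instead.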

Memory Correctness

Test Scenario | Expected Result
Server restart mid-session | Entity memory restored from PostgreSQL
Turn 8: "What was my order number?" (told in turn 1) | Correct entity retrieved
New session, same user | Long-term entities loaded from PostgreSQL
Standard employee queries financial report | Access denied — RBAC filters the doc

Grading Rubric

Maximum: 100 points. Portfolio-worthy threshold: 75 points. Distinction: 90 points.

Category | Max | Excellent (full) | Good (70%) | Acceptable (50%)
Architecture (W1/W2) | 15 | LangGraph supervisor working, all 3 agents routing correctly, conditional edges for all 3 user tiers | Supervisor routes to 2 agents, basic conditional routing | Simple if/else routing without LangGraph
RAG Pipeline (W3) | 20 | All 7 layers, hybrid search, P@5 >= 0.70 demonstrated | 5+ layers, basic vector search, P@5 >= 0.55 | 3 layers (chunk, embed, retrieve), no metrics
Memory System (W3) | 15 | Redis + PostgreSQL both working, server restart test passes, entity extraction demonstrated | Redis session working, no PostgreSQL | In-memory only — buffer memory
Tools & Reliability (W4) | 20 | All 5 tools, rate limiting on external calls, 3-tier fallbacks, budget enforcement all working | 3+ tools, rate limiting working, no fallbacks | 2 tools, no reliability patterns
Test Suite (W4) | 15 | 18+ tests passing, unit + integration + LLM judge evaluation | 10+ tests, unit only | 5+ tests, mostly smoke tests
Observability (W2/W3) | 5 | Cost per query logged, P@5/MRR printed on eval run | Cost logged, no retrieval metrics | Basic print logging only
Portfolio Packaging | 10 | Docker works, README has metrics, architecture diagram, clear run instructions | README exists, no metrics, no diagram | Code only, no documentation

Bonus Points (up to +10)

  • +3 — Prompt A/B testing framework working (two prompt variants, comparison output)
  • +3 — Video demo (3-minute Loom showing multi-turn conversation end-to-end)
  • +4 — Streamlit or FastAPI front-end for Cortex (even simple chat interface)

Getting Started — Your First 2 Hours

Follow this sequence exactly. Don't skip ahead to the interesting parts before the plumbing is working.

Hour 1 — Infrastructure & Skeleton

# Step 1: Create the project folder and git init
mkdir cortex && cd cortex && git init

# Step 2: docker-compose.yml
docker-compose up -d
docker ps   # Should show pgvector and redis both running

# Step 3: Install dependencies
pip install langchain langgraph langchain-openai langchain-community
pip install pgvector psycopg2-binary redis python-dotenv
pip install rank-bm25 pytest pytest-mock pyyaml

# Step 4: Create .env from example
cp .env.example .env
# Fill in OPENAI_API_KEY — everything else has defaults

# Step 5: Create folder skeleton
mkdir -p agents rag/ingestion memory tools reliability \
  prompts/supervisor observability tests/unit \
  tests/integration data/sample_knowledge_base
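Step 2 assumes a docker-compose.yml exists. A minimal sketch, with image tags, ports, and credentials as adjustable assumptions:

```yaml
# docker-compose.yml — minimal sketch for Step 2; image tags, ports, and
# credentials are assumptions, adjust to your environment.
services:
  pgvector:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: cortex
      POSTGRES_PASSWORD: cortex
      POSTGRES_DB: cortex
    ports:
      - "5432:5432"
  redis:
    image: redis:7
    ports:
      - "6379:6379"
```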

Hour 2 — First Working Query

# Step 6: Copy the sample documents (from Solution_Code_Snippets/data/)
# 10 documents: hr_policy.txt, it_runbook.txt, parental_leave.txt, etc.

# Step 7: Ingest documents into PGVector
python -c "from rag.pipeline import ingest_documents; \
  ingest_documents('data/sample_knowledge_base')"

# Step 8: Test a single RAG query (before supervisor)
python -c "
from rag.pipeline import query
result = query('What is the PTO policy?', user_tier='standard')
print(result)
"

# Step 9: Start the supervisor (basic version)
python main.py
# Should respond to: 'What is our parental leave policy?'
# Should respond to: 'What is machine learning?' (routes to web search)

First Milestone Check

You should be able to run two queries:

  1. "What is the PTO policy?" -> routed to Knowledge Agent -> retrieves from PGVector -> grounded answer
  2. "What is a large language model?" -> routed to Research Agent -> returns web search result

If both of these work — your foundation is solid. Proceed to M2 components.

Implementation Order Matters

Build in this order: Infrastructure (Docker) -> RAG pipeline -> Knowledge Agent -> Supervisor (basic) -> Memory -> Remaining tools -> Reliability stack -> Tests -> Evaluation. Resist the urge to build the supervisor first. The supervisor is only as good as the agents it coordinates.