CORTEX
"The intelligent nervous system of your enterprise. Knows your documents. Remembers your employees. Connects to your tools. Never fails silently."
The Problem — It's Real, It's Expensive
Enterprise employees waste 2-4 hours per week searching for information that already exists somewhere in the company. HR policies, IT runbooks, compliance docs, pricing guides — scattered across Confluence, shared drives, email threads, Notion.
Junior staff don't know who to ask. Managers make decisions on stale data. Support teams re-answer the same questions 40 times a day. And when an AI chatbot is finally bought? It's a black box with no memory, no access control, and no way to know if it's actually working.
Real cost estimate: 500-person company x 2.5 hrs/week wasted x $80/hr average salary x 52 weeks = $5.2M/year in lost productivity.
What Cortex Solves
Cortex is a production-ready multi-agent AI platform that acts as an organisation's intelligent assistant. Employees ask questions in natural language. Cortex routes them to the right specialised agent, retrieves from the company knowledge base, remembers the user across sessions, takes actions through tools, and never returns an empty error.
It is the system you would actually build and deploy at a company — not a demo, not a notebook, not a PoC. A real production system with metrics, tests, memory, access control, and reliability engineering.
Three User Tiers — One System
Standard Employee
HR policies, IT help, onboarding docs, PTO queries, benefit information. Public + internal access tier.
Manager
Team data, budget queries, performance review docs, headcount planning. Internal + confidential access tier.
Executive
Strategic documents, financial reports, M&A briefings, board materials. Full access — all tiers.
Why This Problem? Why This Project?
Every company needs this. Every hiring manager knows this problem. When you say "I built an enterprise AI assistant with production RAG, multi-agent routing, persistent memory, and 18 automated tests" — they immediately understand the complexity and the value. This isn't a contrived exercise. This is the project that gets you the interview.
Why This is Portfolio-Worthy
Most AI portfolio projects are single-file notebooks that call an API. Cortex is an architecture. Here's exactly how it differentiates you.
Typical Portfolio Project
- Jupyter notebook
- One LLM call
- No error handling
- No tests
- No memory
- No metrics
- "It worked on my machine"
- Forgotten in 3 weeks
Cortex
- Multi-agent LangGraph system
- 7-layer RAG pipeline
- Rate limiting + 3-tier fallbacks
- 18+ automated tests
- Redis + PostgreSQL memory
- P@5 / MRR / NDCG metrics
- Docker + one-command deploy
- README with real benchmark numbers
The Interview Narrative
"Tell me about a project you're proud of."
"I built Cortex — a production-grade enterprise intelligence platform. It's a multi-agent system built on LangGraph: a supervisor agent routes queries to three specialised agents — a knowledge agent backed by a 7-layer RAG pipeline with PGVector and hybrid search, a research agent with web tool integrations, and an action agent for tickets and reports.
The reliability stack includes rate limiting with exponential backoff, 3-tier fallback chains, and per-query cost tracking with budget enforcement. The memory layer uses Redis for session context and PostgreSQL for long-term entity storage — so the system remembers what a user told it last week. I shipped an evaluation dashboard that prints Precision@5 and MRR on every test run. The test suite has 18 automated tests. It runs on Docker with one command."
That answer gets you to the technical round. Every time.
What Hiring Managers See
| Signal | What It Proves |
|---|---|
| 7-layer RAG pipeline | You understand production retrieval, not just "call OpenAI and pass the docs" |
| Multi-agent LangGraph | You understand orchestration and state machines, not just single-agent demos |
| Rate limiting + fallbacks | You've thought about failure modes — this separates seniors from juniors |
| Retrieval metrics (P@5, MRR) | You measure quality, not just "it looks right" |
| 18 automated tests | You write tests — most AI engineers don't |
| Redis + PostgreSQL memory | You understand persistence patterns, not just in-memory state |
| Docker + .env.example | You understand deployment and security basics |
| Versioned prompts (YAML) | You treat prompts as code — a rare and valued skill |
The Differentiator
Every week of this bootcamp teaches a concept in isolation. Cortex is the proof that you can combine them into a coherent system. The integration — making the RAG pipeline feed the knowledge agent which feeds the supervisor which feeds the memory layer — is harder than any individual component. Hiring managers know this.
System Architecture
Cortex has four layers: the Orchestration layer (Week 1/2), the Intelligence layer (Week 3), the Tools layer (Week 4), and the Infrastructure layer (Weeks 2/3/4).
Full System — User Query to Response

[Architecture diagram — key components shown:]
- PGVector + hybrid search, RBAC access control (W3)
- Rate limiting + backoff, 3-tier fallbacks (W4)
- Report generation, budget enforcement (W4)
- PostgreSQL: entity store, cross-session (W3)
- P@5 / MRR / NDCG dashboard (W2 + W3)
Data Flow — One Query
| Step | What Happens | Component |
|---|---|---|
| 1 | User sends "What's our parental leave policy?" with user_id=emp_123 | Entry point |
| 2 | Supervisor classifies intent -> "internal knowledge lookup" | Supervisor LangGraph |
| 3 | Supervisor checks access tier -> "standard employee" | Supervisor + RBAC |
| 4 | Routes to Knowledge Agent | Supervisor conditional edge |
| 5 | Query expanded: "parental leave" -> "maternity leave / paternity leave / family leave policy" | Layer 5 (query understanding) |
| 6 | Hybrid search: vector + BM25 + RRF -> top 5 docs retrieved | Layer 7 (hybrid search) |
| 7 | Access filter: confidential docs removed for standard tier | Layer 6 (RBAC) |
| 8 | Entity memory loaded: "Riya — HR enquiry history" (from PostgreSQL) | Memory layer |
| 9 | LLM generates grounded answer from retrieved docs + entity context | Knowledge Agent LLM |
| 10 | Response logged with cost + P@5 score. Session saved to Redis | Observability + Memory |
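Step 6's rank fusion is the part most write-ups hand-wave. A minimal sketch of Reciprocal Rank Fusion over two ranked lists (the doc IDs are made up for illustration):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine ranked lists of doc IDs.

    Each doc scores sum(1 / (k + rank)) across the lists it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: "doc_parental" is only rank 2 in each list,
# but because it appears in BOTH lists it tops the fused ranking.
vector_hits = ["doc_benefits", "doc_parental", "doc_it_runbook"]
bm25_hits = ["doc_pto", "doc_parental", "doc_hr_policy"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

This is why hybrid search beats either retriever alone: agreement between vector and keyword search is a stronger relevance signal than a high rank in one.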
Skills Mapping: Weeks 1-4 -> Cortex
Every major concept from every week of Phase 1 is represented in the system. This is what makes Cortex a genuine integration project, not a standalone exercise.
| Week | Concept Taught | Where It Lives in Cortex |
|---|---|---|
| W1 | LangGraph state machine | Supervisor agent — the routing graph with conditional edges |
| W1 | @tool decorator | All 5 tools use @tool with proper schemas |
| W1 | Conditional routing (VIP/Standard) | 3-tier access routing: standard / manager / exec |
| W1 | AgentExecutor -> LangGraph upgrade | Full LangGraph graph replaces AgentExecutor |
| W2 | Prompt versioning (YAML) | prompts/supervisor/v1.0.0.yaml — every agent prompt versioned |
| W2 | Structured logging + cost tracking | observability/logger.py — logs every query with token cost |
| W2 | Supervisor vs peer-to-peer multi-agent | Supervisor -> Knowledge/Research/Action pattern |
| W2 | Graceful degradation | If Knowledge Agent fails -> Research Agent fallback |
| W2 | Prompt injection defense | Input sanitisation before passing to supervisor |
| W3 | 7-Layer Enterprise RAG (all 7) | Full pipeline in rag/ folder — L1 through L7 |
| W3 | PGVector + hybrid search (RRF) | Layer 4 storage + Layer 7 BM25+vector+RRF |
| W3 | Redis session memory | Sliding window (last 6 turns) per user session |
| W3 | PostgreSQL entity store | Long-term facts that survive server restarts |
| W3 | Retrieval metrics (P@5, MRR, NDCG) | Evaluation dashboard prints on every test run |
| W3 | RBAC access control | Layer 6 — filters docs by user tier |
| W4 | Tool schema design | All 5 tools — proper descriptions, types, error feedback |
| W4 | Rate limiting + exponential backoff | reliability/rate_limiter.py — wraps all external tool calls |
| W4 | 3-tier fallback chains | reliability/fallback.py — primary -> backup -> default response |
| W4 | Tool cost tracking + budgets | reliability/cost_tracker.py — per-query budget enforcement |
| W4 | Unit + integration tests | tests/unit/ and tests/integration/ — 18+ tests |
| W4 | LLM judge evaluation | tests/evaluation/test_llm_judge.py |
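The "prompts as code" row above implies a concrete file format. A sketch of what prompts/supervisor/v1.0.0.yaml could contain; the field names here are an assumed schema, not a fixed standard:

```yaml
# prompts/supervisor/v1.0.0.yaml (illustrative schema)
version: 1.0.0
model: gpt-4o-mini          # assumed default; override per deployment
temperature: 0.0
system: |
  You are the Cortex supervisor. Classify the user's query as one of:
  knowledge, research, action. Respond with the label only.
changelog: |
  1.0.0 - initial supervisor routing prompt
```

The payoff is diffable prompt history: when routing accuracy drops, you can bisect prompt versions the same way you bisect code.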
The Integration Is the Hard Part
Any individual component above is a 2-hour exercise from one session. Making them work together — the LangGraph supervisor calling the knowledge agent which runs the RAG pipeline which checks RBAC which loads entity context from PostgreSQL and session history from Redis — that is system design. That is what this project proves you can do.
Core Components
Supervisor Agent
LangGraph state machine. Receives every query. Classifies intent. Checks user tier. Routes to one of three agents. Handles graceful degradation if sub-agent fails. Versioned prompts.
W1 W2

Knowledge Agent
Specialist for internal documents. Runs the full 7-layer RAG pipeline. Query understanding -> hybrid search -> RBAC filter -> grounded LLM answer. The core intelligence layer.
W3 (all 7 layers)

Research Agent
Handles queries requiring external information. Web search with DuckDuckGo/SerpAPI. Rate limiting. 3-tier fallback. Returns synthesised external research.
W4

Action Agent
Ticket creation, calendar queries, report generation. Validated tool schemas. Budget-aware. Every action logged. Error feedback to LLM when tools fail.
W4

Memory Layer
Redis: sliding window of last 6 exchanges (session). PostgreSQL: entity store for facts that survive restarts (name, order IDs, preferences). Two-tier architecture.
W3

Observability
Structured logging per query: user_id, agent_used, tokens, cost, latency. Retrieval dashboard: P@5, MRR, NDCG printed on test run. Cost budget enforcement.
W2 W3

The 5 Tools
Each tool demonstrates a different Week 4 production pattern. Together they form the "tool suite" — analogous to the Week 4 research assistant assignment, but purpose-built for Cortex.
| Tool | What It Does | W4 Pattern | Failure Mode |
|---|---|---|---|
knowledge_base_search | Searches internal PGVector RAG. Returns top-K docs with relevance scores. | Proper tool schema + error feedback to LLM | Returns "no relevant docs found" — LLM-recoverable message |
web_search | DuckDuckGo (free) with SerpAPI fallback. Returns top 5 results with snippets. | Rate limiting + exponential backoff | Primary fails -> SerpAPI fallback -> cached last result |
create_support_ticket | Creates a Jira-like ticket. Returns ticket ID and estimated response time. | Input validation + 3-tier fallback | API down -> in-memory queue -> returns queue ID |
get_team_calendar | Returns availability for a team or person (mock data with realistic schedule). | Auth pattern + graceful failure | Returns "calendar unavailable, try again in 5 minutes" |
generate_report | Formats collected information into a structured report with sections. | Token budget enforcement | If budget exceeded -> returns summary only, not full report |
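The Failure Mode column is the important one: tools return LLM-recoverable strings instead of raising. A sketch of that contract for knowledge_base_search (the vector-search helper is a stub standing in for the real PGVector call, and the corpus is invented):

```python
def knowledge_base_search(query: str, k: int = 5) -> str:
    """Search the internal knowledge base; return the top-k docs as text.

    On failure, return a message the LLM can act on rather than raising;
    the agent can then rephrase the query or fall back to web_search.
    """
    try:
        results = _run_vector_search(query, k)  # assumed PGVector helper
    except ConnectionError:
        return "Knowledge base unavailable; try web_search instead."
    if not results:
        return "No relevant docs found; try rephrasing the query."
    return "\n".join(results)

def _run_vector_search(query: str, k: int):
    # Stub: a real implementation embeds the query and searches PGVector.
    corpus = {"pto": ["PTO policy: 25 days/year, accrued monthly."]}
    q = query.lower()
    return [doc for key, docs in corpus.items() if key in q for doc in docs][:k]

result = knowledge_base_search("What is the PTO policy?")
```

In the real tool you would wrap this function with LangChain's @tool decorator so the docstring and type hints become the schema the LLM sees; the return-a-message-not-an-exception contract is what makes the failure mode recoverable.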
Why These 5 Specifically?
- knowledge_base_search — connects Week 4 tool design directly to the Week 3 RAG pipeline. The integration between layers is what makes this Cortex, not just a research assistant.
- web_search — the most failure-prone real-world tool. Rate limits, costs, flaky APIs. Perfect vehicle for teaching rate limiting and fallbacks.
- create_support_ticket — represents write operations. Different failure contract than read operations. Tests idempotency thinking.
- get_team_calendar — authentication and token-based auth pattern. Shows secrets management in action.
- generate_report — output tool. Tests budget enforcement — some queries should produce full reports, others a summary if tokens are expensive.
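The rate limiting and backoff behind web_search can be sketched with a token bucket plus a retry wrapper. The numbers here are illustrative, not tuned values:

```python
import random
import time

class TokenBucket:
    """Allow `rate` calls per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def with_backoff(fn, retries=3, base=0.01):
    """Retry fn with exponential backoff plus jitter on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base * (2 ** attempt) + random.uniform(0, base))

bucket = TokenBucket(rate=2, capacity=2)
allowed = [bucket.acquire() for _ in range(4)]  # burst of 4: only 2 pass
```

In Cortex, reliability/rate_limiter.py wraps every external tool call in this pair: the bucket stops you from hammering the API, and the backoff absorbs the 429s you get anyway.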
Folder Structure
Clean, modular, production-aligned. Every folder has a single responsibility. The structure itself communicates that you understand software architecture.
cortex/
├── README.md # Portfolio write-up with metrics + architecture diagram
├── docker-compose.yml # PGVector + Redis — starts with one command
├── requirements.txt
├── .env.example # Never commit real keys
├── main.py # Entry point: start Cortex
│
├── agents/
│ ├── supervisor.py # LangGraph state machine (W1 + W2)
│ ├── knowledge_agent.py # RAG specialist — calls rag/ pipeline (W3)
│ ├── research_agent.py # Web research + rate limiting (W4)
│ └── action_agent.py # Tools: ticket, calendar, report (W4)
│
├── rag/
│ ├── pipeline.py # Orchestrates all 7 layers
│ ├── ingestion/ # L1: PDF/docx/txt processing
│ │ └── document_loader.py
│ ├── chunking.py # L2: semantic chunking
│ ├── embeddings.py # L3: OpenAI / HuggingFace
│ ├── vector_store.py # L4: PGVector CRUD
│ ├── query_understanding.py # L5: reformulation + expansion + intent
│ ├── access_control.py # L6: RBAC tier filter
│ ├── hybrid_search.py # L7: BM25 + vector + RRF
│ └── evaluation.py # P@5 / MRR / NDCG dashboard
│
├── memory/
│ ├── session_memory.py # Redis sliding window (W3)
│ └── entity_store.py # PostgreSQL long-term entities (W3)
│
├── tools/
│ ├── web_search.py # DuckDuckGo + SerpAPI fallback (W4)
│ ├── ticketing.py # Ticket creation with validation (W4)
│ ├── calendar.py # Calendar with auth pattern (W4)
│ └── report_generator.py # Budget-enforced report tool (W4)
│
├── reliability/
│ ├── rate_limiter.py # Token bucket + exponential backoff (W4)
│ ├── fallback.py # 3-tier fallback chains (W4)
│ └── cost_tracker.py # Per-query budget enforcement (W4)
│
├── prompts/
│ ├── supervisor/
│ │ ├── v1.0.0.yaml # Version 1 (W2)
│ │ └── v1.1.0.yaml # A/B test variant
│ └── agents/
│ ├── knowledge_agent/v1.0.0.yaml
│ └── research_agent/v1.0.0.yaml
│
├── observability/
│ ├── logger.py # Structured logging: user, agent, cost (W2)
│ └── metrics.py # Aggregated cost + quality dashboard
│
├── data/
│ └── sample_knowledge_base/ # 10 sample docs: HR, IT, policy, finance
│
└── tests/
├── unit/
│ ├── test_tools.py # Tool schema + error handling (W4)
│ ├── test_rate_limiter.py
│ ├── test_fallback.py
│ ├── test_cost_tracker.py
│ └── test_rag_pipeline.py
├── integration/
│ ├── test_supervisor_routing.py
│ ├── test_knowledge_agent.py
│ └── test_full_pipeline.py
└── evaluation/
└── test_llm_judge.py # LLM-as-judge (W4)
Milestone 1 — Core System
Target: End of Week 4 | "It works."
The foundational plumbing. Supervisor routes. Knowledge Agent retrieves. You can have a multi-turn conversation that returns grounded answers from your document set. Memory holds context for the session. Logging tracks costs.
M1 Checklist
- Docker Compose running: PGVector + Redis both healthy
- Sample knowledge base loaded: 10 documents ingested into PGVector
- Supervisor agent implemented: LangGraph graph with 3 nodes (supervisor, knowledge, fallback)
- Knowledge Agent working: runs full 7-layer RAG pipeline end-to-end
- Basic memory: Redis session sliding window stores last 6 messages
- 2 tools functional: knowledge_base_search + web_search
- Structured logging: every query logs user_id, agent, tokens, cost
- Can answer: "What's the PTO policy?" (internal) and "What is RAG?" (web)
- 3 unit tests passing for core components
Common Mistakes to Avoid
- Don't skip the Docker setup — PGVector must be running or RAG doesn't work. Run docker-compose up -d as your first step.
- Don't make the supervisor too complex — start with a simple 3-state graph (route to knowledge, route to research, or say "I don't know"). You can add more states in M2.
- Don't try to do M2 in M1 — P@5 measurement, all 5 tools, and PostgreSQL entity memory are M2. Get the core working first.
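A minimal docker-compose.yml for the two services might look like the following. Image tags and credentials are assumptions for local development, not a vetted production config:

```yaml
# docker-compose.yml (dev sketch)
services:
  pgvector:
    image: pgvector/pgvector:pg16   # assumed tag; pin whatever you test against
    environment:
      POSTGRES_USER: cortex
      POSTGRES_PASSWORD: cortex     # dev only; real creds belong in .env
      POSTGRES_DB: cortex
    ports: ["5432:5432"]
  redis:
    image: redis:7
    ports: ["6379:6379"]
```

One `docker-compose up -d` and both dependencies are healthy; that is the "one-command deploy" claim in your README.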
Milestone 2 — Full Production Stack
Target: 1 Week After M1 | "It doesn't break."
The full reliability and memory stack. All 5 tools. Rate limiting wrapping every external call. 3-tier fallbacks. PostgreSQL entity memory persisting across sessions. Versioned prompts. Budget enforcement. The system survives realistic failure scenarios.
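The 3-tier fallback chain is simpler than it sounds: try each backend in order, and if every tier fails, return a graceful default instead of an exception. A sketch (the backend functions are placeholders simulating the DuckDuckGo-then-SerpAPI chain):

```python
def run_with_fallbacks(query, chain, default="Sorry, I couldn't reach any search backend."):
    """Try each callable in `chain` in order; return the first success.

    If every tier raises, return a graceful default so the agent
    never crashes and never fails silently.
    """
    for tier in chain:
        try:
            return tier(query)
        except Exception:
            continue  # fall through to the next tier
    return default

def primary(q):
    raise TimeoutError("rate limited")  # simulate DuckDuckGo failing

def backup(q):
    return f"[serpapi] results for {q}"  # hypothetical backup tier

answer = run_with_fallbacks("what is RAG", [primary, backup])
```

reliability/fallback.py generalises this to primary -> backup -> cached/default, and the integration tests kill each tier in turn to prove the chain holds.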
M2 Checklist
- All 5 tools implemented with proper @tool schemas
- Rate limiter wrapping web_search and external API calls
- 3-tier fallback chain: primary -> backup -> graceful default
- PostgreSQL entity store: entities extracted and persisted across sessions
- Versioned prompts: supervisor and all agent prompts in YAML files
- Token budget enforcement: per-query limit, stops before going over
- Research Agent and Action Agent fully functional
- Graceful degradation: if Knowledge Agent fails, system doesn't crash
- 10+ unit tests passing (tools, rate limiter, fallback, cost tracker)
- Server restart test: session restored from Redis, entities from PostgreSQL
M2 Definition of Done Test
Run this scenario to verify M2 is complete:
- Start a conversation as "standard employee Riya, order ORD-789"
- Kill the process (Ctrl+C)
- Restart the process with the same session_id
- Ask "What was my order number?" — it should answer ORD-789
- Search for something external with web_search — should not crash even if the primary API is rate-limited
- Ask for a report that would exceed your token budget — should return summary instead
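What the restart steps above actually exercise is the split between session memory (lost with the process) and entity memory (persisted). A stdlib sketch of that split, with a deque standing in for Redis LPUSH + LTRIM and a dict standing in for the PostgreSQL entity table:

```python
from collections import deque

class SessionMemory:
    """Sliding window of the last N turns.

    In Redis this is LPUSH + LTRIM on a per-session list key;
    a bounded deque has the same semantics in-process.
    """
    def __init__(self, window=6):
        self.turns = deque(maxlen=window)

    def add(self, role, text):
        self.turns.append((role, text))

ENTITY_STORE = {}  # stand-in for the PostgreSQL entity table

def remember_entity(user_id, key, value):
    ENTITY_STORE.setdefault(user_id, {})[key] = value

def recall_entity(user_id, key):
    return ENTITY_STORE.get(user_id, {}).get(key)

# Turn 1: Riya mentions her order; it is extracted and persisted.
remember_entity("emp_123", "order_id", "ORD-789")

# "Process restart": the session window starts empty, but the entity
# store survives (in Cortex it is a real PostgreSQL table, so it does).
session = SessionMemory()
order = recall_entity("emp_123", "order_id")
```

The M2 restart test passes exactly when this split is real: if the order ID lived only in the conversation buffer, step 4 would fail.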
Milestone 3 — Portfolio Ready
Target: Before Week 5 | "I can defend every decision."
The evaluation layer, the full test suite, and the portfolio packaging. P@5 and MRR numbers on a test run. LLM judge evaluation scoring response quality. A clean README with real benchmark numbers. An architecture diagram. The project is ready to show to a hiring manager.
M3 Checklist
- Retrieval evaluation: 20-query golden dataset with labelled relevant docs
- P@5 >= 0.70 achieved (if below 0.70 — fix chunking or query expansion first)
- MRR >= 0.65 achieved
- Evaluation dashboard prints on pytest tests/evaluation/ run
- LLM judge test: 5 queries evaluated for faithfulness and relevance
- Full test suite: 18+ tests, all passing
- Architecture diagram in README (draw.io or mermaid)
- README sections: Problem, Solution, Architecture, How to Run, Benchmark Numbers
- Docker: docker-compose up && python main.py starts the full system
- .env.example with all required variables documented
- Video demo (optional stretch): 3-minute Loom showing a full multi-turn conversation
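P@5 and MRR are a few lines each, which is why there is no excuse for shipping RAG without them. A sketch against a hypothetical golden-dataset entry (doc labels invented for illustration):

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved docs labelled relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mrr(all_retrieved, all_relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant doc."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(all_retrieved)

# Query 1: relevant docs are A and C; two of the top 5 are relevant.
p5 = precision_at_k(["A", "B", "C", "D", "E"], {"A", "C"})  # 2/5 = 0.4

# Two queries: first relevant doc at rank 2, then rank 1 -> (0.5 + 1.0) / 2
score = mrr([["B", "A"], ["C"]], [{"A"}, {"C"}])  # 0.75
```

rag/evaluation.py runs these over the 20-query golden dataset on every test run; the numbers it prints are the ones that go in your README.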
What Makes M3 Different
M1 proves it works. M2 proves it's reliable. M3 proves you can measure it and communicate it. A hiring manager won't run your code. They'll read your README and look at your numbers. M3 is where you package the work into something that communicates its value without a demo.
Success Metrics
These are the numbers your README should contain and your test suite should produce automatically.
Retrieval Quality
Reliability Targets
Memory Correctness
| Test Scenario | Expected Result |
|---|---|
| Server restart mid-session | Entity memory restored from PostgreSQL |
| Turn 8: "What was my order number?" (told in turn 1) | Correct entity retrieved |
| New session, same user | Long-term entities loaded from PostgreSQL |
| Standard employee queries financial report | Access denied — RBAC filters doc |
Grading Rubric
Maximum: 100 points. Portfolio-worthy threshold: 75 points. Distinction: 90 points.
| Category | Max | Excellent (Full) | Good (70%) | Acceptable (50%) |
|---|---|---|---|---|
| Architecture (W1/W2) | 15 | LangGraph supervisor working, all 3 agents routing correctly, conditional edges for all 3 user tiers | Supervisor routes to 2 agents, basic conditional routing | Simple if/else routing without LangGraph |
| RAG Pipeline (W3) | 20 | All 7 layers, hybrid search, P@5 >= 0.70 demonstrated | 5+ layers, basic vector search, P@5 >= 0.55 | 3 layers (chunk, embed, retrieve), no metrics |
| Memory System (W3) | 15 | Redis + PostgreSQL both working, server restart test passes, entity extraction demonstrated | Redis session working, no PostgreSQL | In-memory only — buffer memory |
| Tools & Reliability (W4) | 20 | All 5 tools, rate limiting on external calls, 3-tier fallbacks, budget enforcement all working | 3+ tools, rate limiting working, no fallbacks | 2 tools, no reliability patterns |
| Test Suite (W4) | 15 | 18+ tests passing, unit + integration + LLM judge evaluation | 10+ tests, unit only | 5+ tests, mostly smoke tests |
| Observability (W2/W3) | 5 | Cost per query logged, P@5/MRR printed on eval run | Cost logged, no retrieval metrics | Basic print logging only |
| Portfolio Packaging | 10 | Docker works, README has metrics, architecture diagram, clear run instructions | README exists, no metrics, no diagram | Code only, no documentation |
Bonus Points (up to +10)
- +3 — Prompt A/B testing framework working (two prompt variants, comparison output)
- +3 — Video demo (3-minute Loom showing multi-turn conversation end-to-end)
- +4 — Streamlit or FastAPI front-end for Cortex (even simple chat interface)
Getting Started — Your First 2 Hours
Follow this sequence exactly. Don't skip ahead to the interesting parts before the plumbing is working.
Hour 1 — Infrastructure & Skeleton
# Step 1: Create the project folder and git init
mkdir cortex && cd cortex && git init

# Step 2: docker-compose.yml
docker-compose up -d
docker ps  # Should show pgvector and redis both running

# Step 3: Install dependencies
pip install langchain langgraph langchain-openai langchain-community
pip install pgvector psycopg2-binary redis python-dotenv
pip install rank-bm25 pytest pytest-mock pyyaml

# Step 4: Create .env from example
cp .env.example .env
# Fill in OPENAI_API_KEY — everything else has defaults

# Step 5: Create folder skeleton
mkdir -p agents rag/ingestion memory tools reliability \
  prompts/supervisor observability tests/unit \
  tests/integration data/sample_knowledge_base
Hour 2 — First Working Query
# Step 6: Copy the sample documents (from Solution_Code_Snippets/data/)
# 10 documents: hr_policy.txt, it_runbook.txt, parental_leave.txt, etc.
# Step 7: Ingest documents into PGVector
python -c "from rag.pipeline import ingest_documents; \
ingest_documents('data/sample_knowledge_base')"
# Step 8: Test a single RAG query (before supervisor)
python -c "
from rag.pipeline import query
result = query('What is the PTO policy?', user_tier='standard')
print(result)
"
# Step 9: Start the supervisor (basic version)
python main.py
# Should respond to: 'What is our parental leave policy?'
# Should respond to: 'What is machine learning?' (routes to web search)
First Milestone Check
You should be able to run two queries:
- "What is the PTO policy?" -> routed to Knowledge Agent -> retrieves from PGVector -> grounded answer
- "What is a large language model?" -> routed to Research Agent -> returns web search result
If both of these work — your foundation is solid. Proceed to M2 components.
Implementation Order Matters
Build in this order: Infrastructure (Docker) -> RAG pipeline -> Knowledge Agent -> Supervisor (basic) -> Memory -> Remaining tools -> Reliability stack -> Tests -> Evaluation. Resist the urge to build the supervisor first. The supervisor is only as good as the agents it coordinates.