CORTEX
"The intelligent nervous system of your enterprise. Knows your documents. Remembers your employees. Connects to your tools. Never fails silently."
The Problem — It's Real, It's Expensive
Enterprise employees waste 2-4 hours per week searching for information that already exists somewhere in the company. HR policies, IT runbooks, compliance docs, pricing guides — scattered across Confluence, shared drives, email threads, Notion.
Junior staff don't know who to ask. Managers make decisions on stale data. Support teams re-answer the same questions 40 times a day. And when an AI chatbot is finally bought? It's a black box with no memory, no access control, and no way to know if it's actually working.
Real cost estimate: 500-person company x 2.5 hrs/week wasted x $80/hr average salary x 52 weeks = $5.2M/year in lost productivity.
What Cortex Solves
Cortex is a production-ready multi-agent AI platform that acts as an organisation's intelligent assistant. Employees ask questions in natural language. Cortex routes them to the right specialised agent, retrieves from the company knowledge base, remembers the user across sessions, takes actions through tools, and never returns an empty error.
It is the system you would actually build and deploy at a company — not a demo, not a notebook, not a PoC. A real production system with metrics, tests, memory, access control, and reliability engineering.
Three User Tiers — One System
Standard Employee
HR policies, IT help, onboarding docs, PTO queries, benefit information. Public + internal access tier.
Manager
Team data, budget queries, performance review docs, headcount planning. Internal + confidential access tier.
Executive
Strategic documents, financial reports, M&A briefings, board materials. Full access — all tiers.
Why This Problem? Why This Project?
Every company needs this. Every hiring manager knows this problem. When you say "I built an enterprise AI assistant with production RAG, multi-agent routing, persistent memory, and 18 automated tests" — they immediately understand the complexity and the value. This isn't a contrived exercise. This is the project that gets you the interview.
Why This is Portfolio-Worthy
Most AI portfolio projects are single-file notebooks that call an API. Cortex is an architecture. Here's exactly how it differentiates you.
Typical Portfolio Project
- Jupyter notebook
- One LLM call
- No error handling
- No tests
- No memory
- No metrics
- "It worked on my machine"
- Forgotten in 3 weeks
Cortex
- Multi-agent LangGraph system
- 7-layer RAG pipeline
- Rate limiting + 3-tier fallbacks
- 18+ automated tests
- Redis + PostgreSQL memory
- P@5 / MRR / NDCG metrics
- Docker + one-command deploy
- README with real benchmark numbers
The Interview Narrative
"Tell me about a project you're proud of."
"I built Cortex — a production-grade enterprise intelligence platform. It's a multi-agent system built on LangGraph: a supervisor agent routes queries to three specialised agents — a knowledge agent backed by a 7-layer RAG pipeline with PGVector and hybrid search, a research agent with web tool integrations, and an action agent for tickets and reports.
The reliability stack includes rate limiting with exponential backoff, 3-tier fallback chains, and per-query cost tracking with budget enforcement. The memory layer uses Redis for session context and PostgreSQL for long-term entity storage — so the system remembers what a user told it last week. I shipped an evaluation dashboard that prints Precision@5 and MRR on every test run. The test suite has 18 automated tests. It runs on Docker with one command."
That answer gets you to the technical round. Every time.
What Hiring Managers See
| Signal | What It Proves |
|---|---|
| 7-layer RAG pipeline | You understand production retrieval, not just "call OpenAI and pass the docs" |
| Multi-agent LangGraph | You understand orchestration and state machines, not just single-agent demos |
| Rate limiting + fallbacks | You've thought about failure modes — this separates seniors from juniors |
| Retrieval metrics (P@5, MRR) | You measure quality, not just "it looks right" |
| 18 automated tests | You write tests — most AI engineers don't |
| Redis + PostgreSQL memory | You understand persistence patterns, not just in-memory state |
| Docker + .env.example | You understand deployment and security basics |
| Versioned prompts (YAML) | You treat prompts as code — a rare and valued skill |
The Differentiator
Every week of this bootcamp teaches a concept in isolation. Cortex is the proof that you can combine them into a coherent system. The integration — making the RAG pipeline feed the knowledge agent which feeds the supervisor which feeds the memory layer — is harder than any individual component. Hiring managers know this.
System Architecture
Cortex has four layers: the Orchestration layer (Week 1/2), the Intelligence layer (Week 3), the Tools layer (Week 4), and the Infrastructure layer (Weeks 2/3/4).
Full System — User Query to Response

[Architecture diagram — key components shown:]
- PGVector + hybrid search, RBAC access control (W3)
- Rate limiting + backoff, 3-tier fallbacks (W4)
- Report generation, budget enforcement (W4)
- PostgreSQL: entity store, cross-session (W3)
- P@5 / MRR / NDCG dashboard (W2 + W3)
Data Flow — One Query
| Step | What Happens | Component |
|---|---|---|
| 1 | User sends "What's our parental leave policy?" with user_id=emp_123 | Entry point |
| 2 | Supervisor classifies intent -> "internal knowledge lookup" | Supervisor LangGraph |
| 3 | Supervisor checks access tier -> "standard employee" | Supervisor + RBAC |
| 4 | Routes to Knowledge Agent | Supervisor conditional edge |
| 5 | Query expanded: "parental leave" -> "maternity leave / paternity leave / family leave policy" | Layer 5 (query understanding) |
| 6 | Hybrid search: vector + BM25 + RRF -> top 5 docs retrieved | Layer 7 (hybrid search) |
| 7 | Access filter: confidential docs removed for standard tier | Layer 6 (RBAC) |
| 8 | Entity memory loaded: "Riya — HR enquiry history" (from PostgreSQL) | Memory layer |
| 9 | LLM generates grounded answer from retrieved docs + entity context | Knowledge Agent LLM |
| 10 | Response logged with cost + P@5 score. Session saved to Redis | Observability + Memory |
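Step 6's rank fusion is the part most write-ups hand-wave. A minimal sketch of Reciprocal Rank Fusion over two ranked lists (the doc IDs are made up for illustration):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine ranked lists of doc IDs.

    Each doc scores sum(1 / (k + rank)) across the lists it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: "doc_parental" is only rank 2 in each list,
# but because it appears in BOTH lists it tops the fused ranking.
vector_hits = ["doc_benefits", "doc_parental", "doc_it_runbook"]
bm25_hits = ["doc_pto", "doc_parental", "doc_hr_policy"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

This is why hybrid search beats either retriever alone: agreement between vector and keyword search is a stronger relevance signal than a high rank in one.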
Skills Mapping: Weeks 1-4 -> Cortex
Every major concept from every week of Phase 1 is represented in the system. This is what makes Cortex a genuine integration project, not a standalone exercise.
| Week | Concept Taught | Where It Lives in Cortex |
|---|---|---|
| W1 | LangGraph state machine | Supervisor agent — the routing graph with conditional edges |
| W1 | @tool decorator | All 5 tools use @tool with proper schemas |
| W1 | Conditional routing (VIP/Standard) | 3-tier access routing: standard / manager / exec |
| W1 | AgentExecutor -> LangGraph upgrade | Full LangGraph graph replaces AgentExecutor |
| W2 | Prompt versioning (YAML) | prompts/supervisor/v1.0.0.yaml — every agent prompt versioned |
| W2 | Structured logging + cost tracking | observability/logger.py — logs every query with token cost |
| W2 | Supervisor vs peer-to-peer multi-agent | Supervisor -> Knowledge/Research/Action pattern |
| W2 | Graceful degradation | If Knowledge Agent fails -> Research Agent fallback |
| W2 | Prompt injection defense | Input sanitisation before passing to supervisor |
| W3 | 7-Layer Enterprise RAG (all 7) | Full pipeline in rag/ folder — L1 through L7 |
| W3 | PGVector + hybrid search (RRF) | Layer 4 storage + Layer 7 BM25+vector+RRF |
| W3 | Redis session memory | Sliding window (last 6 turns) per user session |
| W3 | PostgreSQL entity store | Long-term facts that survive server restarts |
| W3 | Retrieval metrics (P@5, MRR, NDCG) | Evaluation dashboard prints on every test run |
| W3 | RBAC access control | Layer 6 — filters docs by user tier |
| W4 | Tool schema design | All 5 tools — proper descriptions, types, error feedback |
| W4 | Rate limiting + exponential backoff | reliability/rate_limiter.py — wraps all external tool calls |
| W4 | 3-tier fallback chains | reliability/fallback.py — primary -> backup -> default response |
| W4 | Tool cost tracking + budgets | reliability/cost_tracker.py — per-query budget enforcement |
| W4 | Unit + integration tests | tests/unit/ and tests/integration/ — 18+ tests |
| W4 | LLM judge evaluation | tests/evaluation/test_llm_judge.py |
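The "prompts as code" row above implies a concrete file format. A sketch of what prompts/supervisor/v1.0.0.yaml could contain; the field names here are an assumed schema, not a fixed standard:

```yaml
# prompts/supervisor/v1.0.0.yaml (illustrative schema)
version: 1.0.0
model: gpt-4o-mini          # assumed default; override per deployment
temperature: 0.0
system: |
  You are the Cortex supervisor. Classify the user's query as one of:
  knowledge, research, action. Respond with the label only.
changelog: |
  1.0.0 - initial supervisor routing prompt
```

The payoff is diffable prompt history: when routing accuracy drops, you can bisect prompt versions the same way you bisect code.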
The Integration Is the Hard Part
Any individual component above is a 2-hour exercise from one session. Making them work together — the LangGraph supervisor calling the knowledge agent which runs the RAG pipeline which checks RBAC which loads entity context from PostgreSQL and session history from Redis — that is system design. That is what this project proves you can do.
Core Components
Supervisor Agent
LangGraph state machine. Receives every query. Classifies intent. Checks user tier. Routes to one of three agents. Handles graceful degradation if sub-agent fails. Versioned prompts.
W1 W2

Knowledge Agent
Specialist for internal documents. Runs the full 7-layer RAG pipeline. Query understanding -> hybrid search -> RBAC filter -> grounded LLM answer. The core intelligence layer.
W3 (all 7 layers)

Research Agent
Handles queries requiring external information. Web search with DuckDuckGo/SerpAPI. Rate limiting. 3-tier fallback. Returns synthesised external research.
W4

Action Agent
Ticket creation, calendar queries, report generation. Validated tool schemas. Budget-aware. Every action logged. Error feedback to LLM when tools fail.
W4

Memory Layer
Redis: sliding window of last 6 exchanges (session). PostgreSQL: entity store for facts that survive restarts (name, order IDs, preferences). Two-tier architecture.
W3

Observability
Structured logging per query: user_id, agent_used, tokens, cost, latency. Retrieval dashboard: P@5, MRR, NDCG printed on test run. Cost budget enforcement.
W2 W3

The 5 Tools
Each tool demonstrates a different Week 4 production pattern. Together they form the "tool suite" — analogous to the Week 4 research assistant assignment, but purpose-built for Cortex.
| Tool | What It Does | W4 Pattern | Failure Mode |
|---|---|---|---|
knowledge_base_search | Searches internal PGVector RAG. Returns top-K docs with relevance scores. | Proper tool schema + error feedback to LLM | Returns "no relevant docs found" — LLM-recoverable message |
web_search | DuckDuckGo (free) with SerpAPI fallback. Returns top 5 results with snippets. | Rate limiting + exponential backoff | Primary fails -> SerpAPI fallback -> cached last result |
create_support_ticket | Creates a Jira-like ticket. Returns ticket ID and estimated response time. | Input validation + 3-tier fallback | API down -> in-memory queue -> returns queue ID |
get_team_calendar | Returns availability for a team or person (mock data with realistic schedule). | Auth pattern + graceful failure | Returns "calendar unavailable, try again in 5 minutes" |
generate_report | Formats collected information into a structured report with sections. | Token budget enforcement | If budget exceeded -> returns summary only, not full report |
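The Failure Mode column is the important one: tools return LLM-recoverable strings instead of raising. A sketch of that contract for knowledge_base_search (the vector-search helper is a stub standing in for the real PGVector call, and the corpus is invented):

```python
def knowledge_base_search(query: str, k: int = 5) -> str:
    """Search the internal knowledge base; return the top-k docs as text.

    On failure, return a message the LLM can act on rather than raising;
    the agent can then rephrase the query or fall back to web_search.
    """
    try:
        results = _run_vector_search(query, k)  # assumed PGVector helper
    except ConnectionError:
        return "Knowledge base unavailable; try web_search instead."
    if not results:
        return "No relevant docs found; try rephrasing the query."
    return "\n".join(results)

def _run_vector_search(query: str, k: int):
    # Stub: a real implementation embeds the query and searches PGVector.
    corpus = {"pto": ["PTO policy: 25 days/year, accrued monthly."]}
    q = query.lower()
    return [doc for key, docs in corpus.items() if key in q for doc in docs][:k]

result = knowledge_base_search("What is the PTO policy?")
```

In the real tool you would wrap this function with LangChain's @tool decorator so the docstring and type hints become the schema the LLM sees; the return-a-message-not-an-exception contract is what makes the failure mode recoverable.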
Why These 5 Specifically?
- knowledge_base_search — connects Week 4 tool design directly to the Week 3 RAG pipeline. The integration between layers is what makes this Cortex, not just a research assistant.
- web_search — the most failure-prone real-world tool. Rate limits, costs, flaky APIs. Perfect vehicle for teaching rate limiting and fallbacks.
- create_support_ticket — represents write operations. Different failure contract than read operations. Tests idempotency thinking.
- get_team_calendar — authentication and token-based auth pattern. Shows secrets management in action.
- generate_report — output tool. Tests budget enforcement — some queries should produce full reports, others a summary if tokens are expensive.
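The rate limiting and backoff behind web_search can be sketched with a token bucket plus a retry wrapper. The numbers here are illustrative, not tuned values:

```python
import random
import time

class TokenBucket:
    """Allow `rate` calls per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def acquire(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def with_backoff(fn, retries=3, base=0.01):
    """Retry fn with exponential backoff plus jitter on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base * (2 ** attempt) + random.uniform(0, base))

bucket = TokenBucket(rate=2, capacity=2)
allowed = [bucket.acquire() for _ in range(4)]  # burst of 4: only 2 pass
```

In Cortex, reliability/rate_limiter.py wraps every external tool call in this pair: the bucket stops you from hammering the API, and the backoff absorbs the 429s you get anyway.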
Folder Structure
Clean, modular, production-aligned. Every folder has a single responsibility. The structure itself communicates that you understand software architecture.
cortex/
├── README.md # Portfolio write-up with metrics + architecture diagram
├── docker-compose.yml # PGVector + Redis — starts with one command
├── requirements.txt
├── .env.example # Never commit real keys
├── main.py # Entry point: start Cortex
│
├── agents/
│ ├── supervisor.py # LangGraph state machine (W1 + W2)
│ ├── knowledge_agent.py # RAG specialist — calls rag/ pipeline (W3)
│ ├── research_agent.py # Web research + rate limiting (W4)
│ └── action_agent.py # Tools: ticket, calendar, report (W4)
│
├── rag/
│ ├── pipeline.py # Orchestrates all 7 layers
│ ├── ingestion/ # L1: PDF/docx/txt processing
│ │ └── document_loader.py
│ ├── chunking.py # L2: semantic chunking
│ ├── embeddings.py # L3: OpenAI / HuggingFace
│ ├── vector_store.py # L4: PGVector CRUD
│ ├── query_understanding.py # L5: reformulation + expansion + intent
│ ├── access_control.py # L6: RBAC tier filter
│ ├── hybrid_search.py # L7: BM25 + vector + RRF
│ └── evaluation.py # P@5 / MRR / NDCG dashboard
│
├── memory/
│ ├── session_memory.py # Redis sliding window (W3)
│ └── entity_store.py # PostgreSQL long-term entities (W3)
│
├── tools/
│ ├── web_search.py # DuckDuckGo + SerpAPI fallback (W4)
│ ├── ticketing.py # Ticket creation with validation (W4)
│ ├── calendar.py # Calendar with auth pattern (W4)
│ └── report_generator.py # Budget-enforced report tool (W4)
│
├── reliability/
│ ├── rate_limiter.py # Token bucket + exponential backoff (W4)
│ ├── fallback.py # 3-tier fallback chains (W4)
│ └── cost_tracker.py # Per-query budget enforcement (W4)
│
├── prompts/
│ ├── supervisor/
│ │ ├── v1.0.0.yaml # Version 1 (W2)
│ │ └── v1.1.0.yaml # A/B test variant
│ └── agents/
│ ├── knowledge_agent/v1.0.0.yaml
│ └── research_agent/v1.0.0.yaml
│
├── observability/
│ ├── logger.py # Structured logging: user, agent, cost (W2)
│ └── metrics.py # Aggregated cost + quality dashboard
│
├── data/
│ └── sample_knowledge_base/ # 10 sample docs: HR, IT, policy, finance
│
└── tests/
├── unit/
│ ├── test_tools.py # Tool schema + error handling (W4)
│ ├── test_rate_limiter.py
│ ├── test_fallback.py
│ ├── test_cost_tracker.py
│ └── test_rag_pipeline.py
├── integration/
│ ├── test_supervisor_routing.py
│ ├── test_knowledge_agent.py
│ └── test_full_pipeline.py
└── evaluation/
└── test_llm_judge.py # LLM-as-judge (W4)
Milestone 1 — Core System
Target: End of Week 4 | "It works."
The foundational plumbing. Supervisor routes. Knowledge Agent retrieves. You can have a multi-turn conversation that returns grounded answers from your document set. Memory holds context for the session. Logging tracks costs.
M1 Checklist
- Docker Compose running: PGVector + Redis both healthy
- Sample knowledge base loaded: 10 documents ingested into PGVector
- Supervisor agent implemented: LangGraph graph with 3 nodes (supervisor, knowledge, fallback)
- Knowledge Agent working: runs full 7-layer RAG pipeline end-to-end
- Basic memory: Redis session sliding window stores last 6 messages
- 2 tools functional: knowledge_base_search + web_search
- Structured logging: every query logs user_id, agent, tokens, cost
- Can answer: "What's the PTO policy?" (internal) and "What is RAG?" (web)
- 3 unit tests passing for core components
Common Mistakes to Avoid
- Don't skip the Docker setup — PGVector must be running or RAG doesn't work. Run docker-compose up -d as your first step.
- Don't make the supervisor too complex — start with a simple 3-state graph (route to knowledge, route to research, or say "I don't know"). You can add more states in M2.
- Don't try to do M2 in M1 — P@5 measurement, all 5 tools, and PostgreSQL entity memory are M2. Get the core working first.
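A minimal docker-compose.yml for the two services might look like the following. Image tags and credentials are assumptions for local development, not a vetted production config:

```yaml
# docker-compose.yml (dev sketch)
services:
  pgvector:
    image: pgvector/pgvector:pg16   # assumed tag; pin whatever you test against
    environment:
      POSTGRES_USER: cortex
      POSTGRES_PASSWORD: cortex     # dev only; real creds belong in .env
      POSTGRES_DB: cortex
    ports: ["5432:5432"]
  redis:
    image: redis:7
    ports: ["6379:6379"]
```

One `docker-compose up -d` and both dependencies are healthy; that is the "one-command deploy" claim in your README.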
Milestone 2 — Full Production Stack
Target: 1 Week After M1 | "It doesn't break."
The full reliability and memory stack. All 5 tools. Rate limiting wrapping every external call. 3-tier fallbacks. PostgreSQL entity memory persisting across sessions. Versioned prompts. Budget enforcement. The system survives realistic failure scenarios.
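The 3-tier fallback chain is simpler than it sounds: try each backend in order, and if every tier fails, return a graceful default instead of an exception. A sketch (the backend functions are placeholders simulating the DuckDuckGo-then-SerpAPI chain):

```python
def run_with_fallbacks(query, chain, default="Sorry, I couldn't reach any search backend."):
    """Try each callable in `chain` in order; return the first success.

    If every tier raises, return a graceful default so the agent
    never crashes and never fails silently.
    """
    for tier in chain:
        try:
            return tier(query)
        except Exception:
            continue  # fall through to the next tier
    return default

def primary(q):
    raise TimeoutError("rate limited")  # simulate DuckDuckGo failing

def backup(q):
    return f"[serpapi] results for {q}"  # hypothetical backup tier

answer = run_with_fallbacks("what is RAG", [primary, backup])
```

reliability/fallback.py generalises this to primary -> backup -> cached/default, and the integration tests kill each tier in turn to prove the chain holds.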
M2 Checklist
- All 5 tools implemented with proper @tool schemas
- Rate limiter wrapping web_search and external API calls
- 3-tier fallback chain: primary -> backup -> graceful default
- PostgreSQL entity store: entities extracted and persisted across sessions
- Versioned prompts: supervisor and all agent prompts in YAML files
- Token budget enforcement: per-query limit, stops before going over
- Research Agent and Action Agent fully functional
- Graceful degradation: if Knowledge Agent fails, system doesn't crash
- 10+ unit tests passing (tools, rate limiter, fallback, cost tracker)
- Server restart test: session restored from Redis, entities from PostgreSQL
M2 Definition of Done Test
Run this scenario to verify M2 is complete:
- Start a conversation as "standard employee Riya, order ORD-789"
- Kill the process (Ctrl+C)
- Restart the process with the same session_id
- Ask "What was my order number?" — it should answer ORD-789
- Search for something external with web_search — should not crash even if the primary API is rate-limited
- Ask for a report that would exceed your token budget — should return summary instead
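What the restart steps above actually exercise is the split between session memory (lost with the process) and entity memory (persisted). A stdlib sketch of that split, with a deque standing in for Redis LPUSH + LTRIM and a dict standing in for the PostgreSQL entity table:

```python
from collections import deque

class SessionMemory:
    """Sliding window of the last N turns.

    In Redis this is LPUSH + LTRIM on a per-session list key;
    a bounded deque has the same semantics in-process.
    """
    def __init__(self, window=6):
        self.turns = deque(maxlen=window)

    def add(self, role, text):
        self.turns.append((role, text))

ENTITY_STORE = {}  # stand-in for the PostgreSQL entity table

def remember_entity(user_id, key, value):
    ENTITY_STORE.setdefault(user_id, {})[key] = value

def recall_entity(user_id, key):
    return ENTITY_STORE.get(user_id, {}).get(key)

# Turn 1: Riya mentions her order; it is extracted and persisted.
remember_entity("emp_123", "order_id", "ORD-789")

# "Process restart": the session window starts empty, but the entity
# store survives (in Cortex it is a real PostgreSQL table, so it does).
session = SessionMemory()
order = recall_entity("emp_123", "order_id")
```

The M2 restart test passes exactly when this split is real: if the order ID lived only in the conversation buffer, step 4 would fail.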
Milestone 3 — Portfolio Ready
Target: Before Week 5 | "I can defend every decision."
The evaluation layer, the full test suite, and the portfolio packaging. P@5 and MRR numbers on a test run. LLM judge evaluation scoring response quality. A clean README with real benchmark numbers. An architecture diagram. The project is ready to show to a hiring manager.
M3 Checklist
- Retrieval evaluation: 20-query golden dataset with labelled relevant docs
- P@5 >= 0.70 achieved (if below 0.70 — fix chunking or query expansion first)
- MRR >= 0.65 achieved
- Evaluation dashboard prints on pytest tests/evaluation/ run
- LLM judge test: 5 queries evaluated for faithfulness and relevance
- Full test suite: 18+ tests, all passing
- Architecture diagram in README (draw.io or mermaid)
- README sections: Problem, Solution, Architecture, How to Run, Benchmark Numbers
- Docker: docker-compose up && python main.py starts the full system
- .env.example with all required variables documented
- Video demo (optional stretch): 3-minute Loom showing a full multi-turn conversation
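P@5 and MRR are a few lines each, which is why there is no excuse for shipping RAG without them. A sketch against a hypothetical golden-dataset entry (doc labels invented for illustration):

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved docs labelled relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mrr(all_retrieved, all_relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant doc."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(all_retrieved)

# Query 1: relevant docs are A and C; two of the top 5 are relevant.
p5 = precision_at_k(["A", "B", "C", "D", "E"], {"A", "C"})  # 2/5 = 0.4

# Two queries: first relevant doc at rank 2, then rank 1 -> (0.5 + 1.0) / 2
score = mrr([["B", "A"], ["C"]], [{"A"}, {"C"}])  # 0.75
```

rag/evaluation.py runs these over the 20-query golden dataset on every test run; the numbers it prints are the ones that go in your README.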
What Makes M3 Different
M1 proves it works. M2 proves it's reliable. M3 proves you can measure it and communicate it. A hiring manager won't run your code. They'll read your README and look at your numbers. M3 is where you package the work into something that communicates its value without a demo.
Success Metrics
These are the numbers your README should contain and your test suite should produce automatically.
Retrieval Quality
Reliability Targets
Memory Correctness
| Test Scenario | Expected Result |
|---|---|
| Server restart mid-session | Entity memory restored from PostgreSQL |
| Turn 8: "What was my order number?" (told in turn 1) | Correct entity retrieved |
| New session, same user | Long-term entities loaded from PostgreSQL |
| Standard employee queries financial report | Access denied — RBAC filters doc |
Grading Rubric
Maximum: 100 points. Portfolio-worthy threshold: 75 points. Distinction: 90 points.
| Category | Max | Excellent (Full) | Good (70%) | Acceptable (50%) |
|---|---|---|---|---|
| Architecture (W1/W2) | 15 | LangGraph supervisor working, all 3 agents routing correctly, conditional edges for all 3 user tiers | Supervisor routes to 2 agents, basic conditional routing | Simple if/else routing without LangGraph |
| RAG Pipeline (W3) | 20 | All 7 layers, hybrid search, P@5 >= 0.70 demonstrated | 5+ layers, basic vector search, P@5 >= 0.55 | 3 layers (chunk, embed, retrieve), no metrics |
| Memory System (W3) | 15 | Redis + PostgreSQL both working, server restart test passes, entity extraction demonstrated | Redis session working, no PostgreSQL | In-memory only — buffer memory |
| Tools & Reliability (W4) | 20 | All 5 tools, rate limiting on external calls, 3-tier fallbacks, budget enforcement all working | 3+ tools, rate limiting working, no fallbacks | 2 tools, no reliability patterns |
| Test Suite (W4) | 15 | 18+ tests passing, unit + integration + LLM judge evaluation | 10+ tests, unit only | 5+ tests, mostly smoke tests |
| Observability (W2/W3) | 5 | Cost per query logged, P@5/MRR printed on eval run | Cost logged, no retrieval metrics | Basic print logging only |
| Portfolio Packaging | 10 | Docker works, README has metrics, architecture diagram, clear run instructions | README exists, no metrics, no diagram | Code only, no documentation |
Bonus Points (up to +10)
- +3 — Prompt A/B testing framework working (two prompt variants, comparison output)
- +3 — Video demo (3-minute Loom showing multi-turn conversation end-to-end)
- +4 — Streamlit or FastAPI front-end for Cortex (even simple chat interface)
Getting Started — Your First 2 Hours
Follow this sequence exactly. Don't skip ahead to the interesting parts before the plumbing is working.
Hour 1 — Infrastructure & Skeleton
# Step 1: Create the project folder and git init
mkdir cortex && cd cortex && git init

# Step 2: docker-compose.yml
docker-compose up -d
docker ps  # Should show pgvector and redis both running

# Step 3: Install dependencies
pip install langchain langgraph langchain-openai langchain-community
pip install pgvector psycopg2-binary redis python-dotenv
pip install rank-bm25 pytest pytest-mock pyyaml

# Step 4: Create .env from example
cp .env.example .env
# Fill in OPENAI_API_KEY — everything else has defaults

# Step 5: Create folder skeleton
mkdir -p agents rag/ingestion memory tools reliability \
  prompts/supervisor observability tests/unit \
  tests/integration data/sample_knowledge_base
Hour 2 — First Working Query
# Step 6: Copy the sample documents (from Solution_Code_Snippets/data/)
# 10 documents: hr_policy.txt, it_runbook.txt, parental_leave.txt, etc.
# Step 7: Ingest documents into PGVector
python -c "from rag.pipeline import ingest_documents; \
ingest_documents('data/sample_knowledge_base')"
# Step 8: Test a single RAG query (before supervisor)
python -c "
from rag.pipeline import query
result = query('What is the PTO policy?', user_tier='standard')
print(result)
"
# Step 9: Start the supervisor (basic version)
python main.py
# Should respond to: 'What is our parental leave policy?'
# Should respond to: 'What is machine learning?' (routes to web search)
First Milestone Check
You should be able to run two queries:
- "What is the PTO policy?" -> routed to Knowledge Agent -> retrieves from PGVector -> grounded answer
- "What is a large language model?" -> routed to Research Agent -> returns web search result
If both of these work — your foundation is solid. Proceed to M2 components.
Implementation Order Matters
Build in this order: Infrastructure (Docker) -> RAG pipeline -> Knowledge Agent -> Supervisor (basic) -> Memory -> Remaining tools -> Reliability stack -> Tests -> Evaluation. Resist the urge to build the supervisor first. The supervisor is only as good as the agents it coordinates.