Learn how the Umaku team built a context-aware semantic code search engine that goes beyond basic RAG to understand code structure, authorship, and intent.

By the Umaku Engineering Team
At Omdena, we manage hundreds of concurrent AI projects through Umaku, our AI-integrated project management platform. Our teams aren’t just moving tickets on a board; they are pushing code across diverse stacks—from React Native frontends to complex Python backend agents and Jupyter Notebooks filled with data science experiments.
We wanted to build a “Context-Aware Chatbot” that could answer questions like “Why is the authentication module failing in the latest sprint?” or “Who is the best person to fix this LangChain bug?”
However, we quickly realized that standard RAG (Retrieval-Augmented Generation) pipelines fail miserably at code. Naive chunking breaks function definitions, ignores file paths, and chokes on the JSON structure of Jupyter Notebooks.
In this four-part series, we will break down the architecture of our Semantic Code Search Engine. We’ll share how we moved beyond simple keyword matching to a system that understands the structure of code and the people behind it.
In Part 1, we dive into the Ingestion Engine: why we built a custom Commit Chunker API and how we solved the “messy data” problem of real-world repositories.
When building a RAG system for documentation, you can usually get away with splitting text every 500 tokens. Code is different. It is highly structured, interdependent, and brittle.
We identified three critical failures when using off-the-shelf splitters (like the standard LangChain or LlamaIndex loaders) for our use case:

- Broken boundaries: splitting every N tokens severs functions and classes mid-definition, producing chunks that are syntactically meaningless.
- Lost context: chunks arrive stripped of their file path, commit, and author, so retrieval cannot answer "where is this?" or "who wrote this?"
- Notebook noise: .ipynb files are raw JSON, and naive splitters embed cell metadata and base64 output blobs right alongside the code.
To solve this, we built the GitHub Commit Chunker API, a specialized FastAPI service designed to treat code as a first-class citizen.
The Solution: A “Smart” Ingestion Pipeline
Our solution isn’t just a script; it’s a dedicated microservice that sits between GitHub and our Vector Database (Pinecone). It adheres to a strict set of core principles designed to preserve semantic integrity.
We enforce a rule that small files (≤3000 characters) are never split. This ensures that config files (like package.json or Dockerfile) remain atomic. You never want to retrieve half a JSON file.
For files larger than 3000 characters, we don’t just split by character count. We use the LlamaIndex CodeSplitter, which utilizes tree-sitter parsers to understand the abstract syntax tree (AST) of the language. This ensures we split at logical boundaries—classes and functions—rather than in the middle of a for loop.
Since Omdena projects are AI-heavy, roughly 40% of our code experiments live in Jupyter Notebooks. Our extractor parses the raw JSON of the .ipynb file, discards markdown cells and output binaries, and extracts only the executable code cells before chunking.
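For illustration, here is a minimal sketch of what a notebook reader like our read_notebook_content helper does (simplified; this version assumes the blob follows the standard nbformat JSON schema):

```python
import json

def read_notebook_content(raw_bytes):
    """Extract only executable code cells from a raw .ipynb blob.

    Markdown cells and output payloads (plots, base64 images) are
    discarded so they never pollute the embeddings.
    """
    notebook = json.loads(raw_bytes.decode("utf-8"))
    code_cells = [
        "".join(cell["source"])
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    # Join cells with a separator so chunk boundaries stay readable
    return "\n\n# --- next cell ---\n\n".join(code_cells)
```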
The logic is encapsulated in our GitHubRepoExtractor class. Below is the decision matrix we use for every single file in a commit.

Figure 1: The decision matrix behind our Smart Chunking API. Unlike standard splitters, we enforce file integrity and treat Jupyter Notebooks as first-class citizens in the ingestion pipeline. Specifically, note the separate path for .ipynb files which bypasses standard text parsing to avoid “JSON noise” in our embeddings.
Here is a simplified look at how we handle the extraction strategy within our FastAPI service. We utilize a CodeSplitter that dynamically adjusts based on the file extension.
def get_file_content(self, file_path, content):
    """Intelligent content extraction based on file type."""
    if file_path.endswith('.ipynb'):
        # Special handling for Notebooks: extract code cells only
        return self.read_notebook_content(content)
    else:
        # Standard text decoding for code files
        return content.decode('utf-8')

def chunk_file(self, file_path, content, max_chars=3000):
    """Applies the 'Smart Splitting' logic."""
    # Rule 1: Atomic chunks for small files
    if len(content) <= max_chars:
        return [Chunk(content=content, id=f"{self.commit_sha}_0")]

    # Rule 2: AST-based splitting for large files
    try:
        splitter = CodeSplitter(
            language=self.get_language(file_path),
            chunk_lines=40,          # Semantic window size
            chunk_lines_overlap=15,
            max_chars=max_chars,
        )
        raw_chunks = splitter.split_text(content)
        return [
            Chunk(content=c, id=f"{self.commit_sha}_{i}")
            for i, c in enumerate(raw_chunks)
        ]
    except Exception:
        # Fallback to simple character splitting if the parser fails
        return self.simple_split(content)
Building the extractor is only step one. In a live environment like Umaku, this needs to happen asynchronously every time a developer pushes code.
We implemented an event-driven architecture. When a commit hits a tracked branch, a webhook triggers our Chunker API. The API processes the files and returns structured JSON objects that include not just the code, but the context metadata: commit_sha, author_id, file_path, and change_status (added/modified/deleted).
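As a sketch of that transformation layer, here is roughly how a push payload can be flattened into per-file events carrying the context metadata (simplified from the real service, which wraps this in a FastAPI route; the payload shape follows GitHub's standard push event):

```python
def commit_events_from_push(payload):
    """Flatten a GitHub push webhook payload into per-file events
    carrying the context metadata we index alongside each chunk."""
    events = []
    for commit in payload.get("commits", []):
        for status, files in (
            ("added", commit.get("added", [])),
            ("modified", commit.get("modified", [])),
            ("deleted", commit.get("removed", [])),
        ):
            for path in files:
                events.append({
                    "commit_sha": commit["id"],
                    "author_id": commit["author"]["email"],
                    "file_path": path,
                    "change_status": status,
                })
    return events
```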

Figure 2: The asynchronous ingestion pipeline. Commits are processed in real-time. The Chunker API (FastAPI) acts as the transformation layer, converting raw GitHub blobs into semantically rich chunks before they are embedded by our model fleet (OpenAI/Voyage) and indexed in Pinecone.
Notice that we don’t just store the text. We store the author_id. This is the foundational layer for the “People” aspect of our search engine. By indexing the committer_author_id alongside the code vector, we prepare the system for complex queries like “Show me the changes Ahmed made to the payment service last week.”
In Part 1, we detailed how we built the GitHub Commit Chunker API to solve the “Garbage In, Garbage Out” problem. We now have a clean, structured stream of code chunks—where functions are kept intact, Jupyter Notebooks are parsed correctly, and every line of code is traceable to a specific commit.
But a clean JSON object is not a search engine. To allow our AI Agents to answer questions like “How does the payment retry logic work?” or “Who introduced the latency in the last sprint?”, we need to translate these code blocks into a format the machine understands: Vectors.
In Part 2, we explore our indexing strategy. We will cover why we use specialized embedding models for code, how we utilize Pinecone Serverless for scale, and—most importantly—how we built the People-Code Graph to link anonymous Git commits to real Umaku user profiles.
Standard NLP embeddings (like the older BERT models) are great for English text but struggle with the syntax of Python or TypeScript. Code relies heavily on structure, variable naming conventions, and logic flow, which doesn’t always map 1:1 to natural language.
At Umaku, we adopted a multi-model strategy to ensure high-fidelity retrieval, evaluating several leaders in the space before settling on a fleet of code-capable embedding models from OpenAI and Voyage.
By generating embeddings using models optimized for code, we ensure that a query for “Database Connection” retrieves PostgresClient.connect() rather than just text files containing the word “database.”
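Generating the embeddings themselves is a thin wrapper around the provider's API. A hedged sketch, assuming an OpenAI-style client; the model name here is illustrative, and the real fleet swaps providers based on evaluation:

```python
def embed_code_chunks(texts, client, model="text-embedding-3-large"):
    """Embed a batch of code chunks with a code-capable model.

    `client` is an OpenAI-style client exposing
    client.embeddings.create(model=..., input=...); the model name
    is illustrative -- swap in whichever code-optimized model
    (OpenAI, Voyage, etc.) your evaluation favors.
    """
    response = client.embeddings.create(model=model, input=texts)
    # The API returns one embedding per input, in input order
    return [item.embedding for item in response.data]
```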
This is where standard RAG pipelines stop, and where Umaku begins.
In a standard repository, code is owned by a committer_author_id. This is often a shorthand (e.g., ahmed-dev) or a personal email (ahmed@gmail.com). However, in the Umaku platform, “Ahmed” is a Project Manager with a specific role, a history of completed sprints, and a specific set of skills.
If a user asks the chatbot, “How is Ahmed performing on the backend tasks?”, the system cannot answer if it only knows about ahmed-dev.
We solved this by building an Identity Resolution Layer that sits before the vector database.
When our Ingestion Engine processes a commit, it doesn’t just embed the code. It performs a lookup against the Umaku User Database. It attempts to map the Git email/handle to a Umaku Profile ID.
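A minimal sketch of that lookup, with user_directory standing in for the Umaku User Database (the real version is a database query; the sentinel value for unknown identities is illustrative):

```python
def resolve_umaku_identity(git_author_email, git_handle, user_directory):
    """Map a raw Git identity to a Umaku profile ID.

    `user_directory` stands in for the Umaku User Database: a mapping
    from known emails/handles to profile IDs.
    """
    # Prefer the committer email; fall back to the GitHub handle
    for key in (git_author_email.lower(), git_handle.lower()):
        if key in user_directory:
            return user_directory[key]
    # Unresolved identities are indexed with a sentinel so the code
    # is still searchable, just not attributable to a profile
    return "USR-UNRESOLVED"
```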

Figure 3: The Umaku Identity Graph. We don’t just index code; we index the relationships between developers, their commits, and their project management tickets. This graph allows the system to traverse from a natural language query about a person (“Ahmed”) to their specific digital exhaust (Git Identity) and finally to the code artifacts they authored.
This mapping allows us to “hydrate” our vector metadata. Instead of a blind vector, our payload to the database looks like this:
{
  "id": "commit_7f3a1_0",
  "values": [0.02, -0.15, 0.88, ...],
  "metadata": {
    "file_path": "backend/auth/login.py",
    "language": "python",
    "commit_sha": "7f3a1...",
    "git_author": "ahmed-dev",
    "Umaku_user_id": "USR-8821-X",  // <--- The Critical Link
    "sprint_id": "SPRINT-24",
    "repo_context": "payment-service"
  }
}
By embedding this Umaku User ID directly into the vector metadata, we unlock powerful filtering capabilities. The AI Agent can now scope a semantic query to a single person's work: a question like "Show me the changes Ahmed made to the payment service last week" becomes a vector search filtered on Umaku_user_id, repo_context, and sprint_id.
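Concretely, such a scoped query can be expressed as a Pinecone metadata filter. A sketch, with build_people_filter as an illustrative helper (Pinecone's filter syntax uses MongoDB-style operators like $eq and $and):

```python
def build_people_filter(umaku_user_id, sprint_id=None, repo=None):
    """Compose a Pinecone metadata filter scoping search to one
    person's commits, optionally narrowed to a sprint and repo."""
    clauses = [{"Umaku_user_id": {"$eq": umaku_user_id}}]
    if sprint_id:
        clauses.append({"sprint_id": {"$eq": sprint_id}})
    if repo:
        clauses.append({"repo_context": {"$eq": repo}})
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

# Usage against a Pinecone index (illustrative):
# index.query(
#     vector=query_embedding,
#     filter=build_people_filter("USR-8821-X", sprint_id="SPRINT-24"),
#     top_k=5,
#     include_metadata=True,
# )
```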
The final indexing workflow combines the Chunker API (from Part 1) with this metadata enrichment before each chunk is embedded and upserted into Pinecone.
In Part 1, we built a “smart” ingestion engine to chunk code while preserving file integrity. In Part 2, we enriched those chunks with “People” metadata and indexed them in Pinecone.
Now comes the hardest part: Retrieval.
The reality of RAG (Retrieval-Augmented Generation) is that users are terrible at searching. A developer won’t type: “Select top 5 chunks from namespace backend-repo where cosine_similarity(embedding, ‘AuthService’) > 0.8.”
Instead, they type: “Why is the login broken?”
If we feed that raw string into a vector database, we get noise. The word “broken” rarely appears in code (hopefully), and “login” is too generic. To bridge the gap between vague human intent and precise code execution, we engineered a multi-stage Advanced Retrieval Architecture.
In Part 3, we break down the “Brain” of Umaku: Query Rewriting, Hallucinated Code (HyDE), and our Reranking pipeline.
Before we search, we need to know what we are looking for. We utilize a lightweight routing agent that classifies the user's prompt into one of three domains: Code (technical questions about the repositories), People (team membership, ownership, and performance), or Project Management (sprints, tickets, and board status).
This routing step prevents us from wasting compute on semantic code search when the user just wants to know who is on the team.
Once the router identifies a code-related query (e.g., “Why is the login broken?”), we don’t just search for it. We transform it.
We employ two specific techniques to expand the search space:
We use an LLM call to expand the query into domain-specific terms. The rewriter analyzes the project stack (known from Umaku metadata) and adds synonyms.
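A sketch of the prompt this rewriter might use; the exact wording and the project_stack parameter are illustrative:

```python
def build_rewrite_prompt(user_query, project_stack):
    """Compose the LLM prompt used to expand a vague query with
    stack-specific synonyms. The wording here is illustrative."""
    stack = ", ".join(project_stack)
    return (
        "You are a code-search query rewriter.\n"
        f"The project uses: {stack}.\n"
        f"User query: {user_query!r}\n"
        "Rewrite the query as 3-5 short search strings using concrete "
        "identifiers, library names, and error terms a developer on "
        "this stack would use. Return one per line."
    )
```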
This is our secret weapon for code search. HyDE assumes that a good way to find a code snippet is to write a fake code snippet that looks like it.
When a user asks “How do we handle PDF parsing?”, the HyDE agent generates a hallucinated, dummy Python function using standard libraries (e.g., PyPDF2). We then embed this fake function and search our vector database for real code that looks mathematically similar to the fake code.
This effectively bypasses the language barrier. We aren’t matching English to Python; we are matching (Fake) Python to (Real) Python.
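The HyDE flow can be sketched as a small pipeline with injected model calls; generate_fn, embed_fn, and search_fn are placeholders for whichever LLM, embedder, and vector store you wire in:

```python
def hyde_search(question, generate_fn, embed_fn, search_fn, top_k=10):
    """Hypothetical Document Embeddings (HyDE) for code search.

    generate_fn: LLM call returning a plausible *fake* code snippet
    embed_fn:    embeds text into a vector
    search_fn:   queries the vector DB with a vector
    All three are injected so the pipeline stays model-agnostic.
    """
    fake_snippet = generate_fn(
        f"Write a short, plausible Python function answering: {question}"
    )
    # Embed the hallucinated code, not the English question:
    # (Fake) Python is mathematically closer to (Real) Python.
    return search_fn(embed_fn(fake_snippet), top_k=top_k)
```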

Figure 4: Moving beyond simple vector search. Our retrieval pipeline utilizes Query Expansion, Hypothetical Document Embeddings (HyDE), and Cross-Encoder Reranking to ensure the LLM receives only the most relevant context. Note how “Hybrid Search” combines dense vectors (concept matching) with sparse vectors (exact keyword matching) to catch specific variable names.
We now have a set of robust queries. We execute them against Pinecone using Hybrid Search (Dense Vectors + Sparse BM25). This gives us a broad net—perhaps 50 candidate chunks.
However, LLM context windows are precious. We can’t feed 50 chunks to the model; it will get “lost in the middle.”
We introduce a Cross-Encoder Reranker (we utilize Cohere Rerank v3 or BGE-Reranker-M3) at the end of the pipeline.
The Reranker takes the user’s original question and the 50 candidate chunks, and it scores them one by one based on strict relevance. It acts as a semantic filter, ruthlessly discarding false positives.
We reduce the top 50 down to the “Golden Top 5” before sending them to the generation model.
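The reranking stage reduces to a simple pattern once the model call is abstracted away. A sketch, with score_fn standing in for the actual cross-encoder (Cohere Rerank v3, BGE-Reranker-M3, or similar):

```python
def rerank_candidates(question, candidates, score_fn, keep=5):
    """Cross-encoder-style reranking: score each (question, chunk)
    pair jointly and keep only the highest-relevance chunks.

    `score_fn(question, chunk) -> float` abstracts the actual
    reranking model.
    """
    scored = [(score_fn(question, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]
```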
Retrieval is rarely linear. Sometimes, finding the code isn’t enough; you need to know the context around the code.
Umaku uses a Multi-Agent Orchestrator pattern. When a complex query arrives—“Did Ahmed fix the bug in the auth service that was reported last sprint?”—single-step retrieval fails.
The Orchestrator breaks this into sub-tasks: first resolve "Ahmed" to a Umaku profile ID, then pull the bug tickets reported against the auth service in the last sprint from the board, then search the vector index for Ahmed's commits touching that service, and finally merge the results into a single answer.
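In code, the delegation pattern looks roughly like this sketch, with the sub-agents injected as callables (the real router plans these steps with an LLM rather than hard-coding the sequence):

```python
def answer_composite_query(query, people_agent, pm_agent, code_agent):
    """Orchestrator pattern: decompose a composite question into
    sub-tasks and delegate each to a specialized agent."""
    user_id = people_agent(query)                 # "Ahmed" -> Umaku ID
    ticket = pm_agent(query, user_id)             # bug from the board
    commits = code_agent(query, user_id, ticket)  # matching code chunks
    # Merge the evidence into one structured answer for generation
    return {"user": user_id, "ticket": ticket, "evidence": commits}
```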

Figure 5: The Orchestrator Pattern. A central router dynamically delegates tasks to specialized sub-agents. This allows the system to answer composite questions that require data from both the Project Management platform (Umaku) and the codebase (GitHub), merging them into a single coherent answer.
In this part and the previous parts of this series, we built a sophisticated engine under the hood. We created a Smart Ingestion Pipeline (Part 1), a People-Code Knowledge Graph (Part 2), and an Advanced Retrieval Brain using HyDE and Rerankers (Part 3).
But a Ferrari engine is useless without a steering wheel.
For our developers at Omdena, “context switching” is the enemy. If a frontend engineer working in Cursor has to minimize their IDE, open the Umaku dashboard, and type a query to find out how the backend API works, we have failed.
In this final part, we reveal how we brought the Umaku Search Engine directly into the developer’s environment using the Model Context Protocol (MCP) and how we closed the loop with Agentic Sprint Reports.
We didn’t want to build a custom VS Code extension from scratch. Instead, we adopted the Model Context Protocol (MCP), an open standard that allows AI models (like Claude 3.5 Sonnet running inside Cursor) to talk to external data servers.
We built a lightweight Umaku MCP Server using Python. It exposes our retrieval pipeline as a set of “Tools” that Cursor can discover and invoke via JSON-RPC.
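To give a feel for the wire format, here is a minimal, hand-rolled JSON-RPC dispatch of the kind MCP tool calls ride on (illustrative only; the real server registers tools through the MCP Python SDK rather than this dict, and the tool name and response shape are assumptions):

```python
import json

# Illustrative tool registry; the real server exposes our retrieval
# pipeline here instead of this stub.
TOOLS = {
    "umaku_code_search": lambda args: {
        "chunks": [f"results for: {args['query']}"]
    },
}

def handle_jsonrpc(raw_request):
    """Minimal JSON-RPC 2.0 handler sketching an MCP tools/call."""
    req = json.loads(raw_request)
    tool = TOOLS.get(req["params"]["name"])
    if tool is None:
        return json.dumps({"jsonrpc": "2.0", "id": req["id"],
                           "error": {"code": -32601, "message": "Unknown tool"}})
    result = tool(req["params"]["arguments"])
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
```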
When a developer in Cursor types:
@Umaku fetch the interface for the payment user profile
The following handshake occurs invisibly:

Figure 6: Bridging the gap with MCP. The Model Context Protocol acts as a universal translator. Developers working in Cursor on the frontend can transparently query the Umaku backend codebase. The diagram illustrates the JSON-RPC flow where the IDE (Client) requests context, and our Umaku Service acts as the Server, delivering “hydrated” code snippets without the user ever leaving their editor.
This integration solves the “Blind Frontend” problem. A React Native developer usually treats the backend as a black box. With Umaku MCP, they can highlight a function call in their editor and ask: “What data shape does this endpoint expect?”
Umaku retrieves the actual Pydantic model from the backend repo and inserts it into their chat context. No Swagger docs needed. No Slack messages sent.
Retrieval is reactive. We wanted Umaku to be proactive.
Every two weeks, a sprint ends. Traditionally, a Project Manager spends hours compiling a report: What was done? What was descoped? Who overperformed?
We built a fleet of Observer Agents that run automatically when a sprint is marked “Closed” in Umaku.
These agents generate a structured Sprint Retrospective PDF that is delivered to the PM. This isn’t just a summary; it’s a graded assessment.

Figure 7: The Continuous Improvement Loop. AI Agents act as autonomous auditors. At the end of every sprint, they analyze the “Chunked Codebase” and “Board Status” to generate actionable insights. This feedback is fed into the next sprint’s planning session, ensuring that technical debt is identified before it becomes a crisis.
We started this journey with a simple goal: make code searchable. We ended up building a system that understands the life of a project.
By moving Beyond Keywords, we enabled our AI to understand that code isn’t just text—it’s logic.
By indexing People, we acknowledged that software is a human endeavor.
And by integrating with MCP, we met our developers where they live.
At Omdena, this architecture has reduced our "Time to Resolution" for technical questions by 40%. New team members no longer feel lost; they have a context-aware mentor available 24/7.
Umaku isn’t just managing our projects anymore. It’s understanding them.
Interested in building agentic workflows? Join our community at Omdena.