Learn how the Umaku team built a context-aware semantic code search engine that goes beyond basic RAG to understand code structure, authorship, and intent.

By the Umaku Engineering Team
At Omdena, we manage hundreds of concurrent AI projects through Umaku, our AI-integrated project management platform. Our teams aren’t just moving tickets on a board; they are pushing code across diverse stacks—from React Native frontends to complex Python backend agents and Jupyter Notebooks filled with data science experiments.
We wanted to build a “Context-Aware Chatbot” that could answer questions like “Why is the authentication module failing in the latest sprint?” or “Who is the best person to fix this LangChain bug?”
However, we quickly realized that standard RAG (Retrieval-Augmented Generation) pipelines fail miserably at code. Naive chunking breaks function definitions, ignores file paths, and chokes on the JSON structure of Jupyter Notebooks.
In this four-part series, we will break down the architecture of our Semantic Code Search Engine. We’ll share how we moved beyond simple keyword matching to a system that understands the structure of code and the people behind it.
In Part 1, we dive into the Ingestion Engine: why we built a custom Commit Chunker API and how we solved the “messy data” problem of real-world repositories.
When building a RAG system for documentation, you can usually get away with splitting text every 500 tokens. Code is different. It is highly structured, interdependent, and brittle.
We identified three critical failures when using off-the-shelf splitters (like the standard LangChain or LlamaIndex loaders) for our use case:

- Broken boundaries: splitting every N tokens severs functions and classes mid-definition, producing chunks that are syntactically meaningless.
- Lost context: chunks arrive stripped of their file path, commit, and author, so retrieval cannot answer "where is this?" or "who wrote this?"
- Notebook noise: .ipynb files are raw JSON, and naive splitters embed cell metadata and base64 output blobs right alongside the code.
To solve this, we built the GitHub Commit Chunker API, a specialized FastAPI service designed to treat code as a first-class citizen.
The Solution: A “Smart” Ingestion Pipeline
Our solution isn’t just a script; it’s a dedicated microservice that sits between GitHub and our Vector Database (Pinecone). It adheres to a strict set of core principles designed to preserve semantic integrity.
We enforce a rule that small files (≤3000 characters) are never split. This ensures that config files (like package.json or Dockerfile) remain atomic. You never want to retrieve half a JSON file.
For files larger than 3000 characters, we don’t just split by character count. We use the LlamaIndex CodeSplitter, which utilizes tree-sitter parsers to understand the abstract syntax tree (AST) of the language. This ensures we split at logical boundaries—classes and functions—rather than in the middle of a for loop.
Since Omdena projects are AI-heavy, roughly 40% of our code experiments live in Jupyter Notebooks. Our extractor parses the raw JSON of the .ipynb file, discards markdown cells and output binaries, and extracts only the executable code cells before chunking.
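For illustration, here is a minimal sketch of what a notebook reader like our read_notebook_content helper does (simplified; this version assumes the blob follows the standard nbformat JSON schema):

```python
import json

def read_notebook_content(raw_bytes):
    """Extract only executable code cells from a raw .ipynb blob.

    Markdown cells and output payloads (plots, base64 images) are
    discarded so they never pollute the embeddings.
    """
    notebook = json.loads(raw_bytes.decode("utf-8"))
    code_cells = [
        "".join(cell["source"])
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    # Join cells with a separator so chunk boundaries stay readable
    return "\n\n# --- next cell ---\n\n".join(code_cells)
```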
The logic is encapsulated in our GitHubRepoExtractor class. Below is the decision matrix we use for every single file in a commit.

Figure 1: The decision matrix behind our Smart Chunking API. Unlike standard splitters, we enforce file integrity and treat Jupyter Notebooks as first-class citizens in the ingestion pipeline. Specifically, note the separate path for .ipynb files which bypasses standard text parsing to avoid “JSON noise” in our embeddings.
Here is a simplified look at how we handle the extraction strategy within our FastAPI service. We utilize a CodeSplitter that dynamically adjusts based on the file extension.
def get_file_content(self, file_path, content):
    """Intelligent content extraction based on file type."""
    if file_path.endswith('.ipynb'):
        # Special handling for Notebooks: extract code cells only
        return self.read_notebook_content(content)
    else:
        # Standard text decoding for code files
        return content.decode('utf-8')

def chunk_file(self, file_path, content, max_chars=3000):
    """Applies the 'Smart Splitting' logic."""
    # Rule 1: Atomic chunks for small files
    if len(content) <= max_chars:
        return [Chunk(content=content, id=f"{self.commit_sha}_0")]

    # Rule 2: AST-based splitting for large files
    try:
        splitter = CodeSplitter(
            language=self.get_language(file_path),
            chunk_lines=40,          # Semantic window size
            chunk_lines_overlap=15,
            max_chars=max_chars,
        )
        raw_chunks = splitter.split_text(content)
        return [
            Chunk(content=c, id=f"{self.commit_sha}_{i}")
            for i, c in enumerate(raw_chunks)
        ]
    except Exception:
        # Fallback to simple character splitting if the parser fails
        return self.simple_split(content)
Building the extractor is only step one. In a live environment like Umaku, this needs to happen asynchronously every time a developer pushes code.
We implemented an event-driven architecture. When a commit hits a tracked branch, a webhook triggers our Chunker API. The API processes the files and returns structured JSON objects that include not just the code, but the context metadata: commit_sha, author_id, file_path, and change_status (added/modified/deleted).
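As a sketch of that transformation layer, here is roughly how a push payload can be flattened into per-file events carrying the context metadata (simplified from the real service, which wraps this in a FastAPI route; the payload shape follows GitHub's standard push event):

```python
def commit_events_from_push(payload):
    """Flatten a GitHub push webhook payload into per-file events
    carrying the context metadata we index alongside each chunk."""
    events = []
    for commit in payload.get("commits", []):
        for status, files in (
            ("added", commit.get("added", [])),
            ("modified", commit.get("modified", [])),
            ("deleted", commit.get("removed", [])),
        ):
            for path in files:
                events.append({
                    "commit_sha": commit["id"],
                    "author_id": commit["author"]["email"],
                    "file_path": path,
                    "change_status": status,
                })
    return events
```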

Figure 2: The asynchronous ingestion pipeline. Commits are processed in real-time. The Chunker API (FastAPI) acts as the transformation layer, converting raw GitHub blobs into semantically rich chunks before they are embedded by our model fleet (OpenAI/Voyage) and indexed in Pinecone.
Notice that we don’t just store the text. We store the author_id. This is the foundational layer for the “People” aspect of our search engine. By indexing the committer_author_id alongside the code vector, we prepare the system for complex queries like “Show me the changes Ahmed made to the payment service last week.”
In Part 1, we detailed how we built the GitHub Commit Chunker API to solve the “Garbage In, Garbage Out” problem. We now have a clean, structured stream of code chunks—where functions are kept intact, Jupyter Notebooks are parsed correctly, and every line of code is traceable to a specific commit.
But a clean JSON object is not a search engine. To allow our AI Agents to answer questions like “How does the payment retry logic work?” or “Who introduced the latency in the last sprint?”, we need to translate these code blocks into a format the machine understands: Vectors.
In Part 2, we explore our indexing strategy. We will cover why we use specialized embedding models for code, how we utilize Pinecone Serverless for scale, and—most importantly—how we built the People-Code Graph to link anonymous Git commits to real Umaku user profiles.
Standard NLP embeddings (like the older BERT models) are great for English text but struggle with the syntax of Python or TypeScript. Code relies heavily on structure, variable naming conventions, and logic flow, which doesn’t always map 1:1 to natural language.
At Umaku, we adopted a multi-model strategy to ensure high-fidelity retrieval, evaluating several leaders in the space before settling on a fleet of code-capable embedding models from OpenAI and Voyage.
By generating embeddings using models optimized for code, we ensure that a query for “Database Connection” retrieves PostgresClient.connect() rather than just text files containing the word “database.”
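Generating the embeddings themselves is a thin wrapper around the provider's API. A hedged sketch, assuming an OpenAI-style client; the model name here is illustrative, and the real fleet swaps providers based on evaluation:

```python
def embed_code_chunks(texts, client, model="text-embedding-3-large"):
    """Embed a batch of code chunks with a code-capable model.

    `client` is an OpenAI-style client exposing
    client.embeddings.create(model=..., input=...); the model name
    is illustrative -- swap in whichever code-optimized model
    (OpenAI, Voyage, etc.) your evaluation favors.
    """
    response = client.embeddings.create(model=model, input=texts)
    # The API returns one embedding per input, in input order
    return [item.embedding for item in response.data]
```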
This is where standard RAG pipelines stop, and where Umaku begins.
In a standard repository, code is owned by a committer_author_id. This is often a shorthand (e.g., ahmed-dev) or a personal email (ahmed@gmail.com). However, in the Umaku platform, “Ahmed” is a Project Manager with a specific role, a history of completed sprints, and a specific set of skills.
If a user asks the chatbot, “How is Ahmed performing on the backend tasks?”, the system cannot answer if it only knows about ahmed-dev.
We solved this by building an Identity Resolution Layer that sits before the vector database.
When our Ingestion Engine processes a commit, it doesn’t just embed the code. It performs a lookup against the Umaku User Database. It attempts to map the Git email/handle to a Umaku Profile ID.
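A minimal sketch of that lookup, with user_directory standing in for the Umaku User Database (the real version is a database query; the sentinel value for unknown identities is illustrative):

```python
def resolve_umaku_identity(git_author_email, git_handle, user_directory):
    """Map a raw Git identity to a Umaku profile ID.

    `user_directory` stands in for the Umaku User Database: a mapping
    from known emails/handles to profile IDs.
    """
    # Prefer the committer email; fall back to the GitHub handle
    for key in (git_author_email.lower(), git_handle.lower()):
        if key in user_directory:
            return user_directory[key]
    # Unresolved identities are indexed with a sentinel so the code
    # is still searchable, just not attributable to a profile
    return "USR-UNRESOLVED"
```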

Figure 3: The Umaku Identity Graph. We don’t just index code; we index the relationships between developers, their commits, and their project management tickets. This graph allows the system to traverse from a natural language query about a person (“Ahmed”) to their specific digital exhaust (Git Identity) and finally to the code artifacts they authored.
This mapping allows us to “hydrate” our vector metadata. Instead of a blind vector, our payload to the database looks like this:
{
  "id": "commit_7f3a1_0",
  "values": [0.02, -0.15, 0.88, ...],
  "metadata": {
    "file_path": "backend/auth/login.py",
    "language": "python",
    "commit_sha": "7f3a1...",
    "git_author": "ahmed-dev",
    "Umaku_user_id": "USR-8821-X",  // <--- The Critical Link
    "sprint_id": "SPRINT-24",
    "repo_context": "payment-service"
  }
}
By embedding this Umaku User ID directly into the vector metadata, we unlock powerful filtering capabilities. The AI Agent can now scope a semantic query to a single person's work: a question like "Show me the changes Ahmed made to the payment service last week" becomes a vector search filtered on Umaku_user_id, repo_context, and sprint_id.
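Concretely, such a scoped query can be expressed as a Pinecone metadata filter. A sketch, with build_people_filter as an illustrative helper (Pinecone's filter syntax uses MongoDB-style operators like $eq and $and):

```python
def build_people_filter(umaku_user_id, sprint_id=None, repo=None):
    """Compose a Pinecone metadata filter scoping search to one
    person's commits, optionally narrowed to a sprint and repo."""
    clauses = [{"Umaku_user_id": {"$eq": umaku_user_id}}]
    if sprint_id:
        clauses.append({"sprint_id": {"$eq": sprint_id}})
    if repo:
        clauses.append({"repo_context": {"$eq": repo}})
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

# Usage against a Pinecone index (illustrative):
# index.query(
#     vector=query_embedding,
#     filter=build_people_filter("USR-8821-X", sprint_id="SPRINT-24"),
#     top_k=5,
#     include_metadata=True,
# )
```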
The final indexing workflow combines the Chunker API (from Part 1) with this metadata enrichment before each chunk is embedded and upserted into Pinecone.
In Part 1, we built a “smart” ingestion engine to chunk code while preserving file integrity. In Part 2, we enriched those chunks with “People” metadata and indexed them in Pinecone.
Now comes the hardest part: Retrieval.
The reality of RAG (Retrieval-Augmented Generation) is that users are terrible at searching. A developer won’t type: “Select top 5 chunks from namespace backend-repo where cosine_similarity(embedding, ‘AuthService’) > 0.8.”
Instead, they type: “Why is the login broken?”
If we feed that raw string into a vector database, we get noise. The word “broken” rarely appears in code (hopefully), and “login” is too generic. To bridge the gap between vague human intent and precise code execution, we engineered a multi-stage Advanced Retrieval Architecture.
In Part 3, we break down the “Brain” of Umaku: Query Rewriting, Hallucinated Code (HyDE), and our Reranking pipeline.
Before we search, we need to know what we are looking for. We utilize a lightweight routing agent that classifies the user's prompt into one of three domains: Code (technical questions about the repositories), People (team membership, ownership, and performance), or Project Management (sprints, tickets, and board status).
This routing step prevents us from wasting compute on semantic code search when the user just wants to know who is on the team.
Once the router identifies a code-related query (e.g., “Why is the login broken?”), we don’t just search for it. We transform it.
We employ two specific techniques to expand the search space:
We use an LLM call to expand the query into domain-specific terms. The rewriter analyzes the project stack (known from Umaku metadata) and adds synonyms.
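A sketch of the prompt this rewriter might use; the exact wording and the project_stack parameter are illustrative:

```python
def build_rewrite_prompt(user_query, project_stack):
    """Compose the LLM prompt used to expand a vague query with
    stack-specific synonyms. The wording here is illustrative."""
    stack = ", ".join(project_stack)
    return (
        "You are a code-search query rewriter.\n"
        f"The project uses: {stack}.\n"
        f"User query: {user_query!r}\n"
        "Rewrite the query as 3-5 short search strings using concrete "
        "identifiers, library names, and error terms a developer on "
        "this stack would use. Return one per line."
    )
```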
This is our secret weapon for code search. HyDE assumes that a good way to find a code snippet is to write a fake code snippet that looks like it.
When a user asks “How do we handle PDF parsing?”, the HyDE agent generates a hallucinated, dummy Python function using standard libraries (e.g., PyPDF2). We then embed this fake function and search our vector database for real code that looks mathematically similar to the fake code.
This effectively bypasses the language barrier. We aren’t matching English to Python; we are matching (Fake) Python to (Real) Python.
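The HyDE flow can be sketched as a small pipeline with injected model calls; generate_fn, embed_fn, and search_fn are placeholders for whichever LLM, embedder, and vector store you wire in:

```python
def hyde_search(question, generate_fn, embed_fn, search_fn, top_k=10):
    """Hypothetical Document Embeddings (HyDE) for code search.

    generate_fn: LLM call returning a plausible *fake* code snippet
    embed_fn:    embeds text into a vector
    search_fn:   queries the vector DB with a vector
    All three are injected so the pipeline stays model-agnostic.
    """
    fake_snippet = generate_fn(
        f"Write a short, plausible Python function answering: {question}"
    )
    # Embed the hallucinated code, not the English question:
    # (Fake) Python is mathematically closer to (Real) Python.
    return search_fn(embed_fn(fake_snippet), top_k=top_k)
```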

Figure 4: Moving beyond simple vector search. Our retrieval pipeline utilizes Query Expansion, Hypothetical Document Embeddings (HyDE), and Cross-Encoder Reranking to ensure the LLM receives only the most relevant context. Note how “Hybrid Search” combines dense vectors (concept matching) with sparse vectors (exact keyword matching) to catch specific variable names.
We now have a set of robust queries. We execute them against Pinecone using Hybrid Search (Dense Vectors + Sparse BM25). This gives us a broad net—perhaps 50 candidate chunks.
However, LLM context windows are precious. We can’t feed 50 chunks to the model; it will get “lost in the middle.”
We introduce a Cross-Encoder Reranker (we utilize Cohere Rerank v3 or BGE-Reranker-M3) at the end of the pipeline.
The Reranker takes the user’s original question and the 50 candidate chunks, and it scores them one by one based on strict relevance. It acts as a semantic filter, ruthlessly discarding false positives.
We reduce the top 50 down to the “Golden Top 5” before sending them to the generation model.
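The reranking stage reduces to a simple pattern once the model call is abstracted away. A sketch, with score_fn standing in for the actual cross-encoder (Cohere Rerank v3, BGE-Reranker-M3, or similar):

```python
def rerank_candidates(question, candidates, score_fn, keep=5):
    """Cross-encoder-style reranking: score each (question, chunk)
    pair jointly and keep only the highest-relevance chunks.

    `score_fn(question, chunk) -> float` abstracts the actual
    reranking model.
    """
    scored = [(score_fn(question, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]
```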
Retrieval is rarely linear. Sometimes, finding the code isn’t enough; you need to know the context around the code.
Umaku uses a Multi-Agent Orchestrator pattern. When a complex query arrives—“Did Ahmed fix the bug in the auth service that was reported last sprint?”—single-step retrieval fails.
The Orchestrator breaks this into sub-tasks: first resolve "Ahmed" to a Umaku profile ID, then pull the bug tickets reported against the auth service in the last sprint from the board, then search the vector index for Ahmed's commits touching that service, and finally merge the results into a single answer.
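In code, the delegation pattern looks roughly like this sketch, with the sub-agents injected as callables (the real router plans these steps with an LLM rather than hard-coding the sequence):

```python
def answer_composite_query(query, people_agent, pm_agent, code_agent):
    """Orchestrator pattern: decompose a composite question into
    sub-tasks and delegate each to a specialized agent."""
    user_id = people_agent(query)                 # "Ahmed" -> Umaku ID
    ticket = pm_agent(query, user_id)             # bug from the board
    commits = code_agent(query, user_id, ticket)  # matching code chunks
    # Merge the evidence into one structured answer for generation
    return {"user": user_id, "ticket": ticket, "evidence": commits}
```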

Figure 5: The Orchestrator Pattern. A central router dynamically delegates tasks to specialized sub-agents. This allows the system to answer composite questions that require data from both the Project Management platform (Umaku) and the codebase (GitHub), merging them into a single coherent answer.
In this part and the previous parts of this series, we built a sophisticated engine under the hood. We created a Smart Ingestion Pipeline (Part 1), a People-Code Knowledge Graph (Part 2), and an Advanced Retrieval Brain using HyDE and Rerankers (Part 3).
But a Ferrari engine is useless without a steering wheel.
For our developers at Omdena, “context switching” is the enemy. If a frontend engineer working in Cursor has to minimize their IDE, open the Umaku dashboard, and type a query to find out how the backend API works, we have failed.
In this final part, we reveal how we brought the Umaku Search Engine directly into the developer’s environment using the Model Context Protocol (MCP) and how we closed the loop with Agentic Sprint Reports.
We didn’t want to build a custom VS Code extension from scratch. Instead, we adopted the Model Context Protocol (MCP), an open standard that allows AI models (like Claude 3.5 Sonnet running inside Cursor) to talk to external data servers.
We built a lightweight Umaku MCP Server using Python. It exposes our retrieval pipeline as a set of “Tools” that Cursor can discover and invoke via JSON-RPC.
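To give a feel for the wire format, here is a minimal, hand-rolled JSON-RPC dispatch of the kind MCP tool calls ride on (illustrative only; the real server registers tools through the MCP Python SDK rather than this dict, and the tool name and response shape are assumptions):

```python
import json

# Illustrative tool registry; the real server exposes our retrieval
# pipeline here instead of this stub.
TOOLS = {
    "umaku_code_search": lambda args: {
        "chunks": [f"results for: {args['query']}"]
    },
}

def handle_jsonrpc(raw_request):
    """Minimal JSON-RPC 2.0 handler sketching an MCP tools/call."""
    req = json.loads(raw_request)
    tool = TOOLS.get(req["params"]["name"])
    if tool is None:
        return json.dumps({"jsonrpc": "2.0", "id": req["id"],
                           "error": {"code": -32601, "message": "Unknown tool"}})
    result = tool(req["params"]["arguments"])
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
```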
When a developer in Cursor types:
@Umaku fetch the interface for the payment user profile
The following handshake occurs invisibly:

Figure 6: Bridging the gap with MCP. The Model Context Protocol acts as a universal translator. Developers working in Cursor on the frontend can transparently query the Umaku backend codebase. The diagram illustrates the JSON-RPC flow where the IDE (Client) requests context, and our Umaku Service acts as the Server, delivering “hydrated” code snippets without the user ever leaving their editor.
This integration solves the “Blind Frontend” problem. A React Native developer usually treats the backend as a black box. With Umaku MCP, they can highlight a function call in their editor and ask: “What data shape does this endpoint expect?”
Umaku retrieves the actual Pydantic model from the backend repo and inserts it into their chat context. No Swagger docs needed. No Slack messages sent.
Retrieval is reactive. We wanted Umaku to be proactive.
Every two weeks, a sprint ends. Traditionally, a Project Manager spends hours compiling a report: What was done? What was descoped? Who overperformed?
We built a fleet of Observer Agents that run automatically when a sprint is marked “Closed” in Umaku.
These agents generate a structured Sprint Retrospective PDF that is delivered to the PM. This isn’t just a summary; it’s a graded assessment.

Figure 7: The Continuous Improvement Loop. AI Agents act as autonomous auditors. At the end of every sprint, they analyze the “Chunked Codebase” and “Board Status” to generate actionable insights. This feedback is fed into the next sprint’s planning session, ensuring that technical debt is identified before it becomes a crisis.
We started this journey with a simple goal: make code searchable. We ended up building a system that understands the life of a project.
By moving Beyond Keywords, we enabled our AI to understand that code isn’t just text—it’s logic.
By indexing People, we acknowledged that software is a human endeavor.
And by integrating with MCP, we met our developers where they live.
At Omdena, this architecture has reduced our "Time to Resolution" for technical questions by 40%. New team members no longer feel lost; they have a context-aware mentor available 24/7.
Umaku isn’t just managing our projects anymore. It’s understanding them.
Interested in building agentic workflows? Join our community at Omdena.