Usama Saleem

Software Engineer + UX Designer

Cutting Query Response Time by 75% with LLMs

During my time as an AI Research Engineer at the Data Driven Analysis Lab at Concordia University, I built an LLM-powered chatbot that lets users chat with their codebases. The project, now live at askgit.io, needed a complete rethink of how we ingested data. If it's not up anymore, the YouTube video is still around: Introducing AskGit

The Problem

Large codebases are painful to navigate. Finding patterns, understanding architecture, tracing how a function gets used across hundreds of repos — traditional search tools just don't cut it. They're pattern matching, not reasoning.

The Solution: A Smarter Ingestion Pipeline

Instead of indexing every file indiscriminately, I built a pipeline that parses ASTs to pull out meaningful chunks, groups related code so context stays intact, and stores vector embeddings that capture actual semantic meaning.
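As a rough illustration of AST-based chunking (not the original AskGit code), the sketch below uses Python's standard `ast` module to split a module into one chunk per top-level function or class. The `CodeChunk` type, the `extract_chunks` helper, and the sample source are hypothetical names chosen for this example. Keeping a class as a single chunk is one simple way to keep related methods together.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class CodeChunk:
    """Hypothetical container for one unit of code to embed."""
    name: str        # function or class name
    kind: str        # "function" or "class"
    start_line: int  # 1-based line where the definition starts
    end_line: int    # 1-based line where the definition ends
    source: str      # exact source text of the definition


def extract_chunks(source: str) -> List[CodeChunk]:
    """Return one chunk per top-level function or class in a Python module."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # skip imports, assignments, and other top-level statements
        segment = ast.get_source_segment(source, node) or ""
        chunks.append(
            CodeChunk(node.name, kind, node.lineno, node.end_lineno, segment)
        )
    return chunks


# Sample input, made up for this example.
SAMPLE = '''
import os

def load(path):
    return open(path).read()

class Repo:
    def files(self):
        return os.listdir(".")
'''

chunks = extract_chunks(SAMPLE)
```

Each chunk carries its name and line range, which can later be stored as metadata alongside the embedding.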

The impact: query response time dropped 75%, ingestion got 95% faster, and the system handled 1,000+ repositories at once.

The Multi-Model Routing Architecture

One part of the system I'm particularly proud of was the query routing layer — though at the time I didn't think of it as "agents." I was just trying to cut costs and latency.

Here's how it worked: I ran three separate agents in sequence, each with a distinct role. The first was a question-clarifying agent (davinci-001, a smaller and faster model than GPT-4) that took the user's raw question and expanded it with context pulled from the codebase metadata, disambiguating terms and reformulating the question for better retrieval.

The second was a codebase-crawling agent: a slightly larger model fine-tuned to work with our Pinecone vector DB and find the most relevant code chunks. It was scoped specifically to this task, and the fine-tuning made it very good at knowing where to look.
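The retrieval step behind an agent like this boils down to nearest-neighbor search over embeddings. The sketch below shows that idea with a plain in-memory dictionary and cosine similarity. It is a stand-in for illustration only: it does not use the Pinecone client, and the `top_k` helper, chunk IDs, and three-dimensional vectors are all made up.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k chunk IDs most similar to the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: chunk ID -> embedding (hypothetical values).
INDEX = {
    "auth.login": [0.9, 0.1, 0.0],
    "auth.logout": [0.8, 0.2, 0.1],
    "billing.invoice": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], INDEX, k=2)
```

A hosted vector database replaces the linear scan with an approximate index, but the contract is the same: a query vector goes in, ranked chunk IDs come out.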

The third was a synthesis agent that took the retrieved chunks and generated the answer. This model had a known context window limit, so part of its job was working within that constraint: summarizing, truncating, and restructuring information to fit what it could hold.
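One simple way to respect a context limit is to keep ranked chunks until a token budget runs out and truncate the last one that only partly fits. The sketch below shows that approach under loud assumptions: tokens are approximated by whitespace-separated words (a real system would use the model's tokenizer), and `fit_to_budget` is a hypothetical helper, not the method the original system used.

```python
from typing import List


def approx_tokens(text: str) -> int:
    """Crude token estimate: count whitespace-separated words (assumption)."""
    return len(text.split())


def fit_to_budget(chunks: List[str], budget: int) -> List[str]:
    """Keep chunks in ranked order until the budget is spent.

    The first chunk that does not fully fit is truncated to the remaining
    budget, and everything after it is dropped. Truncation rejoins words
    with single spaces, so original whitespace in that chunk is lost.
    """
    kept = []
    remaining = budget
    for chunk in chunks:
        if remaining <= 0:
            break
        words = chunk.split()
        if len(words) <= remaining:
            kept.append(chunk)
            remaining -= len(words)
        else:
            kept.append(" ".join(words[:remaining]))
            remaining = 0
    return kept


# Chunks are assumed to arrive sorted by relevance, best first.
fitted = fit_to_budget(["a b c", "d e f g", "h i"], budget=5)
```

Truncating from the end favors the highest-ranked chunks; summarizing instead of truncating would preserve more information at the cost of an extra model call.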

A complexity rating tied them all together. Before synthesis, the system scored how complex the response needed to be. If the score came in above 6 out of 10, the system handed off to GPT-4 to finish the answer. Everything else was resolved by the smaller models.
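The routing rule itself is small enough to sketch. The threshold of 6 comes from the description above; everything else here is an assumption for illustration, including the `pick_model` name, the `"small-model"` label, the 0 to 10 range check, and the choice to keep a score of exactly 6 on the smaller models. How the score was computed is not covered, so the score is taken as an input.

```python
COMPLEXITY_THRESHOLD = 6  # scores above this are handed off to GPT-4


def pick_model(complexity_score: int) -> str:
    """Choose which model finishes the answer for a given complexity score."""
    if not 0 <= complexity_score <= 10:
        raise ValueError("complexity score must be between 0 and 10")
    if complexity_score > COMPLEXITY_THRESHOLD:
        return "gpt-4"
    return "small-model"  # hypothetical label for the cheaper models


routes = {score: pick_model(score) for score in (3, 6, 7, 10)}
```

Keeping the threshold in one named constant makes it easy to tune the cost and quality trade-off without touching the rest of the pipeline.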

This system of specialized agents (question clarifying, codebase crawling, synthesis), each with distinct capabilities and context limits, is essentially what people now call "agents." I was building it before the term was in wide circulation. When the agentic AI wave hit a couple of years later, I looked back at the architecture and realized I'd already been doing it.

The lesson? Good architecture often anticipates where the field is going, even if you don't have the vocabulary for it yet.

Presenting at FM+SE in Mexico City

The ingestion work earned me a presentation slot at the FM+SE conference in Mexico City. The audience voted it into the top 5%, which led to a tech partnership afterward.

What I Learned

The model is only as good as what's feeding it. Vector databases like Pinecone helped, but the real gains came from how we structured and chunked the code before embedding it. Chunk strategy, metadata choices, and update handling — those are the things I'd spend more time on upfront if I did it again.