Democratizing Repository Automation with Cost-Efficient Open-Source LLMs
TL;DR:
Large language models (LLMs) have shown proficiency in several software engineering tasks such as code summarization, test case generation, and code review automation. However, most popular benchmarks focus on limited contexts, often a single file or method, while repository-level challenges introduce new complexities. Managing extensive codebases means dealing with input context limits and maintaining a clear understanding of how interrelated files and components work together. Moreover, the most effective current solutions rely on closed-source LLMs that can incur significant recurring costs, raise privacy concerns when proprietary data is shared via APIs, and consume substantial energy, all of which inhibit broader industrial adoption. To tackle these issues, we investigate smaller, open-source LLMs, developing approaches that reduce the search space needed to pinpoint relevant context within a repository. We explore agentic workflows that decompose complex tasks into manageable parts, enabling specialized LLM-based agents to handle multi-turn interactions and call tools efficiently. Alongside this, we investigate and mitigate challenges unique to smaller models, such as ensuring reliable instruction following and generating structured outputs, areas that remain underexplored but are critical for practical deployment.
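To make this concrete, the sketch below shows the skeleton of such an agentic workflow in Python: the model is asked for a structured JSON action, the action is validated and dispatched to simple repository tools, and the observation is fed back for the next turn. This is a minimal illustration rather than our implementation; the model call is stubbed out, and the tool names (`search_repo`, `read_file`) and loop structure are assumptions made for the example.

```python
import json
from pathlib import Path

# Hypothetical repository tools the agent may call.
def search_repo(root: str, query: str) -> list[str]:
    """Return paths of Python files whose text contains the query string."""
    return [str(p) for p in Path(root).rglob("*.py")
            if query in p.read_text(errors="ignore")]

def read_file(path: str) -> str:
    """Return a prefix of a file's contents, respecting context limits."""
    return Path(path).read_text(errors="ignore")[:2000]

TOOLS = {"search_repo": search_repo, "read_file": read_file}

def call_llm(messages: list[dict]) -> str:
    """Stub standing in for a small open-source LLM.
    A real model call would go here; the sketch returns a canned action."""
    return json.dumps({"tool": "search_repo",
                       "args": {"root": ".", "query": "def main"}})

def run_agent(task: str, max_turns: int = 5) -> list[dict]:
    """Multi-turn loop: request a JSON action, validate it, run the tool,
    and feed the observation back. Malformed actions trigger a retry
    prompt, one of the failure modes that matters for smaller models."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        raw = call_llm(messages)
        try:
            action = json.loads(raw)           # structured-output check
            tool = TOOLS[action["tool"]]
            observation = tool(**action["args"])
        except (json.JSONDecodeError, KeyError, TypeError) as err:
            messages.append({"role": "user",
                             "content": f"Invalid action ({err}); reply with JSON."})
            continue
        messages.append({"role": "tool", "content": str(observation)})
    return messages

if __name__ == "__main__":
    transcript = run_agent("Find the entry point of this project.")
    print(transcript[-1]["content"][:200])
```

A production loop would add a termination action and schema validation or constrained decoding, but the fragile step is visible even here: everything hinges on the model reliably emitting parseable, well-formed actions, which is precisely where smaller open models tend to struggle.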
Research Focus:
How do different code representations impact the accuracy of retrieval techniques in long-context understanding? (A toy sketch contrasting two representations follows this list.)
How do different code search strategies influence smaller LLMs’ ability to find information in large contexts?
How does integrating agentic workflows and external software engineering tools affect the performance and context understanding of smaller open-source LLMs in repository-level tasks?
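As a toy illustration of the first two questions (our own sketch, not a method from the project), the following Python code ranks repository files against a query under one simple search strategy, token-overlap scoring, while swapping between two code representations: the raw source text versus a compact view of signatures and docstrings. All names here are hypothetical, and the Jaccard scorer stands in for real sparse or dense retrievers.

```python
import ast
import re
from pathlib import Path

def tokens(text: str) -> set[str]:
    """Lowercased identifier tokens, splitting camelCase and snake_case."""
    pieces = []
    for part in re.findall(r"[A-Za-z]+", text):
        pieces += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part)
    return {t.lower() for t in pieces}

def raw_repr(source: str) -> str:
    """Representation 1: the raw file text."""
    return source

def signature_repr(source: str) -> str:
    """Representation 2: only function/class names and docstrings,
    a much smaller surface for the retriever to match against."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return ""
    parts = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            parts.append(node.name)
            doc = ast.get_docstring(node)
            if doc:
                parts.append(doc)
    return " ".join(parts)

def rank(root: str, query: str, represent) -> list[tuple[float, str]]:
    """Score every Python file by Jaccard overlap between the query tokens
    and the tokens of the chosen representation."""
    q = tokens(query)
    scored = []
    for path in Path(root).rglob("*.py"):
        rep = tokens(represent(path.read_text(errors="ignore")))
        if rep:
            scored.append((len(q & rep) / len(q | rep), str(path)))
    return sorted(scored, reverse=True)

# Compare how the two representations rank files for the same query.
if __name__ == "__main__":
    for represent in (raw_repr, signature_repr):
        print(represent.__name__, rank(".", "parse configuration file", represent)[:3])
```

Swapping `raw_repr` for `signature_repr` changes what the retriever can match on without touching the search strategy itself, which is exactly the kind of controlled comparison these questions call for.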