Software development requires cooperation among people with diverse skills and typically includes design, coding, and testing stages. Past deep learning techniques often needed custom designs and were ineffective for software development's non-deterministic behavior. Multi-agent LLM-powered systems communicate through multi-turn dialogues.
In this project I develop a multi-agent system for software development using large language models (LLMs). The system is composed of three subsystems:
System designer and planner: responsible for dividing the development process into smaller subtasks or work-packages. Uses chain-of-thought and few-shot learning.
Coding system: The coding system is composed of an instructor and an assistant agent that communicate through multi-turn dialogues. Uses communicative dehallucination and repeated-instruction mechanism Communicative dehallucination mechanism which consists on two agents, one instructor and one assistant in which the coding assistant asks for clarifications before giving a final response.
Bugfixer system: The bugfixer system is an agent that combines Reflection, ReAct pattern, Chain-of-thought and Few-shot learning to fix bugs. Function calling allow for interaction with docker for running complete microservices stacks, reading, writing files and providing context.
Due to the stochastic answers provided by LLMs an auxiliary agent called information extractor is proposed. Consist of a multi-step ReAct agent capable of more precise extraction of pieces of code and/or text.
All agents have access to long-term database for inter-stage communication and use short-term memory for conversation history.
The system is implemented using multi-threading and publish-subscribe communication pattern for inter-agent communication.
The system uses metrics such as executability, consistency, completeness, and quality to evaluate the generated software code.
Evaluating the generated software is challenging. Typical function-oriented metrics like pass@k do not comprehensively evaluate entire software systems. I proposed three metrics:
This applies for one work-package or subpackage at a time. Multiple instances are spawned as required.
Makes use of 3 sub-systems:
Techniques used: