RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository

Anonymous Authors
Under Review

Overview of RepoGenesis. RepoGenesis is the first benchmark for evaluating repository-level microservice generation from natural language requirements. It includes 106 diverse web microservice repositories across 11 frameworks and 18 application domains, evaluating LLMs with multi-dimensional metrics including Pass@1, API Coverage (AC), and Deployment Success Rate (DSR).

Abstract

Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation. However, existing benchmarks primarily focus on function-level or class-level code generation, leaving a significant gap in evaluating LLMs' ability to generate complete, deployable software repositories. We introduce RepoGenesis, the first multilingual benchmark for repository-level end-to-end web microservice generation. RepoGenesis consists of 106 diverse web microservice repositories (60 Python, 46 Java) spanning 11 frameworks and 18 application domains. Unlike traditional benchmarks, RepoGenesis assesses LLMs' capability to generate complete repositories from natural language requirements, including project structure, dependencies, configurations, and implementation. We evaluate multiple LLM-based agents and commercial IDEs using three key metrics: Pass@1 for functional correctness, API Coverage (AC) for implementation completeness, and Deployment Success Rate (DSR) for deployability. Our comprehensive evaluation reveals significant challenges in repository-level code generation and provides insights into the current state and future directions of automated software development.

Key Features

  • 106 diverse web microservice repositories (60 Python, 46 Java)
  • 11 frameworks including Django, FastAPI, Javalin, Spring Boot, and more
  • 18 application domains covering authentication, content management, gaming, file management, and more
  • Multi-dimensional metrics: Pass@1 for functional correctness, API Coverage (AC) for implementation completeness, and Deployment Success Rate (DSR) for deployability
  • Support for multiple agents and IDEs: MetaGPT, DeepCode, Qwen-Agent, MS-Agent, Cursor, and GitHub Copilot

Evaluation Metrics

Pass@1 (Functional Correctness)

Measures whether the generated repository passes all test cases on the first attempt. A repository achieves Pass@1 = 1.0 only if all test cases pass.
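The all-or-nothing scoring above can be sketched as follows; this is a minimal illustration, not the benchmark's actual harness, and assumes per-repository test outcomes are already available as booleans:

```python
def pass_at_1(test_results):
    """Pass@1 is all-or-nothing per repository: 1.0 only if every
    test case passes on the first generated attempt, else 0.0."""
    return 1.0 if test_results and all(test_results) else 0.0

def benchmark_pass_at_1(per_repo_results):
    """Average the per-repository Pass@1 scores across the benchmark."""
    scores = [pass_at_1(results) for results in per_repo_results]
    return sum(scores) / len(scores)
```

Note that a repository with 9 of 10 tests passing scores 0.0, not 0.9: partial credit within a repository is not awarded.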

API Coverage (AC)

Measures implementation completeness by checking if all required API endpoints are present. API endpoints are extracted from README specifications and validated in the generated code.
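A minimal sketch of this extract-and-match idea, assuming README specifications list endpoints in a `METHOD /path` style (the regex and function names are illustrative, not the benchmark's implementation):

```python
import re

# Illustrative: match lines like "GET /users" or "DELETE /users/{id}"
ENDPOINT_PATTERN = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\s+(/[\w/{}\-]*)")

def extract_endpoints(readme_text):
    """Collect (method, path) pairs from the README specification."""
    return set(ENDPOINT_PATTERN.findall(readme_text))

def api_coverage(required, implemented):
    """Fraction of required endpoints present in the generated code."""
    if not required:
        return 1.0
    return len(required & implemented) / len(required)
```

Unlike Pass@1, this metric gives partial credit: a repository implementing two of three required endpoints scores roughly 0.67 even if the third is missing.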

Deployment Success Rate (DSR)

Measures basic deployability by checking whether: (1) dependencies can be installed, (2) the service starts without errors, and (3) the health check endpoint responds.
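The three checks could be wired up as in the sketch below; the install command, startup step, and health URL are all assumptions for illustration, not the benchmark's actual deployment pipeline:

```python
import subprocess
import urllib.error
import urllib.request

def install_ok(cmd):
    """Step 1: run the dependency install command, e.g.
    ['pip', 'install', '-r', 'requirements.txt'] for Python
    or a Maven build for Java (command is illustrative)."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

def health_ok(url, timeout=5.0):
    """Step 3: probe the health check endpoint; any 2xx counts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def deployment_success(installed, started, healthy):
    """DSR counts a repository as deployable only if all three
    checks pass: install, error-free startup, and health probe."""
    return installed and started and healthy
```

All three checks must succeed; a repository whose dependencies install but whose service crashes on startup contributes 0 to DSR.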

BibTeX

@misc{peng2026repogenesisbenchmarkingendtoendmicroservice,
      title={RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository}, 
      author={Zhiyuan Peng and Xin Yin and Pu Zhao and Fangkai Yang and Lu Wang and Ran Jia and Xu Chen and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang},
      year={2026},
      eprint={2601.13943},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.13943}, 
}