CRAB: Code Review Automated Benchmark

CRAB (Code Review Automated Benchmark) is a high-quality dataset and extraction pipeline designed to evaluate automated code-review tools on two complementary tasks:

  1. Review Comment Generation: given a code snapshot before review, generate natural-language comments that emulate human reviewers.
  2. Code Refinement (Revised Code Generation): given the same snapshot plus a reviewer's comment, generate the revised code implementing that feedback.

CRAB focuses on Java projects, rigorously curating pull-request “triplets” of

  • submitted_code (pre-review code)
  • reviewer_comment (validated natural-language feedback, with paraphrases)
  • revised_code (post-review implementation, validated via tests)
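
To make the shape concrete, one triplet could look like the following (a hypothetical illustration expressed as a Python literal; the real field names in the JSON produced by the pipeline may differ):

```python
# Hypothetical CRAB triplet; field names are illustrative, not the pipeline's schema.
triplet = {
    "submitted_code": "public int div(int a, int b) { return a / b; }",
    "reviewer_comment": "Guard against division by zero here.",
    "revised_code": (
        "public int div(int a, int b) {\n"
        "    if (b == 0) throw new IllegalArgumentException(\"b must not be 0\");\n"
        "    return a / b;\n"
        "}"
    ),
}
```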

Features

  • Automated Extraction Pipeline (pull_requests.py)

    • Clones GitHub repositories, locates PRs with a single review comment, and extracts diffs before/after the comment
    • Builds and tests each snapshot in Docker (Maven & Gradle support)
    • Generates JaCoCo coverage reports to ensure revised code covers the commented lines
  • Manual Validation Tools (manual_selection.py)

    • Interactive review to mark whether comments suggest changes and whether post-comment diffs address them
  • Serialization & Task Extraction (dataset.py, extract_correct_predictions.py)

    • Produces JSON datasets for:

      • Full (all validated triplets)
      • Comment Generation
      • Code Refinement
      • Web App export format
  • Utility Modules

    • handlers.py: abstract and concrete build/test handlers (Maven, Gradle)
    • utils.py: Git/GitHub helpers, BLEU-based paraphrase filtering (sketched below), logging
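
For illustration, the paraphrase redundancy check can be pictured with NLTK's sentence-level BLEU. This is a minimal sketch, assuming a simple whitespace tokenizer and an arbitrary threshold; the exact logic lives in utils.py and may differ:

```python
# Minimal sketch of BLEU-based paraphrase filtering (threshold and
# tokenization are assumptions; see utils.py for the real logic).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def is_redundant(original: str, paraphrase: str, threshold: float = 0.8) -> bool:
    """Treat a paraphrase as redundant if it is near-identical to the original."""
    reference = [original.split()]
    candidate = paraphrase.split()
    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    return score >= threshold

comment = "Please rename this variable for clarity."
paraphrase = "Please rename this variable for clarity!"
print(is_redundant(comment, paraphrase))  # near-duplicates score high
```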

Installation

  1. Clone this repository

    git clone https://github.com/your-org/crab
    cd crab
    
  2. Install Python dependencies

    pip install -r requirements.txt
    

    The pipeline depends on:

    • pandas, tqdm, docker, beautifulsoup4, unidiff, PyGithub, javalang
  3. Docker images

    • Build or pull the two images used by the handlers:

      • crab-maven (for Maven projects)
      • crab-gradle (for Gradle projects)
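
A minimal sketch of how the images might be prepared, assuming the Dockerfiles ship in this repository (the file paths below are illustrative):

```sh
# Illustrative only: adjust Dockerfile paths to the repository's actual layout.
docker build -t crab-maven  -f docker/maven.Dockerfile .
docker build -t crab-gradle -f docker/gradle.Dockerfile .
```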

Usage

1. Generate the dataset

Run the extraction script to generate the CRAB dataset triplets:

```sh
python pull_requests.py [CSV_FILE] [options]
```

  • CSV_FILE: Path to the input CSV listing repositories (output of clone_repos.py).

Options

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| CSV_FILE | string | — | Yes | The CSV file containing the list of GitHub repos to process. |
| -o, --output | string | ./dataset.json | No | Path where the resulting JSON dataset will be saved. |
| -r, --repos | string | ./results/ | No | Directory under which repos will be (or already are) cloned. |
| -c, --cache | string | None | No | Path to a previous run's JSON output to resume from (caches processed PRs). |
| -a, --archive-destination | string | ./dataset/archives | No | Directory where per-PR archives (tar.gz) will be stored. |
| -s, --sort-by | string | None | No | Column name in the CSV by which to sort repos before processing. |
| --only-repo | string | None | No | Process only the specified repo (format: owner/name), ignoring all others in the CSV. |
| --cache-requests | flag | false | No | If set, caches GitHub API requests (using requests_cache) to speed up reruns at the risk of stale data. |
| --max-workers | integer | None (single-threaded) | No | Number of parallel workers for processing repos. If omitted, the script runs in a single thread. |

**Example**

```sh
python pull_requests.py my_repos.csv \
  --output=data/triplets.json \
  --repos=./cloned_repos/ \
  --archive-destination=./archives/ \
  --cache-requests \
  --max-workers=4
```

This will:

  1. Read my_repos.csv for the list of GitHub repositories.
  2. Clone any missing repos under ./cloned_repos/.
  3. Process each pull request, archiving the base and merged states under ./archives/.
  4. Save the combined dataset to data/triplets.json.
  5. Cache GitHub API calls for faster subsequent runs.
  6. Use 4 parallel workers to speed up processing.
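
Since long runs can be interrupted, the -c/--cache flag documented above lets a rerun resume from a previous output instead of reprocessing every PR, e.g.:

```sh
# Resume a previous run, skipping PRs already recorded in data/triplets.json
python pull_requests.py my_repos.csv \
  --output=data/triplets.json \
  --cache=data/triplets.json
```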

2. Run manual validation

Run the manual selection script to validate or refine your dataset entries:

```sh
python manual_selection.py [DATASET_FILE] -o OUTPUT [options]
```

  • DATASET_FILE: Path to the input JSON dataset (e.g. the output of pull_requests.py).
  • -o, --output: Path where the updated dataset JSON will be saved.

Options

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| DATASET_FILE | string | — | Yes | Path to the dataset JSON file to process. |
| -o, --output | string | — | Yes | Path where the resulting dataset (after manual selection/refinement) will be written. |
| --overwrite | flag | false | No | If set, re-evaluates and overwrites any existing Selection entries in the dataset. |
| -m, --mode | ValidationMode enum | comment | No | Validation mode: comment only checks whether comments suggest a change; refinement also checks whether the post-comment diffs implement them. |
| --check-diff-relevance | flag | false | No | If set (only in refinement mode), first asks whether each diff is related to the comment before prompting for refinement. |
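
**Example**

A typical refinement-mode session, using only the documented flags (paths are illustrative):

```sh
python manual_selection.py data/triplets.json \
  -o data/validated.json \
  -m refinement \
  --check-diff-relevance
```

For each entry, this first asks whether the diff is related to the comment, then whether it implements the suggested change.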

3. Serialize to JSON for modeling

Load and process a dataset JSON, optionally add paraphrases, and serialize it in various formats:

```sh
python dataset.py [FILENAME] [options]
```

  • FILENAME: Path to the input JSON file to load (e.g., the output of a previous run).

Options

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| FILENAME | string | — | Yes | Path to the dataset JSON file to load. |
| -o, --output | string | output.json | No | Path where the processed dataset (or archive) will be saved. |
| -p, --paraphrases | string | None | No | CSV file containing generated paraphrases. Must include a paraphrases column with lines of the form Paraphrase#N: <text>. When provided, each paraphrase is scored and (optionally) appended to its comment. |
| -t, --output_type | OutputType enum | full | No | Type of output to generate: full dumps the entire dataset as JSON; comment_gen dumps only entries whose comments suggest changes, as a ZIP of JSON files (with _with_context or _no_context); code_refinement dumps entries that are both covered and addressed, as a ZIP; webapp dumps minimal fields for the web app. |
| -a, --archives | string | None | No | Root directory where per-PR archives (tar.gz) live. Relevant only for comment_gen or code_refinement outputs; archives are bundled into the ZIP under context/. |
| --remove-non-suggesting | flag | false | No | When the output type is full, drop entries whose comments do not suggest a change. |

Examples

Basic full dump:

```sh
python dataset.py data/raw_dataset.json
```

Add paraphrases and override the default output path:

```sh
python dataset.py data/raw_dataset.json \
  -o data/with_paraphrases.json \
  -p paraphrases.csv
```
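
Based on the format described above, paraphrases.csv is expected to contain a paraphrases column whose cells hold Paraphrase#N: <text> lines. An illustrative (hypothetical) row; how rows are matched to comments is not shown here:

```
paraphrases
"Paraphrase#1: Please guard against a division by zero here.
Paraphrase#2: This will crash when the divisor is zero; add a check."
```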

Generate a ZIP for code-refinement with context archives:

```sh
python dataset.py data/raw_dataset.json \
  -o outputs/code_refinement.zip \
  -t code_refinement \
  -a ./archives/
```

This will:

  1. Load data/raw_dataset.json into memory.
  2. If -p paraphrases.csv is given, read paraphrases, score them, and append non-redundant ones to each comment.
  3. Serialize entries according to --output_type.
  4. Bundle required archives (if any) into the resulting ZIP or write JSON to the specified --output.

4. Extract “ground truth” references

Run the script to extract “exact prediction” JSONs for comment-generation, code-refinement, or paraphrase tasks:

```sh
python extract_correct_predictions.py DATASET_JSON [options]
```

  • DATASET_JSON: Path to the input dataset JSON file.

Options

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| DATASET_JSON | string | — | Yes | Path to the dataset JSON to process. |
| -o, --output | string | exact_predictions_<type>.json | No | Path for the output JSON file. If omitted, defaults to exact_predictions_<output-type>.json. |
| -a, --archives | string | — | Only for code_refinement | Directory where per-PR tar.gz archives live. Required when --output-type=code_refinement so merged file contents can be extracted. |
| -t, --output-type | OutputType enum | comment_gen | No | Which extraction to perform: comment_gen pulls file, location, and body for commenting tasks; code_refinement extracts post-merge file contents for code tasks; paraphrases dumps comments plus before-PR files for paraphrase creation. |

OutputType Values

| Name | Value | Meaning |
|---|---|---|
| COMMENT_GEN | comment_gen | Extracts predicted comment locations & bodies to feed a comment-generation model. |
| CODE_REFINEMENT | code_refinement | Extracts merged file snapshots for entries that both cover and address changes, to feed a refinement model. |
| FOR_PARAPHRASES | paraphrases | Extracts original comments plus “before-PR” file contents for paraphrase generation. |

Examples

1. Default comment-generation extraction

```sh
python extract_correct_predictions.py data/dataset.json \
  -o predictions_comment.json
```

This reads data/dataset.json and writes all entries whose comments suggest changes to predictions_comment.json.


2. Code-refinement extraction

```sh
python extract_correct_predictions.py data/dataset.json \
  --output refined_files.json \
  --output-type code_refinement \
  --archives ./archives/
```

This will locate each merged PR archive under ./archives/, extract the post-merge file contents for entries that both cover and address changes, and save them to refined_files.json.


3. Paraphrase data extraction

```sh
python extract_correct_predictions.py data/dataset.json \
  -t paraphrases \
  -o comments_for_para.json
```

This dumps comment bodies plus “before-PR” file snapshots for all entries suggesting changes, suitable for paraphrase modeling.


Contributing

  1. Issue Tracker: Please file issues for bugs or feature requests.
  2. Pull Requests: Fork, create a topic branch, and submit a PR. Please include tests or validations where applicable.
  3. Extending Build Support: To add a new build system (e.g., Ant, Bazel), subclass BuildHandler in handlers.py and provide the build/test commands and container image (see the sketch below).
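
A hypothetical sketch of such an extension (the method names below are illustrative; follow the actual abstract interface declared in handlers.py):

```python
# Hypothetical Ant support; the real BuildHandler interface in handlers.py
# may declare different abstract methods.
from handlers import BuildHandler

class AntHandler(BuildHandler):
    def container_image(self) -> str:
        # Docker image with Ant preinstalled, analogous to crab-maven/crab-gradle
        return "crab-ant"

    def compile_cmd(self) -> str:
        return "ant compile"

    def test_cmd(self) -> str:
        return "ant test"
```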