CRAB: Code Review Automated Benchmark
CRAB (Code Review Automated Benchmark) is a high-quality dataset and extraction pipeline designed to evaluate automated code-review tools on two complementary tasks:
- Review Comment Generation: Given a code snapshot before review, generate natural-language comments emulating human reviewers.
- Code Refinement (Revised Code Generation): Given the same snapshot plus a reviewer’s comment, generate the revised code implementing that feedback.
CRAB focuses on Java projects, rigorously curating pull-request “triplets” of:
- submitted_code (pre-review code)
- reviewer_comment (validated natural-language feedback, with paraphrases)
- revised_code (post-review implementation, validated via tests)
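As a rough illustration of the triplet described above, one entry could be modeled as a small record. The class name `Triplet` and the `paraphrases` attribute are assumptions made for this sketch; only the three field names come from the list above, and the actual CRAB JSON schema may be organized differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Triplet:
    """Hypothetical container for one CRAB entry (not the project's real schema)."""

    submitted_code: str      # pre-review code
    reviewer_comment: str    # validated natural-language feedback
    revised_code: str        # post-review implementation
    # Assumed place to keep alternative wordings of the reviewer comment.
    paraphrases: List[str] = field(default_factory=list)


example = Triplet(
    submitted_code="int total = a + b;",
    reviewer_comment="Use long to avoid overflow.",
    revised_code="long total = (long) a + b;",
)
```

A comment-generation model would see only `submitted_code`, while a code-refinement model would see `submitted_code` together with `reviewer_comment`.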
Features
- Automated Extraction Pipeline (pull_requests.py)
  - Clones GitHub repositories, locates PRs with a single review comment, and extracts diffs before/after the comment
  - Builds and tests each snapshot in Docker (Maven & Gradle support)
  - Generates JaCoCo coverage reports to ensure revised code covers the commented lines
- Manual Validation Tools (manual_selection.py)
  - Interactive review to mark whether comments suggest changes and whether post-comment diffs address them
- Serialization & Task Extraction (dataset.py, extract_correct_predictions.py)
  - Produce JSON datasets for:
    - Full (all validated triplets)
    - Comment Generation
    - Code Refinement
    - Web App export format
- Utility Modules
  - handlers.py: abstract and concrete build/test handlers (Maven, Gradle)
  - utils.py: Git/GitHub helpers, BLEU-based paraphrase filtering, logging
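The coverage check mentioned in the feature list can be pictured with a short sketch that reads a JaCoCo XML report and looks up the commented lines. This is not CRAB's implementation: the function names are hypothetical, and the acceptance rule used here (at least one commented line has a covered instruction) is an assumption. The `sourcefile`/`line` elements and the `nr`/`ci` attributes follow the JaCoCo XML report layout.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Set


def covered_lines(jacoco_xml: str, source_file: str) -> Set[int]:
    """Return line numbers of `source_file` with at least one covered instruction."""
    root = ET.fromstring(jacoco_xml)
    lines = set()
    # Note: this matches on the bare file name only, so files with the same
    # name in different packages would be merged in this simplified version.
    for sourcefile in root.iter("sourcefile"):
        if sourcefile.get("name") != source_file:
            continue
        for line in sourcefile.iter("line"):
            if int(line.get("ci", "0")) > 0:  # ci = covered instructions
                lines.add(int(line.get("nr")))
    return lines


def comment_is_covered(jacoco_xml: str, source_file: str,
                       commented_lines: Iterable[int]) -> bool:
    """Assumed rule: at least one commented line is executed by the tests."""
    covered = covered_lines(jacoco_xml, source_file)
    return any(nr in covered for nr in commented_lines)


SAMPLE_REPORT = """<report name="demo">
  <package name="com/example">
    <sourcefile name="Foo.java">
      <line nr="10" mi="0" ci="3" mb="0" cb="0"/>
      <line nr="11" mi="2" ci="0" mb="0" cb="0"/>
    </sourcefile>
  </package>
</report>"""

print(covered_lines(SAMPLE_REPORT, "Foo.java"))
```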
Installation
- Clone this repository
  git clone https://github.com/karma-riuk/crab
  cd crab
- (Optional) Create a Python environment
  python -m venv .venv
  source .venv/bin/activate
- Install Python dependencies
  pip install -r requirements.txt
- Docker images
  The repository includes two Dockerfiles (maven.Dockerfile and gradle.Dockerfile) at its root. Build the images locally from this directory:
  # Build the Maven handler image
  docker build -f maven.Dockerfile -t crab-maven .
  # Build the Gradle handler image
  docker build -f gradle.Dockerfile -t crab-gradle .
Usage
1. Generate the dataset triplets
Run the extraction script to generate the CRAB dataset triplets:
python pull_requests.py [CSV_FILE] [options]
- CSV_FILE: Path to the input CSV listing repositories (output of clone_repos.py).
Options
Parameter | Default | Required | Description |
---|---|---|---|
CSV_FILE | - | Yes | The CSV file containing the list of GitHub repos to process. |
-o, --output | ./dataset.json | No | Path where the resulting JSON dataset will be saved. |
-r, --repos | ./results/ | No | Directory under which repos will be (or already are) cloned. |
-c, --cache | None | No | Path to a previous run’s JSON output to resume from (caches processed PRs). |
-a, --archive-destination | ./dataset/archives | No | Directory where per-PR archives (tar.gz) will be stored. |
-s, --sort-by | None | No | Column name in the CSV by which to sort repos before processing. |
--only-repo | None | No | Process only the specified repo (format: owner/name), ignoring all others in the CSV. |
--cache-requests | false | No | If set, caches GitHub API requests (using requests_cache) to speed up reruns at the risk of stale data. |
--max-workers | None (single-threaded) | No | Number of parallel workers for processing repos. If omitted, the script runs in a single thread. |
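To make the option table concrete, here is a minimal `argparse` sketch that mirrors the documented flags and defaults. It is an illustration only: the function name `build_parser` is made up, and the real parser in pull_requests.py may declare its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser reproducing the options documented above."""
    parser = argparse.ArgumentParser(prog="pull_requests.py")
    parser.add_argument("csv_file", metavar="CSV_FILE",
                        help="CSV listing the GitHub repos to process")
    parser.add_argument("-o", "--output", default="./dataset.json")
    parser.add_argument("-r", "--repos", default="./results/")
    parser.add_argument("-c", "--cache", default=None)
    parser.add_argument("-a", "--archive-destination", default="./dataset/archives")
    parser.add_argument("-s", "--sort-by", default=None)
    parser.add_argument("--only-repo", default=None)
    # Boolean switch: false unless the flag is present.
    parser.add_argument("--cache-requests", action="store_true")
    # None means "run in a single thread".
    parser.add_argument("--max-workers", type=int, default=None)
    return parser


args = build_parser().parse_args(
    ["my_repos.csv", "--max-workers=4", "--cache-requests"]
)
print(args.output, args.max_workers, args.cache_requests)
```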
Example
python pull_requests.py my_repos.csv \
--output=data/triplets.json \
--repos=./cloned_repos/ \
--archive-destination=./archives/ \
--cache-requests \
--max-workers=4
This will:
- Read my_repos.csv for the list of GitHub repositories.
- Clone any missing repos under ./cloned_repos/.
- Process each pull request, archiving the base and merged states under ./archives/.
- Save the combined dataset to data/triplets.json.
- Cache GitHub API calls for faster subsequent runs.
- Use 4 parallel workers to speed up processing.
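The effect of `--max-workers` can be sketched as a choice between a plain loop and a worker pool. Whether the script uses threads or processes is not stated here, so the thread pool below is an assumption, and `process_all` and `process_repo` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def process_all(repos: Iterable[str],
                process_repo: Callable[[str], dict],
                max_workers: Optional[int] = None) -> List[dict]:
    """Process every repo sequentially, or in a pool when max_workers is set."""
    repos = list(repos)
    if max_workers is None:
        # No worker count given: stay in the calling thread.
        return [process_repo(repo) for repo in repos]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(process_repo, repos))


results = process_all(["owner/a", "owner/b"],
                      lambda repo: {"repo": repo},
                      max_workers=4)
print(results)
```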
2. Run manual validation
Run the manual selection script to validate or refine your dataset entries:
python manual_selection.py [DATASET_FILE] -o OUTPUT [options]
- DATASET_FILE: Path to the input JSON dataset (e.g. output of your preprocessing step).
- -o, --output: Path where the updated dataset JSON will be saved.
Options
Parameter | Default | Required | Description |
---|---|---|---|
DATASET_FILE | - | Yes | Path to the dataset JSON file to process. |
-o, --output | - | Yes | Path where the resulting dataset (after manual selection/refinement) will be written. |
--overwrite | false | No | If set, re-evaluates and overwrites any existing Selection entries in the dataset. |
-m, --mode | comment | No | Validation mode to run in: • comment – only check if comments suggest a change. • refinement – check comment suggestions and whether diffs implement them. |
--check-diff-relevance | false | No | If set (only in refinement mode), first ask whether each diff is related to the comment before prompting for refinement. |
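The interplay of `--mode` and `--check-diff-relevance` can be read as a short decision flow. The sketch below is one plausible reading of the table, not the code of manual_selection.py: the function name, the question wording, and the keys of the returned record are all hypothetical.

```python
from typing import Callable, Dict, Optional


def _ask_on_terminal(question: str) -> bool:
    """Default prompt: anything other than 'y' counts as no."""
    return input(question + " [y/n] ").strip().lower() == "y"


def select_entry(comment: str,
                 diff: str,
                 mode: str = "comment",
                 check_diff_relevance: bool = False,
                 ask: Callable[[str], bool] = _ask_on_terminal,
                 ) -> Dict[str, Optional[bool]]:
    """Return a hypothetical selection record for one dataset entry."""
    selection: Dict[str, Optional[bool]] = {
        "comment_suggests_change": ask(f"Does this comment suggest a change?\n{comment}"),
        "diff_is_related": None,
        "diff_addresses_comment": None,
    }
    # Comment mode stops after the first question.
    if mode != "refinement" or not selection["comment_suggests_change"]:
        return selection
    if check_diff_relevance:
        selection["diff_is_related"] = ask(f"Is this diff related to the comment?\n{diff}")
        if not selection["diff_is_related"]:
            return selection
    selection["diff_addresses_comment"] = ask(f"Does this diff implement the comment?\n{diff}")
    return selection


# Scripted answers stand in for a human reviewer in this example.
scripted = iter([True, True, False])
result = select_entry("Rename this variable.", "- int x;\n+ int count;",
                      mode="refinement", check_diff_relevance=True,
                      ask=lambda _question: next(scripted))
print(result)
```

Passing the prompt as a callable keeps the decision flow testable without a terminal.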
3. Serialize to JSON for modeling
Load and process a dataset JSON, optionally add paraphrases, and serialize it in various formats:
python dataset.py [FILENAME] [options]
- FILENAME: Path to the input JSON file to load (e.g., output of a previous run).
Options
Parameter | Default | Required | Description |
---|---|---|---|
FILENAME | - | Yes | Path to the dataset JSON file to load. |
-o, --output | output.json | No | Path where the processed dataset (or archive) will be saved. |
-p, --paraphrases | None | No | CSV file containing generated paraphrases. Must include a paraphrases column with lines of the form Paraphrase#N: <text>. When provided, each paraphrase will be scored and (optionally) appended to its comment. |
-t, --output_type | full | No | Type of output to generate: • full – dump the entire dataset as JSON. • comment_gen – dump only entries whose comments suggest changes, as a ZIP of JSON (with _with_context or _no_context). • code_refinement – dump entries both covered and addressed, as a ZIP. • webapp – dump minimal fields for webapp. |
-a, --archives | None | No | Root directory where per-PR archives (tar.gz) live. Relevant only for comment_gen or code_refinement outputs; will be bundled into the ZIP under context/. |
--remove-non-suggesting | false | No | When output type is full, drop entries whose comments do not suggest a change. |
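The filtering implied by each output type can be sketched as follows. The entry keys (`suggests_change`, `is_covered`, `addresses_comment`) and the function name are assumptions made for illustration, and the treatment of `webapp` (no filtering) is a guess, since the table only says it dumps minimal fields.

```python
from typing import Dict, List


def filter_entries(entries: List[Dict],
                   output_type: str = "full",
                   remove_non_suggesting: bool = False) -> List[Dict]:
    """Select the entries that a given output type would serialize."""
    if output_type == "comment_gen":
        return [e for e in entries if e["suggests_change"]]
    if output_type == "code_refinement":
        # Both conditions must hold: covered by tests and addressed by the diff.
        return [e for e in entries if e["is_covered"] and e["addresses_comment"]]
    if output_type == "full" and remove_non_suggesting:
        return [e for e in entries if e["suggests_change"]]
    return list(entries)  # unfiltered "full", and (assumed) "webapp"


entries = [
    {"id": 1, "suggests_change": True, "is_covered": True, "addresses_comment": True},
    {"id": 2, "suggests_change": True, "is_covered": False, "addresses_comment": True},
    {"id": 3, "suggests_change": False, "is_covered": False, "addresses_comment": False},
]
print([e["id"] for e in filter_entries(entries, "code_refinement")])
```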
Examples
Basic full dump:
python dataset.py data/raw_dataset.json
Add paraphrases and overwrite default output path:
python dataset.py data/raw_dataset.json \
-o data/with_paraphrases.json \
-p paraphrases.csv
Generate a ZIP for code-refinement with context archives:
python dataset.py data/raw_dataset.json \
-o outputs/code_refinement.zip \
-t code_refinement \
-a ./archives/
This will:
- Load data/raw_dataset.json into memory.
- If -p paraphrases.csv is given, read paraphrases, score them, and append non-redundant ones to each comment.
- Serialize entries according to --output_type.
- Bundle required archives (if any) into the resulting ZIP or write JSON to the specified --output.
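The paraphrase step above can be illustrated with a small parser for `Paraphrase#N: <text>` lines plus a redundancy filter. CRAB describes its filtering as BLEU-based; the sketch below deliberately swaps in `difflib.SequenceMatcher` from the standard library as a simple stand-in, so the scores are not BLEU scores. The function names and the 0.9 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List

PARAPHRASE_RE = re.compile(r"^Paraphrase#(\d+):\s*(.+)$")


def parse_paraphrases(cell: str) -> List[str]:
    """Extract the text of every `Paraphrase#N: <text>` line in a CSV cell."""
    found = []
    for line in cell.splitlines():
        match = PARAPHRASE_RE.match(line.strip())
        if match:
            found.append(match.group(2).strip())
    return found


def keep_non_redundant(comment: str, candidates: List[str],
                       threshold: float = 0.9) -> List[str]:
    """Keep candidates that are not near-duplicates of the comment or of each other."""
    kept: List[str] = []
    for candidate in candidates:
        references = [comment] + kept
        # Stand-in similarity (character-based ratio), NOT BLEU.
        if all(SequenceMatcher(None, candidate.lower(), ref.lower()).ratio() < threshold
               for ref in references):
            kept.append(candidate)
    return kept


cell = ("Paraphrase#1: Please rename this variable.\n"
        "Paraphrase#2: Consider a more descriptive name here.\n"
        "unrelated line")
parsed = parse_paraphrases(cell)
kept = keep_non_redundant("Please rename this variable.", parsed)
print(kept)
```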
4. Extract “ground truth” references
Run the script to extract “exact prediction” JSONs for comment‐generation, code‐refinement, or paraphrase tasks:
python extract_correct_predictions.py DATASET_JSON [options]
- DATASET_JSON: Path to the input dataset JSON file.
Options
Parameter | Default | Required | Description |
---|---|---|---|
DATASET_JSON | - | Yes | Path to the dataset JSON to process. |
-o, --output | exact_predictions_<type>.json | No | Path for the output JSON file. If omitted, defaults to exact_predictions_<output-type>.json. |
-a, --archives | - | Only for code_refinement | Directory where per-PR tar.gz archives live. Required when --output-type=code_refinement so merged file contents can be extracted. |
-t, --output-type | comment_gen | No | Which extraction to perform: • comment_gen – pull file+location+body for commenting tasks. • code_refinement – extract post-merge file contents for code tasks. • paraphrases – dump comments+before-PR files for paraphrase creation. |
OutputType Values
Name | Value | Meaning |
---|---|---|
COMMENT_GEN | comment_gen | Extracts predicted comment locations & bodies to feed a comment‐generation model. |
CODE_REFINEMENT | code_refinement | Extracts merged file snapshots for entries that both cover and address changes, to feed a refinement model. |
FOR_PARAPHRASES | paraphrases | Extracts original comments plus “before-PR” file contents for paraphrase generation. |
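The names and values in the table map naturally onto a Python `Enum`. The sketch below shows how the default output path and the `--archives` requirement could follow from the chosen type; the helper functions are hypothetical and the real script may structure this differently.

```python
from enum import Enum
from typing import Optional


class OutputType(Enum):
    COMMENT_GEN = "comment_gen"
    CODE_REFINEMENT = "code_refinement"
    FOR_PARAPHRASES = "paraphrases"


def resolve_output(output: Optional[str], output_type: OutputType) -> str:
    """Fall back to exact_predictions_<output-type>.json when no path is given."""
    return output or f"exact_predictions_{output_type.value}.json"


def check_archives(archives: Optional[str], output_type: OutputType) -> None:
    """Archives are only mandatory for code-refinement extraction."""
    if output_type is OutputType.CODE_REFINEMENT and archives is None:
        raise ValueError("--archives is required when --output-type=code_refinement")


chosen = OutputType("paraphrases")  # look up a member by its CLI value
print(resolve_output(None, chosen))
```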
Examples
1. Default comment-generation extraction
python extract_correct_predictions.py data/dataset.json \
-o predictions_comment.json
This reads data/dataset.json and writes all entries whose comments suggest changes to predictions_comment.json.
2. Code-refinement extraction
python extract_correct_predictions.py data/dataset.json \
--output refined_files.json \
--output-type code_refinement \
--archives ./archives/
This will locate each merged PR archive under ./archives/, extract the post-merge file contents for entries that both cover and address changes, and save them to refined_files.json.
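Reading one file out of a per-PR tar.gz archive can be done with the standard `tarfile` module, as in the sketch below. The archive name and its internal layout are hypothetical, since the naming scheme of CRAB's archives is not described here; the demo builds a tiny archive in a temporary directory to stay self-contained.

```python
import io
import os
import tarfile
import tempfile
from typing import Optional


def read_from_archive(archive_path: str, file_suffix: str) -> Optional[str]:
    """Return the text of the first regular file whose path ends with `file_suffix`."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(file_suffix):
                extracted = tar.extractfile(member)
                return extracted.read().decode("utf-8", errors="replace")
    return None


# Demo with a throwaway archive (names are made up for this example).
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "demo_merged.tar.gz")
    data = b"class Foo {}\n"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="repo/src/Foo.java")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    content = read_from_archive(archive, "src/Foo.java")
    missing = read_from_archive(archive, "src/Bar.java")

print(content)
```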
3. Paraphrase data extraction
python extract_correct_predictions.py data/dataset.json \
-t paraphrases \
-o comments_for_para.json
This dumps comment bodies plus “before-PR” file snapshots for all entries suggesting changes, suitable for paraphrase modeling.
Contributing
- Issue Tracker: Please file issues for bugs or feature requests.
- Pull Requests: Fork, create a topic branch, and submit a PR. Please include tests or validations where applicable.
- Extending Build Support: To add a new build system (e.g., Ant, Bazel), subclass BuildHandler in handlers.py and provide the commands and container image.
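As a rough picture of what such an extension might look like, the sketch below defines a stand-in `BuildHandler` base class and an Ant subclass. The method names, the image tag `crab-ant`, and the Ant target names are all assumptions; check the actual abstract interface in handlers.py before implementing a real handler.

```python
from abc import ABC, abstractmethod
from typing import List


class BuildHandler(ABC):
    """Stand-in for the base class in handlers.py; its real interface may differ."""

    @abstractmethod
    def container_image(self) -> str:
        """Docker image in which the build runs."""

    @abstractmethod
    def compile_cmd(self) -> List[str]:
        """Command that compiles the project."""

    @abstractmethod
    def test_cmd(self) -> List[str]:
        """Command that runs the test suite."""


class AntHandler(BuildHandler):
    """Hypothetical handler for Ant-based projects."""

    def container_image(self) -> str:
        return "crab-ant"  # assumed image tag, would need its own Dockerfile

    def compile_cmd(self) -> List[str]:
        return ["ant", "compile"]  # target names depend on the project's build.xml

    def test_cmd(self) -> List[str]:
        return ["ant", "test"]


handler = AntHandler()
print(handler.container_image(), handler.test_cmd())
```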