CRAB: Code Review Automated Benchmark

CRAB (Code Review Automated Benchmark) is a high-quality dataset and extraction pipeline designed to evaluate automated code-review tools on two complementary tasks:

  1. Review Comment Generation: given a code snapshot before review, generate natural-language comments that emulate human reviewers.
  2. Code Refinement (Revised Code Generation): given the same snapshot plus a reviewer's comment, generate the revised code implementing that feedback.

CRAB focuses on Java projects, rigorously curating pull-request “triplets” of

  • submitted_code (pre-review code)
  • reviewer_comment (validated natural-language feedback, with paraphrases)
  • revised_code (post-review implementation, validated via tests)
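
To make the shape concrete, one triplet could look like the following (a hypothetical illustration expressed as a Python literal; the real field names in the JSON produced by the pipeline may differ):

```python
# Hypothetical CRAB triplet; field names are illustrative, not the pipeline's schema.
triplet = {
    "submitted_code": "public int div(int a, int b) { return a / b; }",
    "reviewer_comment": "Guard against division by zero here.",
    "revised_code": (
        "public int div(int a, int b) {\n"
        "    if (b == 0) throw new IllegalArgumentException(\"b must not be 0\");\n"
        "    return a / b;\n"
        "}"
    ),
}
```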

Features

  • Automated Extraction Pipeline (pull_requests.py)

    • Clones GitHub repositories, locates PRs with a single review comment, and extracts diffs before/after the comment
    • Builds and tests each snapshot in Docker (Maven & Gradle support)
    • Generates JaCoCo coverage reports to ensure revised code covers the commented lines
  • Manual Validation Tools (manual_selection.py)

    • Interactive review to mark whether comments suggest changes and whether post-comment diffs address them
  • Serialization & Task Extraction (dataset.py, extract_correct_predictions.py)

    • Produces JSON datasets for:

      • Full (all validated triplets)
      • Comment Generation
      • Code Refinement
      • Web App export format
  • Utility Modules

    • handlers.py: abstract and concrete build/test handlers (Maven, Gradle)
    • utils.py: Git/GitHub helpers, BLEU-based paraphrase filtering (sketched below), logging
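
For illustration, the paraphrase redundancy check can be pictured with NLTK's sentence-level BLEU. This is a minimal sketch, assuming a simple whitespace tokenizer and an arbitrary threshold; the exact logic lives in utils.py and may differ:

```python
# Minimal sketch of BLEU-based paraphrase filtering (threshold and
# tokenization are assumptions; see utils.py for the real logic).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def is_redundant(original: str, paraphrase: str, threshold: float = 0.8) -> bool:
    """Treat a paraphrase as redundant if it is near-identical to the original."""
    reference = [original.split()]
    candidate = paraphrase.split()
    score = sentence_bleu(reference, candidate,
                          smoothing_function=SmoothingFunction().method1)
    return score >= threshold

comment = "Please rename this variable for clarity."
paraphrase = "Please rename this variable for clarity!"
print(is_redundant(comment, paraphrase))  # near-duplicates score high
```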

Installation

  1. Clone this repository

    git clone https://github.com/your-org/crab
    cd crab
    
  2. Install Python dependencies

    pip install -r requirements.txt
    

    The pipeline depends on:

    • pandas, tqdm, docker, beautifulsoup4, unidiff, PyGithub, javalang
  3. Docker images

    • Build or pull the two images used by the handlers:

      • crab-maven (for Maven projects)
      • crab-gradle (for Gradle projects)
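
A minimal sketch of how the images might be prepared, assuming the Dockerfiles ship in this repository (the file paths below are illustrative):

```sh
# Illustrative only: adjust Dockerfile paths to the repository's actual layout.
docker build -t crab-maven  -f docker/maven.Dockerfile .
docker build -t crab-gradle -f docker/gradle.Dockerfile .
```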

Usage

1. Generate the dataset

Run the extraction script to generate the CRAB dataset triplets:

```sh
python pull_requests.py [CSV_FILE] [options]
```

  • CSV_FILE: Path to the input CSV listing repositories (output of clone_repos.py).

Options

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| CSV_FILE | string | — | Yes | The CSV file containing the list of GitHub repos to process. |
| -o, --output | string | ./dataset.json | No | Path where the resulting JSON dataset will be saved. |
| -r, --repos | string | ./results/ | No | Directory under which repos will be (or already are) cloned. |
| -c, --cache | string | None | No | Path to a previous run's JSON output to resume from (caches processed PRs). |
| -a, --archive-destination | string | ./dataset/archives | No | Directory where per-PR archives (tar.gz) will be stored. |
| -s, --sort-by | string | None | No | Column name in the CSV by which to sort repos before processing. |
| --only-repo | string | None | No | Process only the specified repo (format: owner/name), ignoring all others in the CSV. |
| --cache-requests | flag | false | No | If set, caches GitHub API requests (using requests_cache) to speed up reruns at the risk of stale data. |
| --max-workers | integer | None (single-threaded) | No | Number of parallel workers for processing repos. If omitted, the script runs in a single thread. |

**Example**

```sh
python pull_requests.py my_repos.csv \
  --output=data/triplets.json \
  --repos=./cloned_repos/ \
  --archive-destination=./archives/ \
  --cache-requests \
  --max-workers=4
```

This will:

  1. Read my_repos.csv for the list of GitHub repositories.
  2. Clone any missing repos under ./cloned_repos/.
  3. Process each pull request, archiving the base and merged states under ./archives/.
  4. Save the combined dataset to data/triplets.json.
  5. Cache GitHub API calls for faster subsequent runs.
  6. Use 4 parallel workers to speed up processing.
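
Since long runs can be interrupted, the -c/--cache flag documented above lets a rerun resume from a previous output instead of reprocessing every PR, e.g.:

```sh
# Resume a previous run, skipping PRs already recorded in data/triplets.json
python pull_requests.py my_repos.csv \
  --output=data/triplets.json \
  --cache=data/triplets.json
```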

2. Run manual validation

Run the manual selection script to validate or refine your dataset entries:

```sh
python manual_selection.py [DATASET_FILE] -o OUTPUT [options]
```

  • DATASET_FILE: Path to the input JSON dataset (e.g. the output of pull_requests.py).
  • -o, --output: Path where the updated dataset JSON will be saved.

Options

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| DATASET_FILE | string | — | Yes | Path to the dataset JSON file to process. |
| -o, --output | string | — | Yes | Path where the resulting dataset (after manual selection/refinement) will be written. |
| --overwrite | flag | false | No | If set, re-evaluates and overwrites any existing Selection entries in the dataset. |
| -m, --mode | ValidationMode enum | comment | No | Validation mode: comment only checks whether comments suggest a change; refinement also checks whether the post-comment diffs implement them. |
| --check-diff-relevance | flag | false | No | If set (only in refinement mode), first asks whether each diff is related to the comment before prompting for refinement. |
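
**Example**

A typical refinement-mode session, using only the documented flags (paths are illustrative):

```sh
python manual_selection.py data/triplets.json \
  -o data/validated.json \
  -m refinement \
  --check-diff-relevance
```

For each entry, this first asks whether the diff is related to the comment, then whether it implements the suggested change.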

3. Serialize to JSON for modeling

Load and process a dataset JSON, optionally add paraphrases, and serialize it in various formats:

```sh
python dataset.py [FILENAME] [options]
```

  • FILENAME: Path to the input JSON file to load (e.g., the output of a previous run).

Options

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| FILENAME | string | — | Yes | Path to the dataset JSON file to load. |
| -o, --output | string | output.json | No | Path where the processed dataset (or archive) will be saved. |
| -p, --paraphrases | string | None | No | CSV file containing generated paraphrases. Must include a paraphrases column with lines of the form Paraphrase#N: <text>. When provided, each paraphrase is scored and (optionally) appended to its comment. |
| -t, --output_type | OutputType enum | full | No | Type of output to generate: full dumps the entire dataset as JSON; comment_gen dumps only entries whose comments suggest changes, as a ZIP of JSON files (with _with_context or _no_context); code_refinement dumps entries that are both covered and addressed, as a ZIP; webapp dumps minimal fields for the web app. |
| -a, --archives | string | None | No | Root directory where per-PR archives (tar.gz) live. Relevant only for comment_gen or code_refinement outputs; archives are bundled into the ZIP under context/. |
| --remove-non-suggesting | flag | false | No | When the output type is full, drop entries whose comments do not suggest a change. |

Examples

Basic full dump:

```sh
python dataset.py data/raw_dataset.json
```

Add paraphrases and override the default output path:

```sh
python dataset.py data/raw_dataset.json \
  -o data/with_paraphrases.json \
  -p paraphrases.csv
```
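
Based on the format described above, paraphrases.csv is expected to contain a paraphrases column whose cells hold Paraphrase#N: <text> lines. An illustrative (hypothetical) row; how rows are matched to comments is not shown here:

```
paraphrases
"Paraphrase#1: Please guard against a division by zero here.
Paraphrase#2: This will crash when the divisor is zero; add a check."
```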

Generate a ZIP for code-refinement with context archives:

```sh
python dataset.py data/raw_dataset.json \
  -o outputs/code_refinement.zip \
  -t code_refinement \
  -a ./archives/
```

This will:

  1. Load data/raw_dataset.json into memory.
  2. If -p paraphrases.csv is given, read paraphrases, score them, and append non-redundant ones to each comment.
  3. Serialize entries according to --output_type.
  4. Bundle required archives (if any) into the resulting ZIP or write JSON to the specified --output.

4. Extract “ground truth” references

Run the script to extract “exact prediction” JSONs for comment-generation, code-refinement, or paraphrase tasks:

```sh
python extract_correct_predictions.py DATASET_JSON [options]
```

  • DATASET_JSON: Path to the input dataset JSON file.

Options

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| DATASET_JSON | string | — | Yes | Path to the dataset JSON to process. |
| -o, --output | string | exact_predictions_<type>.json | No | Path for the output JSON file. If omitted, defaults to exact_predictions_<output-type>.json. |
| -a, --archives | string | — | Only for code_refinement | Directory where per-PR tar.gz archives live. Required when --output-type=code_refinement so merged file contents can be extracted. |
| -t, --output-type | OutputType enum | comment_gen | No | Which extraction to perform: comment_gen pulls file, location, and body for commenting tasks; code_refinement extracts post-merge file contents for code tasks; paraphrases dumps comments plus before-PR files for paraphrase creation. |

OutputType Values

| Name | Value | Meaning |
|---|---|---|
| COMMENT_GEN | comment_gen | Extracts predicted comment locations & bodies to feed a comment-generation model. |
| CODE_REFINEMENT | code_refinement | Extracts merged file snapshots for entries that both cover and address changes, to feed a refinement model. |
| FOR_PARAPHRASES | paraphrases | Extracts original comments plus “before-PR” file contents for paraphrase generation. |

Examples

1. Default comment-generation extraction

```sh
python extract_correct_predictions.py data/dataset.json \
  -o predictions_comment.json
```

This reads data/dataset.json and writes all entries whose comments suggest changes to predictions_comment.json.


2. Code-refinement extraction

```sh
python extract_correct_predictions.py data/dataset.json \
  --output refined_files.json \
  --output-type code_refinement \
  --archives ./archives/
```

This will locate each merged PR archive under ./archives/, extract the post-merge file contents for entries that both cover and address changes, and save them to refined_files.json.


3. Paraphrase data extraction

```sh
python extract_correct_predictions.py data/dataset.json \
  -t paraphrases \
  -o comments_for_para.json
```

This dumps comment bodies plus “before-PR” file snapshots for all entries suggesting changes, suitable for paraphrase modeling.


Contributing

  1. Issue Tracker: Please file issues for bugs or feature requests.
  2. Pull Requests: Fork, create a topic branch, and submit a PR. Please include tests or validations where applicable.
  3. Extending Build Support: To add a new build system (e.g., Ant, Bazel), subclass BuildHandler in handlers.py and provide the build/test commands and container image (see the sketch below).
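
A hypothetical sketch of such an extension (the method names below are illustrative; follow the actual abstract interface declared in handlers.py):

```python
# Hypothetical Ant support; the real BuildHandler interface in handlers.py
# may declare different abstract methods.
from handlers import BuildHandler

class AntHandler(BuildHandler):
    def container_image(self) -> str:
        # Docker image with Ant preinstalled, analogous to crab-maven/crab-gradle
        return "crab-ant"

    def compile_cmd(self) -> str:
        return "ant compile"

    def test_cmd(self) -> str:
        return "ant test"
```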