# Evaluation

The `evaluation/` folder provides SWE-agent compatible scripts for running [SWE-bench style evaluation](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md) on model patch predictions. We also include additional scripts to quantify model performance on "subtasks" within the SWE-bench task, such as identifying the right file(s) to edit.

## 📖 Table of Contents
- [Evaluation](#evaluation)
  - [📖 Table of Contents](#-table-of-contents)
  - [🐇 Quick Start](#-quick-start)
  - [🪑 SWE-bench Evaluation](#-swe-bench-evaluation)
  - [📈 Viewing Results](#-viewing-results)

## 🐇 Quick Start

You can run evaluations on SWE-bench by passing in the predictions generated by SWE-agent (usually named `all_preds.jsonl`). Simply run the following script:

```bash
./run_eval.sh
```

Depending on the number of task instances and how long setting up the execution environment takes, the evaluation can take anywhere from a couple of minutes up to 7 hours for the entirety of the SWE-bench test split. When evaluation finishes, you should see output similar to the following:

```bash
2024-03-31 16:47:00,263 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installing with command: . /n/fs/p-swe-bench/testbed/ba397fe0d6/pvlib__pvlib-python/0.8/tmpom22t9na/miniconda3/bin/activate pvlib__pvlib-python__0.8 && echo 'activate successful' && pip install -e .[all]
2024-03-31 16:47:10,602 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installation successful
2024-03-31 16:47:10,619 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (test)
2024-03-31 16:47:10,635 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (pred)
2024-03-31 16:47:13,453 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Test script run successful
==================================
Log directory for evaluation run: /n/fs/p-swe-bench/results/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4
== Evaluation Report ==
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
- Wrote per-instance scorecards to //trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/scorecards.json
- Wrote summary of run to //trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/results.json
Reference Report:
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
```

## 🪑 SWE-bench Evaluation

`evaluation.py`: This script contains the logic for SWE-bench evaluation adapted for the SWE-agent setting. Given a set of predictions (e.g. `trajectories///all_preds.jsonl`), we...

1. Filter + analyze predictions.
2. Run SWE-bench style execution-based evaluation.
3. Save outcomes to `results.json` and `scorecards.json` files with info about task-specific and overall performance.

> `run_eval.sh` is provided as an example of how to run `evaluation.py`
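For reference, here is a sketch of what a direct call to `evaluation.py` might look like, using the arguments documented below. All paths are hypothetical placeholders; substitute your own predictions file, SWE-bench tasks file, and testbed directory.

```bash
# Sketch only: every path below is a placeholder, not a repository default.
python evaluation.py \
    --predictions_path trajectories/<user>/<experiment>/all_preds.jsonl \
    --log_dir ./eval_logs \
    --swe_bench_tasks ./swe-bench-tasks.json \
    --testbed ./testbed \
    --timeout 900 \
    --skip_existing \
    --verbose
```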
Arguments:
* `--predictions_path (required)`: The path to the file containing predictions (`.jsonl` format). This file includes the predictions that need to be evaluated against the benchmark tasks.
* `--log_dir (required)`: The directory path where log files generated during the evaluation process will be stored.
* `--swe_bench_tasks (required)`: The path to the file containing the SWE-bench task instances, i.e. the details of the tasks against which the predictions will be evaluated.
* `--testbed (required)`: The directory path for the testbed, used for setting up the execution environments for the evaluations.
* `--skip_existing (optional)`: If specified, the script skips task instances whose log files already exist, preventing re-evaluation of those tasks.
* `--timeout (optional)`: Timeout in seconds for each evaluation task (default: 900). This keeps individual evaluations from running excessively long.
* `--verbose (optional)`: Enables verbose mode, which provides more detailed output during script execution. Useful for debugging or getting more insight into the process.
* `--conda_link (optional)`: A URL to a Conda installation that should be used for the evaluation environment. This can be necessary if the evaluation requires a specific software environment.
* `--log_suffix (optional)`: An additional suffix for log files. Useful for organizing logs, especially when running multiple evaluations in parallel or under different configurations.

## 📈 Viewing Results

`aggregate_results.py`: This script aggregates and displays experiment results from the `trajectories/` folder.

* Experiments are grouped by `(Model, Dataset, Config File, Temp., Top P, Cost, Install)`.
* The following statistics are shown for each experiment run:
  * `Not Generated`: # of task instances with no patch generated
  * `Generated`: # of task instances with a patch
  * `Applied`: # of patches that applied successfully
  * `Resolved`: # of task instances resolved
  * `Costs [Success|Failed|Overall]`: Cost of [successful|failed|any] runs
* If there are multiple runs of an experiment (distinguished by `--suffix run`), the above statistics are aggregated as totals or means.

Usage:
```
python aggregate_results.py
```

Arguments:
* `--folder (type: str, default: ../trajectories)`: Specifies the folder containing the experiment results. This is where the script will look to gather data.
* `--model (type: str, nargs: '+')`: Filters the results by model(s). Only results corresponding to the specified model(s) will be included.
* `--dataset (type: str, nargs: '+')`: Filters the results by dataset(s). Only results for the specified dataset(s) will be analyzed.
* `--setup (type: str, nargs: '+')`: Filters the results by setup(s). This allows focusing on specific experiment configurations.
* `--runs_min (type: int)`: The minimum number of runs an experiment should have to be included in the analysis. Helps exclude experiments with insufficient data.
* `--runs_max (type: int)`: The maximum number of runs to consider for each experiment. This can limit the data to the most relevant runs.
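As an illustration, you might aggregate a filtered subset of runs like so. The model and dataset values below are examples taken from the sample log output above, not defaults; adjust them to the experiments you actually ran.

```bash
# Example filters only; replace with the models/datasets present in your trajectories/ folder.
python aggregate_results.py \
    --folder ../trajectories \
    --model gpt-4-1106-preview \
    --dataset swe-bench-dev-40-seed24 \
    --runs_min 1
```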