# Evaluation

The `evaluation/` folder provides SWE-agent compatible scripts for running [SWE-bench style evaluation](https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md) on model patch predictions. We also include additional scripts to quantify model performance on "subtasks" within the SWE-bench task, such as identifying the right file(s) to edit.

## 📖 Table of Contents
- [Evaluation](#evaluation)
  - [📖 Table of Contents](#-table-of-contents)
  - [🐇 Quick Start](#-quick-start)
  - [🪑 SWE-bench Evaluation](#-swe-bench-evaluation)
  - [📈 Viewing Results](#-viewing-results)

## 🐇 Quick Start

You can run evaluations on SWE-bench by passing in the predictions generated by SWE-agent (usually named `all_preds.jsonl`). Simply run the following script:

```bash
./run_eval.sh
```

Depending on the number of task instances and how long setting up the execution environment takes, the evaluation can take anywhere from a couple of minutes up to 7 hours for the entirety of the SWE-bench test split. When evaluation finishes, you should see output similar to the following:

```bash
2024-03-31 16:47:00,263 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installing with command: . /n/fs/p-swe-bench/testbed/ba397fe0d6/pvlib__pvlib-python/0.8/tmpom22t9na/miniconda3/bin/activate pvlib__pvlib-python__0.8 && echo 'activate successful' && pip install -e .[all]
2024-03-31 16:47:10,602 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Installation successful
2024-03-31 16:47:10,619 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (test)
2024-03-31 16:47:10,635 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Apply patch successful (pred)
2024-03-31 16:47:13,453 - taskenv_context_manager - INFO - [pvlib__pvlib-python__0.8] [pvlib__pvlib-python-1395] Test script run successful
==================================
Log directory for evaluation run: /n/fs/p-swe-bench/results/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4
== Evaluation Report ==
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
- Wrote per-instance scorecards to //trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/scorecards.json
- Wrote summary of run to //trajectories/carlosejimenez/gpt-4-1106-preview__swe-bench-dev-40-seed24__default_sys-env_window100-detailed_cmd_format-full_history-1_demos__t-0.20__p-0.95__c-4.00__install-1__sweep-01-run-4/results.json
Reference Report:
{'# Not Generated': 1, '# Generated': 36, '# Applied': 34, '# Resolved': 5}
```

## 🪑 SWE-bench Evaluation

`evaluation.py`: This script contains the logic for SWE-bench evaluation adapted for the SWE-agent setting. Given a set of predictions (e.g. `trajectories///all_preds.jsonl`), we...

1. Filter + analyze predictions.
2. Run SWE-bench style execution-based evaluation.
3. Save outcomes to `results.json` and `scorecards.json` files with info about task-specific and overall performance.

> `run_eval.sh` is provided as an example of how to run `evaluation.py`
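For reference, here is a sketch of what a direct call to `evaluation.py` might look like, using the arguments documented below. All paths are hypothetical placeholders; substitute your own predictions file, SWE-bench tasks file, and testbed directory.

```bash
# Sketch only: every path below is a placeholder, not a repository default.
python evaluation.py \
    --predictions_path trajectories/<user>/<experiment>/all_preds.jsonl \
    --log_dir ./eval_logs \
    --swe_bench_tasks ./swe-bench-tasks.json \
    --testbed ./testbed \
    --timeout 900 \
    --skip_existing \
    --verbose
```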
Arguments:
* `--predictions_path (required)`: The path to the file containing predictions (`.jsonl` format). This file includes the predictions that need to be evaluated against the benchmark tasks.
* `--log_dir (required)`: The directory path where log files generated during the evaluation process will be stored.
* `--swe_bench_tasks (required)`: The path to the file containing the SWE-bench task instances, i.e. the details of the tasks against which the predictions will be evaluated.
* `--testbed (required)`: The directory path for the testbed, used for setting up the execution environments for the evaluations.
* `--skip_existing (optional)`: If specified, the script skips task instances whose log files already exist, preventing re-evaluation of those tasks.
* `--timeout (optional)`: Timeout in seconds for each evaluation task (default: 900). This keeps individual evaluations from running excessively long.
* `--verbose (optional)`: Enables verbose mode, which provides more detailed output during script execution. Useful for debugging or getting more insight into the process.
* `--conda_link (optional)`: A URL to a Conda installation that should be used for the evaluation environment. This can be necessary if the evaluation requires a specific software environment.
* `--log_suffix (optional)`: An additional suffix for log files. Useful for organizing logs, especially when running multiple evaluations in parallel or under different configurations.

## 📈 Viewing Results

`aggregate_results.py`: This script aggregates and displays experiment results from the `trajectories/` folder.

* Experiments are grouped by `(Model, Dataset, Config File, Temp., Top P, Cost, Install)`.
* The following statistics are shown for each experiment run:
  * `Not Generated`: # of task instances with no patch generated
  * `Generated`: # of task instances with a patch
  * `Applied`: # of patches that applied successfully
  * `Resolved`: # of task instances resolved
  * `Costs [Success|Failed|Overall]`: Cost of [successful|failed|any] runs
* If there are multiple runs of an experiment (distinguished by `--suffix run`), the above statistics are aggregated as totals or means.

Usage:
```
python aggregate_results.py
```

Arguments:
* `--folder (type: str, default: ../trajectories)`: Specifies the folder containing the experiment results. This is where the script will look to gather data.
* `--model (type: str, nargs: '+')`: Filters the results by model(s). Only results corresponding to the specified model(s) will be included.
* `--dataset (type: str, nargs: '+')`: Filters the results by dataset(s). Only results for the specified dataset(s) will be analyzed.
* `--setup (type: str, nargs: '+')`: Filters the results by setup(s). This allows focusing on specific experiment configurations.
* `--runs_min (type: int)`: The minimum number of runs an experiment should have to be included in the analysis. Helps exclude experiments with insufficient data.
* `--runs_max (type: int)`: The maximum number of runs to consider for each experiment. This can limit the data to the most relevant runs.
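As an illustration, you might aggregate a filtered subset of runs like so. The model and dataset values below are examples taken from the sample log output above, not defaults; adjust them to the experiments you actually ran.

```bash
# Example filters only; replace with the models/datasets present in your trajectories/ folder.
python aggregate_results.py \
    --folder ../trajectories \
    --model gpt-4-1106-preview \
    --dataset swe-bench-dev-40-seed24 \
    --runs_min 1
```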