Agent Arena is a platform designed for users to compare and evaluate various language model agents across different models, frameworks, and tools. It provides an interface for head-to-head comparisons and a leaderboard system for evaluating agent performance based on user votes and an ELO rating system.
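As a rough illustration of how a head-to-head vote moves ELO ratings, here is a minimal sketch. The function name and the K-factor of 32 are illustrative assumptions, not the platform's actual implementation:

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Update two ELO ratings after one head-to-head battle.

    score_a: 1.0 if agent A won the user vote, 0.0 if agent B won,
    0.5 for a tie. The winner gains exactly what the loser gives up.
    """
    # Expected score of A under the standard logistic ELO curve.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two evenly rated agents; A wins the vote and gains half of k.
a, b = elo_update(1000, 1000, score_a=1.0)
```

An upset (a low-rated agent beating a high-rated one) moves the ratings further than an expected result, which is what makes the leaderboard converge as votes accumulate.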
The frontend of the Agent Arena is built using React. The frontend components are stored under the `client/src/components` directory; you can modify or enhance the UI by editing these files.
To get started with development on the frontend, navigate to the `client` folder, install the dependencies, and start the dev server:

```bash
cd client
npm install
npm start
```
The app will run in development mode, and you can view it at http://localhost:3000.
Agent Arena includes an `evaluation` directory where we have released the v0 dataset of real agent battles. This dataset includes a notebook (`Agent_Arena_Elo_Rating.ipynb`) that outlines the evaluation process for agents using ELO ratings.

To view the dataset and run the evaluation notebook, navigate to the `evaluation` directory and open the notebook using Jupyter or any other notebook editor.
You can also find the ratings for agents, models, and tools in the respective JSON files in the `evaluation` directory:

- `agent_ratings_V0.json` (used for the final calculation, featuring battle data with over 2,000 ratings, including prompt, left agent, right agent, categories, and subcomponents)
- `toolratings_V0.json` (used to calculate tool subcomponents individually, without using the extended Bradley-Terry approach)
- `modelratings_V0.json` (used to calculate model subcomponents individually, without using the extended Bradley-Terry approach)
- `frameworkratings_V0.json` (used to calculate framework subcomponents individually, without using the extended Bradley-Terry approach)

The evaluation uses a combination of Bradley-Terry and combined subcomponent ratings: the Bradley-Terry model compares agents in head-to-head competitions, while the subcomponent ratings help evaluate individual models, tools, and frameworks.
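To make the head-to-head part concrete, here is a minimal sketch of fitting Bradley-Terry strengths from a list of (winner, loser) outcomes using the classic minorization-maximization iteration. The function name and the flat pair format are assumptions for illustration; the released JSON files carry a richer schema (prompt, left/right agent, categories, subcomponents) than this toy input:

```python
from collections import defaultdict

def bradley_terry(battles, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Under the model, P(i beats j) = p_i / (p_i + p_j). The MM update
    is p_i <- W_i / sum_j n_ij / (p_i + p_j), where W_i is i's win
    count and n_ij the number of battles between i and j.
    """
    players = {p for pair in battles for p in pair}
    strength = {x: 1.0 for x in players}
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
    for _ in range(iters):
        new = {}
        for i in players:
            denom = 0.0
            for j in players:
                if i == j:
                    continue
                n = pair_counts[frozenset((i, j))]
                if n:
                    denom += n / (strength[i] + strength[j])
            # Players with no battles keep their current strength.
            new[i] = wins[i] / denom if denom else strength[i]
        # Normalize so strengths sum to the number of players.
        total = sum(new.values())
        strength = {x: v * len(players) / total for x, v in new.items()}
    return strength

# Toy example: A beats B twice, B beats A once, A beats C once.
strengths = bradley_terry([("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")])
```

Unlike per-battle ELO updates, this fit uses all battles jointly, so the resulting strengths do not depend on the order in which votes arrived.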
We have also released a leaderboard where you can view the current standings of agents. To evaluate the agents and their subcomponents yourself, open the notebook in the `evaluation` directory and follow the instructions within it.
If you'd like to contribute changes to the Agent Arena, you can do so by creating a Pull Request (PR) in the Gorilla repository. Follow these steps:

1. Fork the Gorilla repository and clone the fork to your local machine:

   ```bash
   git clone https://github.com/<your-username>/gorilla.git
   ```

2. Create a new branch for your changes:

   ```bash
   git checkout -b your-branch-name
   ```

3. Make your changes in `client/src/components` or other relevant directories.

4. Test your changes thoroughly.

5. Commit your changes and push them to your forked repository:

   ```bash
   git add .
   git commit -m "Description of your changes"
   git push origin your-branch-name
   ```

6. Open a Pull Request against the Gorilla repository.
We welcome contributions and look forward to seeing your innovative ideas in action!