Features
- Supports over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Compatible with commercial APIs.
- Supports local models and custom benchmarks.
- Uses publicly available prompts to ensure reproducibility and comparability between papers.
- Easy integration of custom prompts and evaluation metrics.
Setup
- Create a SambaCloud account and obtain an API key.
- Clone the lm-evaluation-harness repository.
- Create and activate a virtual environment.
- Install dependencies (a combined sketch of these commands follows this list).
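A minimal sketch of the steps above, assuming a Unix-like shell. The repository URL is the upstream EleutherAI project; the `[api]` extra name is taken from that project's README, and the environment variable used for the key is an assumption to verify against the SambaCloud documentation.

```bash
# Clone the upstream evaluation harness.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness

# Create and activate a virtual environment.
python -m venv .venv
source .venv/bin/activate

# Install the harness; the [api] extra pulls in dependencies for
# API-backed models (extra name taken from the upstream README).
pip install -e ".[api]"

# Expose the SambaCloud API key. Recent harness versions read
# OPENAI_API_KEY for OpenAI-compatible backends; verify for your version.
export OPENAI_API_KEY="your-sambacloud-api-key"
```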
Additional Python packages may be required depending on the selected benchmark or task. If you encounter errors related to missing libraries, install them manually.
Example use case
Run this evaluation locally or in a notebook environment.
- Example benchmark: GSM8K (Grade School Math)
- Model source: SambaCloud
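A hedged example invocation using the harness's OpenAI-compatible `local-chat-completions` backend. The base URL and model name are assumptions about SambaCloud's endpoint and model catalog; substitute the values from your own account.

```bash
# Evaluate a SambaCloud-hosted model on GSM8K (5-shot).
# NOTE: base_url and model are assumptions; verify against the SambaCloud docs.
lm_eval \
  --model local-chat-completions \
  --model_args model=Meta-Llama-3.1-8B-Instruct,base_url=https://api.sambanova.ai/v1/chat/completions \
  --tasks gsm8k \
  --num_fewshot 5 \
  --apply_chat_template \
  --output_path results/gsm8k
```

Adding `--limit 10` scores only the first ten examples, which is a cheap smoke test before committing to a full benchmark run.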