Evaluation and monitoring
LM Evaluation Harness
The Language Model Evaluation Harness is a unified framework developed by EleutherAI for evaluating generative language models across a wide range of academic benchmarks and tasks.
Features
- Supports over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Compatible with commercial APIs.
- Supports local models and custom benchmarks.
- Uses publicly available prompts to ensure reproducibility and comparability between papers.
- Easy integration of custom prompts and evaluation metrics.
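For a quick view of what is implemented, the harness CLI can enumerate its bundled benchmarks and subtasks once it is installed (see Setup below). The sketch assumes the `lm_eval` entry point is available on your PATH.

```bash
# List every benchmark and subtask bundled with the installed harness version.
lm_eval --tasks list
```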
Setup
- Create a SambaNova Cloud account and obtain an API key.
- Clone the `lm-evaluation-harness` repository.
- Create and activate a virtual environment.
- Install dependencies. The commands for these steps are sketched after this list.
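A minimal sketch of the setup steps, assuming a Unix-like shell and Python 3. The environment-variable name used for the API key is an assumption; it depends on how you point the harness at SambaNova Cloud.

```bash
# Store the SambaNova Cloud API key (variable name is an assumption; the harness's
# OpenAI-compatible backends typically read OPENAI_API_KEY).
export OPENAI_API_KEY="<your-sambanova-cloud-api-key>"

# Clone the lm-evaluation-harness repository.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness

# Create and activate a virtual environment.
python -m venv .venv
source .venv/bin/activate

# Install the harness and its core dependencies in editable mode.
pip install -e .
```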
Additional Python packages may be required depending on the selected benchmark or task. If you encounter errors related to missing libraries, install them manually.
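For example, API-backed models and some tasks pull in optional packages. The extra and package names below are illustrative and vary by harness version; follow the actual error message or the harness documentation.

```bash
# Illustrative only: the "api" extra covers OpenAI-compatible API backends in recent
# harness versions; individual tasks may name other packages in their error messages.
pip install -e ".[api]"
pip install langdetect immutabledict
```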
Example use case
Run this evaluation locally or in a notebook environment.
- Example benchmark: GSM8K (Grade School Math)
- Model source: SambaNova Cloud
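A sketch of the run, using the harness's OpenAI-compatible chat interface. The model type string, model name, and base URL are assumptions; substitute the values from your SambaNova Cloud account and the harness documentation.

```bash
# Sketch: 5-shot GSM8K against a SambaNova Cloud model via an OpenAI-compatible endpoint.
# The model name and base URL below are placeholders/assumptions.
export OPENAI_API_KEY="<your-sambanova-cloud-api-key>"

lm_eval \
  --model local-chat-completions \
  --model_args model=Meta-Llama-3.1-8B-Instruct,base_url=https://api.sambanova.ai/v1/chat/completions \
  --tasks gsm8k \
  --num_fewshot 5 \
  --apply_chat_template \
  --output_path results/gsm8k
```

When the run finishes, the harness prints a results table (GSM8K reports exact-match accuracy) and writes the full scores as JSON under the output path.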
This example demonstrates how to evaluate the reasoning and arithmetic skills of an LLM using standard prompt formats and metrics.
Resources