The Language Model Evaluation Harness (lm-evaluation-harness) is a unified framework developed by EleutherAI for evaluating generative language models on a wide range of benchmark tasks.

Features

  • Supports over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
  • Compatible with commercial APIs.
  • Supports local models and custom benchmarks.
  • Uses publicly available prompts to ensure reproducibility and comparability between papers.
  • Easy integration of custom prompts and evaluation metrics.
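
As a rough illustration of how the harness is driven, the command-line entry point takes a model backend, model arguments, and one or more task names. The model and flag values below are placeholders, and exact flag names may vary between harness versions:

    # List the available tasks (illustrative; flag behavior may differ by version)
    lm_eval --tasks list

    # Evaluate a local Hugging Face model on a single benchmark
    lm_eval --model hf \
        --model_args pretrained=gpt2 \
        --tasks hellaswag \
        --batch_size 8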

Setup

  1. Create a SambaNova Cloud account and obtain an API key; the key is exported as an environment variable in the snippet after the install steps.

  2. Clone the lm-evaluation-harness repository:

    git clone https://github.com/EleutherAI/lm-evaluation-harness.git
    cd lm-evaluation-harness
    
  3. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  
    
  4. Install dependencies:

    pip install -e .
    pip install -e ".[api]"
    pip install tqdm
    

Additional Python packages may be required depending on the selected benchmark or task. If you encounter errors related to missing libraries, install them manually.
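
Before running an evaluation against SambaNova Cloud, make the API key from step 1 available to the harness. The harness's OpenAI-compatible backends generally read the key from an environment variable; the variable name below is an assumption, so check the documentation for the backend you use:

    # Export the SambaNova Cloud API key (assumed variable name; the
    # OpenAI-compatible backends commonly read OPENAI_API_KEY)
    export OPENAI_API_KEY="your-sambanova-api-key"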

Example use case

Run this evaluation locally or in a notebook environment.

  • Example benchmark: GSM8K (Grade School Math)
  • Model source: SambaNova Cloud

This example demonstrates how to evaluate the reasoning and arithmetic skills of an LLM using standard prompt formats and metrics.
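
A minimal sketch of the corresponding command is shown below. It assumes SambaNova Cloud exposes an OpenAI-compatible chat endpoint and uses the harness's local-chat-completions backend; the base URL, model name, and few-shot setting are assumptions to adapt to your account and harness version:

    # Run GSM8K against a SambaNova Cloud model (illustrative values)
    # base_url and model are assumptions; replace them with the endpoint
    # and model name from your SambaNova Cloud account
    lm_eval --model local-chat-completions \
        --model_args model=Meta-Llama-3.1-8B-Instruct,base_url=https://api.sambanova.ai/v1/chat/completions \
        --tasks gsm8k \
        --num_fewshot 5 \
        --apply_chat_template

When the run finishes, the harness prints a results table with the task's accuracy metrics; results can also be written to disk using the harness's output options.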