Evaluation Runner for FilBench

This repository contains the implementation for FilBench, an Open LLM Leaderboard and Evaluation Suite for Filipino. It is a fork of HuggingFace's lighteval library, with new Filipino-focused benchmarks implemented under the hood.

⌛ Set-up and Installation

First, clone the repository and install all dependencies:

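git clone git@github.com:filbench/lighteval.git
cd lighteval
# Create and activate a virtualenv
python3 -m venv venv
source venv/bin/activate
# Quote the extras so shells such as zsh don't expand the brackets
pip install -e ".[dev,vllm]"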

If you're developing FilBench, we encourage installing a pre-commit hook:

pre-commit install
pre-commit run --all-files

(Using uv) Alternatively, install with uv. Internally, we found that installing with uv is much faster and yields a correct environment.

uv pip install --no-cache -e ".[dev,vllm]"

🔎 Inspecting a task

FilBench contains a suite of evaluation benchmarks built from a variety of Filipino datasets. Each task is identified by a string in the following format: filbench|{task_name}|{few_shot}|{truncate_few_shots}. You can find all tasks in the examples/tasks/all_filbench_tasks.txt file. For example, let's inspect the anatomy subset of Global-MMLU:

python -m lighteval tasks inspect "filbench|global_mmlu_all_tgl_mcf:anatomy|0|0" \
  --num-samples 1 \
  --custom-tasks community_tasks/filbench_evals.py

Output:

{ 'choices': [' A', ' B', ' C', ' D'],
  'ctx': '',
  'fewshot_sorting_class': None,
  'gold_index': [0],
  'instruction': '',
  'num_asked_few_shots': -1,
  'num_effective_few_shots': -1,
  'original_query': '',
  'query': 'Tanong: Ang isang sugat na nagdudulot ng compression ng facial '
           'nerve sa stylomastoid foramen ay magdudulot ng ipsilateral\n'
           ' A. Paralysis ng facial muscles.\n'
           ' B. Paralysis ng facial muscles at pagkawala ng panlasa.\n'
           ' C. Paralysis ng facial muscles, pagkawala ng lasa at '
           'lacrimation.\n'
           ' D. Paralysis ng facial muscles, pagkawala ng lasa, lacrimation at '
           'pagbaba ng salivation.\n'
           'Sagot:',
  'specific': None,
  'task_name': 'filbench|global_mmlu_all_tgl_mcf:anatomy',
  'unconditioned_query': 'Sagot:'}
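
To make the task-string format concrete, here is a minimal sketch of how such a string breaks down into its four fields. The TaskSpec class and parse_task_string function are hypothetical helpers written for illustration, not part of lighteval, and the sketch assumes that truncate_few_shots is a 0/1 flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """The four fields of a task string (illustrative, not a lighteval class)."""
    suite: str
    task_name: str
    few_shot: int
    truncate_few_shots: bool


def parse_task_string(task: str) -> TaskSpec:
    """Split 'suite|task_name|few_shot|truncate_few_shots' into its fields."""
    parts = task.strip().split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {task!r}")
    suite, task_name, few_shot, truncate = parts
    # Assumption: the last field is a 0/1 flag.
    if truncate not in ("0", "1"):
        raise ValueError(f"truncate_few_shots must be 0 or 1, got {truncate!r}")
    return TaskSpec(suite, task_name, int(few_shot), truncate == "1")


spec = parse_task_string("filbench|global_mmlu_all_tgl_mcf:anatomy|0|0")
print(spec)
```

Note that the subset (here, anatomy) is part of the task name and is separated by a colon, not by another pipe.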

Tip

Always remember to pass community_tasks/filbench_evals.py in the --custom-tasks parameter. In addition, running all commands as a module (i.e., using python -m lighteval instead of lighteval) avoids some path-related and other hard-to-diagnose errors.

You can also list all tasks available in FilBench (and all of lighteval) via this command:

# Saves all tasks in a file called `all_tasks.txt`
python -m lighteval tasks list --custom-tasks community_tasks/filbench_evals.py > all_tasks.txt

▶️ Running a task

Please check lighteval's official documentation on running tasks. Nothing much differs except that all of FilBench's tasks are registered in the filbench suite. Use python -m lighteval instead of just lighteval.

(For FilBench developers) First, log in to HuggingFace or set your HF_TOKEN:

export HF_TOKEN=<your HF token here>
huggingface-cli login  # alternative way to log in

To run the full evaluation suite, we advise using the following commands:

# For models in HuggingFace and accessible via vLLM
cat examples/tasks/all_filbench_tasks.txt | xargs -I {} python -m lighteval vllm "pretrained=<MODEL_NAME>" {} --push-to-hub --results-org UD-Filipino --custom-tasks community_tasks/filbench_evals.py

# For models using the OpenAI API
export OPENAI_API_KEY=<...>
cat examples/tasks/all_filbench_tasks.txt | xargs -I {} python -m lighteval endpoint openai "<MODEL_NAME>" {} --push-to-hub --results-org UD-Filipino --custom-tasks community_tasks/filbench_evals.py
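
If you prefer to drive the loop from Python, the xargs pipeline for vLLM can be sketched roughly as follows. The build_vllm_commands function is a hypothetical helper, not part of lighteval; the flags are copied verbatim from the command above, and the script only prints the commands (a dry run) rather than launching them.

```python
import shlex
from pathlib import Path

CUSTOM_TASKS = "community_tasks/filbench_evals.py"


def build_vllm_commands(task_lines, model_name, results_org="UD-Filipino"):
    """Return one `python -m lighteval vllm ...` argument list per task string."""
    commands = []
    for line in task_lines:
        task = line.strip()
        if not task:
            continue  # skip blank lines
        commands.append([
            "python", "-m", "lighteval", "vllm",
            f"pretrained={model_name}",
            task,
            "--push-to-hub",
            "--results-org", results_org,
            "--custom-tasks", CUSTOM_TASKS,
        ])
    return commands


def main(task_file="examples/tasks/all_filbench_tasks.txt", model_name="<MODEL_NAME>"):
    path = Path(task_file)
    if not path.exists():
        print(f"task file not found: {path}")
        return
    for cmd in build_vllm_commands(path.read_text().splitlines(), model_name):
        # Dry run: print the command. To execute it instead, use
        # subprocess.run(cmd, check=True).
        print(shlex.join(cmd))


if __name__ == "__main__":
    main()
```

Passing each command as an argument list avoids shell-quoting problems with the pipe characters in task strings.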

(For FilBench developers) Rather than passing the model name in the CLI, you must use or create a predefined YAML model configuration in filbench/model_configs/ and pass its path.

🆕 Implementing a new task

Our structure differs quite a bit from the community tasks in lighteval. Specifically, we implement one task per file in the filbench/ directory. This keeps the code organized and lets multiple people work on different benchmarks at the same time.

1. Implement the task as a new file in the filbench/ directory. Check whether a similar implementation already exists among lighteval's tasks. By default, we follow their implementations so that we stay consistent with existing benchmarks. You can use the existing implementations in the filbench/ directory as reference.
2. Add the task to the TASK_TABLE constant in the community_tasks/filbench_evals.py file. This file is our main entrypoint for running evaluations.
3. Ensure that nothing is amiss: inspect the task using python -m lighteval tasks inspect to examine a single sample.
4. If everything looks good, add the task string, i.e., filbench|{task_name}|{few_shot}|{truncate_few_shots}, to the examples/tasks/all_filbench_tasks.txt file.
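
For the final step, a small helper can guard against malformed or duplicate entries when updating the task list. This is only a sketch: add_task_string is a hypothetical function, not part of FilBench or lighteval, and it assumes the file holds one task string per line.

```python
from pathlib import Path


def add_task_string(task_file, task):
    """Append `task` to `task_file` unless it is malformed or already listed.

    Returns True if the line was appended, False if it was already present.
    """
    task = task.strip()
    fields = task.split("|")
    if len(fields) != 4 or fields[0] != "filbench":
        raise ValueError(f"not a filbench task string: {task!r}")

    path = Path(task_file)
    text = path.read_text() if path.exists() else ""
    if task in (line.strip() for line in text.splitlines()):
        return False

    # Start on a fresh line if the file does not already end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(prefix + task + "\n")
    return True
```

For example, add_task_string("examples/tasks/all_filbench_tasks.txt", "filbench|my_new_task|0|0") would append the line once and return False on later calls; my_new_task is a placeholder name.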

Your go-to toolkit for lightning-fast, flexible LLM evaluation, from Hugging Face's Leaderboard and Evals Team.

Documentation: HF's doc

Unlock the Power of LLM Evaluation with Lighteval 🚀

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends (transformers, tgi, vllm, or nanotron) with ease. Dive deep into your model’s performance by saving and exploring detailed, sample-by-sample results to debug and see how your models stack up.

Customization at your fingertips: browse all our existing tasks and metrics, or effortlessly create your own custom task and custom metric, tailored to your needs.

Seamlessly experiment, benchmark, and store your results on the Hugging Face Hub, S3, or locally.

🔑 Key Features

Speed: Use vllm as backend for fast evals.
Completeness: Use the accelerate backend to launch any models hosted on Hugging Face.
Seamless Storage: Save results in S3 or Hugging Face Datasets.
Python API: Simple integration with the Python API.
Custom Tasks: Easily add custom tasks.
Versatility: Tons of metrics and tasks ready to go.

⚡️ Installation

Note that lighteval is currently completely untested on Windows, and we don't support it yet. (It should be fully functional on macOS and Linux.)

pip install lighteval

Lighteval allows for many extras when installing; see the documentation for a complete list.

If you want to push results to the Hugging Face Hub, log in with your access token:

huggingface-cli login

🚀 Quickstart

Lighteval offers the following entry points for model evaluation:

lighteval accelerate: evaluate models on CPU or one or more GPUs using 🤗 Accelerate
lighteval nanotron: evaluate models in distributed settings using ⚡️ Nanotron
lighteval vllm: evaluate models on one or more GPUs using 🚀 vLLM
lighteval endpoint:
- inference-endpoint: evaluate models on one or more GPUs using 🔗 Inference Endpoints
- tgi: evaluate models on one or more GPUs using 🔗 Text Generation Inference
- openai: evaluate models served through the 🔗 OpenAI API

Here’s a quick command to evaluate using the Accelerate backend:

lighteval accelerate \
    "model_name=gpt2" \
    "leaderboard|truthfulqa:mc|0|0"

🙏 Acknowledgements

Lighteval started as an extension of the fantastic Eleuther AI Harness (which powers the Open LLM Leaderboard) and draws inspiration from the amazing HELM framework.

While evolving Lighteval into its own standalone tool, we are grateful to the Harness and HELM teams for their pioneering work on LLM evaluations.

🌟 Contributions Welcome 💙💚💛💜🧡

Got ideas? Found a bug? Want to add a task or metric? Contributions are warmly welcomed!

If you're adding a new feature, please open an issue first.

If you open a PR, don't forget to run the styling!

pip install -e ".[dev]"
pre-commit install
pre-commit run --all-files

📜 Citation

@misc{lighteval,
  author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.8.0},
  url = {https://github.com/huggingface/lighteval}
}
