This project provides a unified framework to test generative language models on a large number of different evaluation tasks.
Features:
- Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
- Support for models loaded via transformers (including quantization via AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface (see the usage sketch below).
- Support for fast and memory-efficient inference with vLLM.
- Support for commercial APIs including OpenAI and TextSynth.
- Support for evaluation on adapters (e.g. LoRA) supported in HuggingFace's PEFT library.
- Support for local models and benchmarks.
- Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
- Easy support for custom prompts and evaluation metrics.
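As a quick illustration of how these features fit together, the sketch below evaluates a transformers-loaded model on a single benchmark through the harness's Python entry point. It is a minimal sketch, assuming the `simple_evaluate` helper exposed by the `lm_eval` package and the `hf` model type; the model name, task name, and argument values are illustrative, so check your installed version's documentation for the exact interface.

```python
# Minimal sketch: evaluate a transformers-loaded model on one benchmark.
# Assumes the `simple_evaluate` entry point of the lm_eval package;
# model id, task name, and argument values are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # backend: HuggingFace transformers
    model_args="pretrained=EleutherAI/gpt-j-6B",  # any HF Hub model id or local path
    tasks=["hellaswag"],                          # one of the implemented benchmarks
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics are collected under results["results"].
print(results["results"])
```

The same kind of run is typically available from the command line via the `lm_eval` entry point with matching `--model`, `--model_args`, and `--tasks` flags.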
The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular Open LLM Leaderboard, has been used in hundreds of papers, and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.