
Leaderboards

SEAL LLM Leaderboards evaluate frontier LLM capabilities. These leaderboards provide insight into models through robust datasets and precise criteria to benchmark the latest AI advancements.

Humanity's Last Exam
Challenging LLMs at the frontier of human knowledge
1. 18.81 ±1.47
2. 9.15 ±1.09
2. 8.93 ±1.08
View Full Ranking
Humanity's Last Exam (Text Only)
Models evaluated on text-only HLE questions
1. 18.57 ±1.57
2. 13.97 ±1.40
3. 11.1 ±1.26
View Full Ranking
EnigmaEval
Evaluating model performance on complex, multi-step reasoning tasks
1. 6.14 ±1.02
1. o1 (December 2024): 5.65 ±0.52
3. 4.23 ±0.45
View Full Ranking
MultiChallenge
Assessing models across diverse, interdisciplinary challenges
1. 51.91 ±0.99
1. 51.58 ±1.98
1. 49.82 ±1.36
View Full Ranking
VISTA
Vision-Language Understanding benchmark for multimodal models
1. 54.65 ±1.46
2. 48.23 ±0.70
2. 47.32 ±1.78
View Full Ranking
Agentic Tool Use (Enterprise)
Evaluating AI agents' ability to use enterprise tools effectively
1. o1 (December 2024): 70.14 ±5.32
1. 68.75 ±5.35
1. 67.01 ±5.43
View Full Ranking
Agentic Tool Use (Chat)
Assessing chatbots' proficiency in leveraging external tools
1. o3-mini (high): 63.45 ±5.52
1. 62.43 ±6.76
1. o3-mini (medium): 62.42 ±6.76
View Full Ranking
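
Each entry above reports a score with a ± term. This page does not state how that interval is computed; a common approach for benchmark accuracies is a bootstrap over per-question results, and the sketch below illustrates only that assumption. All names and numbers in it are placeholders.

import random

def bootstrap_ci(per_question_scores, n_resamples=10_000, alpha=0.05):
    """Estimate the mean score and a (1 - alpha) bootstrap confidence
    interval by resampling per-question results with replacement."""
    n = len(per_question_scores)
    means = []
    for _ in range(n_resamples):
        sample = [random.choice(per_question_scores) for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(per_question_scores) / n, (lower, upper)

# Hypothetical run: 0/1 correctness flags for 500 questions
random.seed(0)
flags = [1.0 if random.random() < 0.19 else 0.0 for _ in range(500)]
mean, (lower, upper) = bootstrap_ci(flags)
print(f"{100 * mean:.2f} (95% CI {100 * lower:.2f} to {100 * upper:.2f})")

With 0/1 per-question grades this amounts to resampling a binomial, so the interval narrows roughly with the square root of the number of questions.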

Frontier AI Evaluations

We conduct high-complexity evaluations to expose model failures, prevent benchmark saturation, and push model capabilities forward, while continuously evaluating the latest frontier models.

Scaling with Human Expertise

Humans design complex evaluations and define precise criteria to assess models, while LLMs scale those evaluations, ensuring efficiency and alignment with human judgment.
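
As an illustration of this human-rubric-plus-LLM-grader pattern (a minimal sketch, not Scale's actual pipeline), a grader model can apply a human-written rubric to each response. The OpenAI client, the "gpt-4o" grader, and the rubric text below are placeholder assumptions.

# Illustrative sketch only: a human-written rubric applied by an LLM grader.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the candidate answer from 0 to 1.\n"
    "1 = final answer matches the reference and the reasoning is sound.\n"
    "0 = otherwise.\n"
    "Reply with only the number."
)

def grade(question: str, reference: str, answer: str) -> float:
    """Ask the grader model to score one response against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder grader model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Candidate answer: {answer}"
            )},
        ],
    )
    return float(response.choices[0].message.content.strip())

In such a setup, a sample of automated grades can then be spot-checked by human reviewers, which is the alignment-with-human-judgment step the paragraph describes.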

Robust Datasets

Our leaderboards are built on carefully curated evaluation sets, combining private datasets, which guard against overfitting, with open-source datasets for broad benchmarking and comparability.

Run evaluations on frontier AI capabilities

If you'd like to add your model to this leaderboard or a future version, please contact leaderboards@scale.com. To ensure leaderboard integrity, models can only be featured the first time an organization encounters the prompts.