News

Søren Riis and Marc Roth are joint winners of "Humanity's Last Exam" Competition

Centre for Fundamentals of AI and Computational Theory 

20 March 2026

A robot taking an exam. Image by Mohamed Hassan from Pixabay

Søren Riis and Marc Roth are joint winners of the SafeAI Benchmark Competition "Humanity's Last Exam".

The paper "A benchmark of expert-level academic questions to assess AI capabilities", co-authored by FACT researchers Søren Riis and Marc Roth, has been published in Nature. It accompanies the benchmark created through Humanity's Last Exam (HLE), an initiative designed to push artificial intelligence to its limits by challenging it with expert-level questions.

AI systems are typically evaluated on benchmark questions that measure their capabilities and performance. However, as AI models have rapidly advanced, existing benchmarks have become too easy to meaningfully distinguish between the strongest systems. The HLE competition set out to change this by curating a new benchmark of exceptionally difficult questions.

The competition attracted more than 1,000 researchers and experts, who submitted questions spanning over 100 subjects. The selection process involved three stages:

1. AI Evaluation: five of the best AI models (late 2024) attempted each question. If all failed, the question advanced.

2. Expert Review: experts refined and assessed the questions and answers.

3. Final Selection: a panel of experts and organisers made the final call.

Of the more than 70,000 questions submitted to stage 1, only 2,500 made it into the final benchmark. The contributors of the top 50 questions were declared winners, each earning a prize, and all contributors were invited to join the paper accompanying the competition as co-authors.

Søren and Marc were the only participants from QMUL. Both contributed multiple questions and are joint winners of the competition, and one of Marc's questions was additionally selected to be featured in the Nature paper.

When the first version of the benchmark was finalised (early 2025), the best-performing AIs were OpenAI o1 and DeepSeek R1, which answered 8% and 8.5% of the questions correctly, respectively. One year later, Gemini 3 Pro achieved a staggering 38.3%. The true performance may be even higher, since the benchmark may still contain a small fraction of ambiguous questions, as well as questions whose given expert answers are partially incomplete or incorrect, mainly among the text-only chemistry and biology questions. The HLE team has therefore moved to a rolling process of quality control and improvement over the coming years.

People: Søren Riis, Marc Roth

Updated by: Paul Curzon