MLCommons Releases New MLPerf Inference v6.0 Benchmark Results

The most significant update to the benchmark suite to date, with new tests ensuring that it remains the most comprehensive measure of AI system performance


SAN FRANCISCO, April 01, 2026 (GLOBE NEWSWIRE) -- Today, MLCommons® announced new results for its industry-standard MLPerf® Inference v6.0 benchmark suite. This release includes several important advances that ensure the benchmark suite tests current, real-world scenarios for AI deployments and delivers a comprehensive picture of AI system performance.

Five of the eleven datacenter tests in MLPerf Inference v6.0 are new or updated, and the release also includes a new object-detection test for edge systems. The major changes include:

● A new, open-weight large language model (LLM) benchmark based on GPT-OSS 120B that can be used for mathematics, scientific reasoning, and coding;
● An expanded DeepSeek-R1 advanced-reasoning benchmark, including an interactive scenario that permits speculative decoding;
● DLRMv3, the third generation of the suite’s recommender benchmark and its first sequential recommendation test, thoroughly modernized with generous engineering contributions from Meta, a world leader in recommender systems;
● The suite’s first text-to-video generation benchmark;
● A new vision-language model (VLM) benchmark that transforms unstructured multimodal data from Shopify’s extensive product catalog into structured metadata;
● An upgraded single-shot object-detection benchmark for edge scenarios based on Ultralytics’ YOLOv11 Large model.

“This is the most significant revision of the Inference benchmark suite that we’ve ever done,” said Frank Han, Technical Staff, Systems Development Engineering at Dell Technologies and MLPerf Inference Working Group Co-chair. “The decision to update so many benchmarks in this round was prompted by the extraordinary enthusiasm and collaboration from our members, who contributed an unprecedented amount of engineering effort and IP toward building new inference benchmarks. Adding these new tests allows MLPerf Inference to keep up with the breakneck pace of evolution in AI models and techniques, so that our benchmarks remain relevant and representative of real-world deployments.”

The open-source MLPerf Inference benchmark suite measures system performance in an architecture-neutral, representative, and reproducible manner. The goal is to create a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. The published results provide critical technical information for customers who are procuring and tuning AI systems. 

“We thank Meta, Shopify and Ultralytics for their substantial collaboration with us in making these changes to the MLPerf Inference benchmark suite and for contributing their datasets, task definitions and expertise,” said Miro Hodak, Senior Member of Technical Staff at AMD and MLPerf Inference Working Group Co-chair. “These partnerships were essential in ensuring that the tests include scenarios and workloads that represent the current state of the industry.”

"MLPerf Inference benchmarks play a vital role in driving transparency and accountability across the AI industry,” said Glenn Jocher, CEO & Founder of Ultralytics. “At Ultralytics, rigorous, reproducible benchmarking is central to how we develop and validate our Ultralytics YOLO models — ensuring developers and organizations can make informed decisions about real-world performance. We're proud to be part of an ecosystem that holds the entire field to a higher standard."

"Commerce is one of the most complex domains in AI, yet researchers rarely have data that reflects that complexity," said Kshetrajna Raghavan, Principal Engineer, Applied ML at Shopify. "Shopify is uniquely positioned to address this, sitting at the intersection of millions of merchants and billions of products. Sharing this taxonomy allows the whole field to evolve."

New tools for submitters and consumers

With Inference 6.0, submitters have the option to use a newly available harness to complete benchmark tests. The new system, LoadGen++, allows LLMs to run with a serving-style software stack, mirroring typical production deployments today. “LoadGen++ is a significant upgrade from its predecessor, and represents an important investment by MLCommons that will allow us to stay nimble as we continue to produce benchmark tests that track the state of the art,” said Han.

In addition, the Inference 6.0 results can be viewed in a new online dashboard (https://mlcommons.org/visualizer) on the MLCommons site. The dashboard brings new levels of interactivity to viewing results, including advanced filtering and customized performance graphs.

Large-scale, multi-node systems gaining attention

The submissions to Inference 6.0 demonstrate that technology providers want to showcase the performance of scaled-up, multi-node systems running real-world inference workloads. This round recorded a new high for multi-node system submissions, a 30% increase over the Inference v5.1 round six months ago. Moreover, 10% of all submitted systems in Inference 6.0 had more than ten nodes, compared to only 2% in the previous round. The largest system submitted in Inference 6.0 featured 72 nodes and 288 accelerators, quadrupling the node count of the largest system in the previous round.

“As more AI applications have moved into production and wide availability, the demand for large-scale, high-performance systems to run them has grown,” said Hodak. “At the same time, multi-node systems bring a unique set of technical challenges beyond those of single-node systems, requiring configuration and optimization of system architectures, network interconnects, data storage, and software layers. Stakeholders are eagerly stepping up to meet these challenges and run inference workloads at scale.”

The AI community continues to embrace and invest in MLPerf Inference

The MLPerf Inference 6.0 benchmark received submissions from a total of 24 participating organizations: AMD, ASUSTeK, Cisco, CoreWeave, Dell, GATEOverflow, GigaComputing, Google, Hewlett Packard Enterprise, Intel, Inventec Corporation, KRAI, Lambda, Lenovo, MangoBoost, MiTAC, Nebius, Netweb Technologies India Limited, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Stevens Institute of Technology, and Supermicro. 

“I would like to welcome our first-time submitters, Inventec Corporation, Netweb Technologies India Limited, and Stevens Institute of Technology,” said Han. “The AI ecosystem is large and diverse, and it continues to grow and evolve rapidly. On behalf of MLCommons, I want to also thank our members, our contributors, and our partners including Meta, Shopify and Ultralytics for collaborating with us to build, and shepherd forward, the most comprehensive and relevant performance benchmark suite for AI inference. Together, we are ensuring that stakeholders in our community have valuable, real-world information that helps them to make better decisions.”

View the results

To view the results for MLPerf Inference v6.0, please visit the benchmark results dashboard https://mlcommons.org/visualizer.

About MLCommons

MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 130 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly grew into a set of industry metrics for measuring machine learning performance and promoting transparency in machine learning techniques. Since then, MLCommons has continued to use collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve the accuracy, safety, speed, and efficiency of AI technologies.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or email participation@mlcommons.org.

Press Inquiries: Contact press@mlcommons.org
