2,485 Detailed Items Reflecting Real-World Work Environments
Simultaneous Comparison of Up to Five Models, Support for 12 Languages
Data Samples and Leaderboard Released on Global Open-Source Platform

On September 25, Samsung Electronics unveiled a new evaluation metric, TRUEBench, which quantifies how much artificial intelligence (AI) actually helps in real-world work environments.


TRUEBench is an evaluation standard developed by Samsung Research, the advanced research and development organization of Samsung Electronics' DX Division, based on in-house experience with generative AI. While previous evaluations focused on simple Q&A tasks primarily in English, TRUEBench is designed to assess a wide range of practical office tasks, including document summarization and translation, data analysis, and multi-turn conversation. It consists of 10 categories, 46 work types, and 2,485 detailed items.


Samsung Electronics explained that up to five models can be compared at once, and that the results report not only the accuracy of answers but also their length and efficiency, enabling more granular evaluation. A total of 12 languages are supported, including English, Korean, Japanese, Chinese, and Spanish, allowing users to check cross-translation performance across different language combinations.


Additionally, Samsung Electronics released TRUEBench data samples and a leaderboard on the global open-source platform Hugging Face. The evaluation process uses a cross-verification method in which AI re-examines the evaluation criteria written by humans, which the company says enhances objectivity and consistency.



Jeon Kyunghoon, Head of Samsung Research and CTO of the DX Division, stated, "Based on a variety of real-world applications, we have secured differentiated productivity AI technology and expertise. By releasing TRUEBench, we aim to set a global productivity evaluation standard and strengthen our technological leadership."


This content was produced with the assistance of AI translation services.

© The Asia Business Daily(www.asiae.co.kr). All rights reserved.
