"170 Types of AI Training Data on Dialects, Autonomous Driving, and Cancer Diagnosis Released... Accelerating AI Innovation"

Published 18 Jun. 2021 10:02 (KST)

Updated 15 Mar. 2023 19:35 (KST)


[Asia Economy Reporter Eunmo Koo] Korean dialect speech data will be made available to enable artificial intelligence (AI) voice assistants to accurately recognize regional dialects such as those from Gyeongsang-do, Jeolla-do, and Jeju-do. In addition to sports motion data useful for correcting golf swings and lane recognition videos essential for autonomous driving technology development, medical imaging data that can improve the diagnostic accuracy of dementia, cancer, and skin diseases will also be released. As high-quality training data flows into the industry, it is expected to accelerate innovation in the domestic AI industry.

On the 18th, the Ministry of Science and ICT and the National Information Society Agency (NIA) announced that 170 types of AI training data will be sequentially opened from today through the AI integrated platform ‘AI Hub.’ The AI training data construction project, promoted as part of the ‘Data Dam’ initiative, a key project of the Digital New Deal, is a government-led effort to build large-scale data necessary to improve AI performance and make it accessible to anyone.

This release significantly increases the volume of training data available for use by the AI industry. By field, the data includes 39 types of speech and natural language (including Korean dialects), 32 types of healthcare data (such as cancer diagnostic images), 21 types of autonomous driving data (road driving videos), 15 types of vision data (sports motion videos), 12 types of land and environment data (forest species images), 14 types of agriculture, livestock, and fisheries data (livestock behavior videos), 19 types of safety data (images of aging facilities), and 18 types of other data (fashion product images), totaling 170 types (480 million cases) across eight fields.

Notably, the release includes large-scale Korean speech data, domestic road driving video data, and medical imaging data for major cancer and disease diagnosis, which are difficult to build on a large scale in the private sector. Until now, the industry mainly relied on overseas open data, which often failed to properly reflect the Korean language or domestic conditions, limiting AI development. The Ministry of Science and ICT stated, “Most of the AI training data construction process involves repetitive manual work, incurring significant time and costs. It has been difficult for small and medium-sized enterprises, startups, and even large companies to build large-scale data independently.”

The government plans to open 60 types initially and sequentially release the rest by the end of June. Healthcare data (27 types) and 59 other datasets that may contain personal and sensitive information will be opened after final verification on the 30th. A total of 674 companies and institutions participated in building the data, including major domestic AI and data specialist companies, 48 leading universities such as Seoul National University and KAIST, and 25 hospitals including Seoul National University Hospital and Asan Medical Center.

Ko Yoon-seok, Director of the Intelligent Data Division at NIA, explained, “Since AI companies and researchers find it difficult to individually produce training data due to time and cost constraints, the construction focuses on core data with high industry demand to reduce the burden on the industry and promote AI industrial development.”

With the industry’s thirst for data partially quenched, the release is expected to aid innovation in the AI industry. For example, regional dialect speech data will help voice-based AI services that have struggled to recognize dialects, thereby increasing public satisfaction. The newly added autonomous driving data includes not only domestic road driving videos but also videos for recognizing parking obstacles and moving objects and videos of driving along bus routes, which is expected to accelerate autonomous vehicle development. An industry insider said, “Including various objects such as obstacles, special lanes, and potholes that are difficult to collect independently will greatly help autonomous driving technology development.”

Song Kyung-hee, AI Policy Officer at the Ministry of Science and ICT, said, “The opened data will contribute to improving the validity of AI models for companies that have struggled with performance improvement due to data shortages, and cases of system and service advancement through data utilization will continue to accumulate.”

The data released this time is also considered to be of higher quality than earlier releases, thanks to a full-cycle quality management system established in cooperation with quality management organizations such as the Telecommunications Technology Association (TTA). Since September last year, the Ministry of Science and ICT and NIA have operated a ‘Quality Advisory Committee’ of over 80 experts from industry, academia, and research across eight fields to build a professional quality management support system. Large companies, startups, universities, and research institutions also took part in usability reviews before release to ensure that data quality meets actual user demands.

After the release, the Ministry of Science and ICT and NIA plan to keep improving the data through public-private cooperation, actively reflecting user feedback during a three-month intensive improvement period with user participation that runs until the end of September. Additionally, the ‘Artificial Intelligence Data Utilization Council’ will be launched today. Composed mainly of TTA and the companies and institutions that took part in the usability review of the 170 datasets, the council aims to actively utilize AI Hub data, share and disseminate achievements, and cooperate on continuously improving data quality.

[Figure: User Participation-Based Quality Improvement System for AI Training Data (Draft)]

Minister Lim Hye-sook of the Ministry of Science and ICT emphasized, “Just as dam water seeps into the land and makes flowers bloom everywhere, we hope that the data released this time will be widely used across industries and bear the fruits of innovation,” adding, “The government will continue to provide high-quality AI training data and spare no support to create an environment where anyone can easily utilize data and share results.”