[AI Data Shortage Crisis] 'If Data Is Insufficient, Make It Yourself?'... Synthetic Data Gaining Attention
AI Boom Pushes Data Demand Beyond Supply
Growing Interest in Synthetic Data Created Virtually
Critics Also Warn of Performance Decline and Loss of Diversity
As concerns grow that the supply of data needed for artificial intelligence (AI) training may hit a limit, artificially generated synthetic data is gaining attention. The approach trains AI on fabricated data rather than collected data, but critics warn that it could degrade model performance.
According to the '2023 Data Industry Status Survey Report' released last month by the Korea Data Industry Promotion Agency, the domestic data industry market was estimated at 27.1513 trillion KRW last year, up 4.6% from the previous year. In 2018, the market size was about 15.5684 trillion KRW, meaning it has grown by more than 11.5 trillion KRW in five years. The domestic data industry market is expected to grow at an average annual rate of 12.6%, with the market size projected to approach 51.1413 trillion KRW by 2028. The global market research firm 360iResearch forecast that the worldwide market for training datasets used in AI model development will grow by more than 26% annually.
Interest in synthetic data appears to reflect concerns that supply may not keep up with data demand.
Synthetic data is virtual data created for AI training and is broadly divided into 'partially' and 'fully' synthetic data. Partially synthetic data is created by replacing parts of real data with synthetic information, which makes it useful for protecting sensitive information.
Fully synthetic data is generated entirely from scratch. Although fictitious, it can be built to share the statistical properties of real data, so analyses of it can yield conclusions similar to those drawn from actual data.
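To make the idea of fully synthetic data concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the method of any company or agency named in this article: the "real" records are themselves simulated stand-ins, the two columns (age, income) are hypothetical, and the generator assumes the data follow a simple two-variable normal distribution. The functions `fit_gaussian` and `sample_synthetic` are names invented for this example.

```python
import math
import random
import statistics


def fit_gaussian(pairs):
    """Estimate means, standard deviations and correlation of (x, y) records."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs) / (len(pairs) - 1)
    rho = cov / (sx * sy)
    return mx, my, sx, sy, rho


def sample_synthetic(params, n, rng):
    """Draw n brand-new records from the fitted two-variable normal model."""
    mx, my, sx, sy, rho = params
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 and z2 so that y keeps the fitted correlation with x.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        records.append((x, y))
    return records


rng = random.Random(0)

# Assumption: simulated stand-in for a real, sensitive dataset of (age, income).
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    income = 1000 + 50 * age + rng.gauss(0, 300)
    real.append((age, income))

real_params = fit_gaussian(real)
synthetic = sample_synthetic(real_params, 2000, rng)
synth_params = fit_gaussian(synthetic)

print("real      (mean age, mean income, corr):",
      round(real_params[0], 1), round(real_params[1], 1), round(real_params[4], 2))
print("synthetic (mean age, mean income, corr):",
      round(synth_params[0], 1), round(synth_params[1], 1), round(synth_params[4], 2))
```

No synthetic record is copied from the original records, yet the averages and the age-income correlation come out close to the originals. The sketch also hints at the quality concern raised later in the article: a generator reproduces only the properties its model captures, so anything the simple normal model misses, such as outliers or rare cases, is absent from the synthetic data.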
Proponents of synthetic data highlight the ability to generate unlimited data as needed. They emphasize that it can supply data in sensitive fields such as finance and healthcare, where protecting personal information is critical. The global market research firm Gartner predicted that by 2030, the proportion of synthetic data used for AI training will surpass that of real data. For example, the use of synthetic data is increasing in autonomous driving model development. This is because it is difficult to secure actual traffic accident data, and synthetic data can also enable 3D rendering.
Hwang Min-young, Vice President of SelectStar, a domestic AI data startup, said, "As data that can be collected by conventional methods gradually depletes, reliance on synthetic data is expected to increase."
Because synthetic data is artificially created, it also draws criticism. Since it is not real, quality issues may arise. Moreover, poorly designed synthetic data used for AI training is highly likely to fail to reflect reality accurately. If erroneous data is reproduced and used in the AI field, it can lead to performance degradation, distortion, and hallucination, in which AI models give inaccurate answers.
Kim Myung-joo, President of the International Artificial Intelligence Ethics Association and Director of the Barun AI Research Center at Seoul Women’s University, explained, "There are experimental results showing that when next-generation AI models use synthetic data created by AI, their performance may decline compared to before," adding, "If AI models using synthetic data dominate, diversity may be lost." Kim further emphasized, "There is also a need for vigilance regarding the possibility that AI could homogenize human civilization."
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.